babelforce is a Berlin-born CCaaS platform reshaping how businesses deliver customer service at scale. We help enterprise teams improve phone experiences through smart automation – reducing wait times, streamlining everyday tasks, and giving support agents more time for real conversations. From scaling support to boosting retention, babelforce empowers organizations to build flexible, future-ready customer service – without compromise.
As a Senior Site Reliability Engineer at babelforce, you will play a pivotal role in the reliability, security, and efficiency of our multi-tenant CCaaS platform. You’ll own core pieces of our AWS + Kubernetes stack, drive SLO/SLA outcomes, and partner with product and engineering to turn operational excellence into a feature.
We are a dynamic, high-growth startup where geekiness is encouraged and curiosity is rewarded. You will thrive if you’re proactive, adaptable, an excellent communicator, and excited to keep learning and growing every day.
Tasks
- Designing, operating, and improving our Kubernetes (AWS EKS) production platform with a strong focus on high availability, performance, and cost efficiency.
- Owning GitOps workflows with ArgoCD, Helm, and Terraform across multiple AWS accounts and environments.
- Championing observability (metrics, logs, traces) and actionable alerting for services backed by AWS Aurora (MySQL/PostgreSQL), Redis, and NATS.
- Leading and evolving incident response: on-call participation, post-mortems, runbooks, and preventive engineering.
- Implementing security best practices and compliance-friendly processes.
- Owning backup/restore, disaster recovery, and region/AZ resilience strategies.
- Coaching engineers on cloud-native patterns, reliability mindset, and efficient use of the platform.
Requirements
Your profile
- Experience: 5–8+ years in SRE/Platform/Infra roles running production systems at scale on AWS and Kubernetes.
- IaC & GitOps: Expert with Terraform, Helm, ArgoCD; opinionated about reviewable, declarative, and repeatable delivery.
- Data layer: Practical ops knowledge of MySQL, PostgreSQL, Redis, and NATS.
- Observability: Hands-on with metrics/logs/traces stacks (Prometheus, Grafana Loki, Tempo or equivalents), alerting, and SLO management.
- Reliability practices: Incident response, blameless post-mortems, capacity planning, cost/performance trade-offs.
- Soft skills: Clear written/spoken communication, product sense, bias for automation, and collaborative problem-solving across teams.
Nice to have
- Programming: Golang for platform automation/tooling/operators.
- Security: Experience hardening Kubernetes and AWS.
- VoIP: Knowledge of real-time/latency-sensitive systems (Asterisk, Kamailio).
Benefits
- Flexibility: hybrid/remote options and flexible hours.
- Trust & transparency: open communication, ownership, and a collaborative, inclusive culture.
- Growth: fewer layers than a large corporation – learn fast, ship work that matters, and shape your path.
We are committed to building a diverse and inclusive team. We welcome talented, compassionate people of all backgrounds and believe inclusivity strengthens our work and product.
Join a diverse company where your impact is immediate and opportunities are everywhere. With fewer boundaries than a big corporation, you’ll develop new skills fast and shape your own career path as we scale.