We’re looking for an SRE / Reliability Engineer to design and build reliability systems for multi-cluster Kubernetes environments. You’ll focus on self-healing infrastructure, failover automation, incident response workflows, and modern observability, and you’ll help define what reliability means for AI-driven cloud systems.
What you’ll do
- Design reliability systems for multi-cluster Kubernetes environments
- Build self-healing, failover, and incident-response automation using Argo Workflows and Temporal
- Define, measure, and continuously improve SLOs, SLIs, and reliability metrics
- Own observability operations with Prometheus, Grafana, Loki, and Tempo
- Implement incident playbooks and automation via ChatOps
- Partner with developers to bake resilience and performance into applications and services
Requirements
- Strong understanding of Kubernetes, container orchestration, and automation patterns
- Familiarity with Terraform/Terragrunt and GitOps
- Comfort with observability stacks: Prometheus, Grafana, Loki, and Tempo
- Proficiency in Python or Go for tooling and automation
- Enthusiasm for applying AI and automation to reliability engineering