A high-tech career with PEAK: discover your next opportunity!

Site Reliability Engineer (SRE) – JB-1475

Tel Aviv | Hybrid | Full-Time

We’re looking for an SRE / Reliability Engineer to design and build reliability systems for multi-cluster Kubernetes environments. You’ll focus on self-healing infrastructure, failover automation, incident response workflows, and modern observability, and you’ll help define what reliability means for AI-driven cloud systems.

What you’ll do

  • Design reliability systems for multi-cluster Kubernetes environments
  • Build self-healing, failover, and incident-response automation using Argo Workflows + Temporal
  • Define, measure, and continuously improve SLOs, SLIs, and reliability metrics
  • Own observability operations with Prometheus, Grafana, Loki, Tempo
  • Implement incident playbooks and automation via ChatOps
  • Partner with developers to bake resilience and performance into applications and services

Requirements

  • Strong understanding of Kubernetes, container orchestration, and automation patterns
  • Familiarity with Terraform/Terragrunt and GitOps
  • Comfort with observability stacks: Prometheus, Grafana, Loki, Tempo
  • Proficiency in Python or Go for tooling and automation
  • Enthusiasm for applying AI + automation to reliability engineering