We are looking for a Site Reliability Engineering Manager to lead our Israel SRE team.
In this role, you'll drive best practices in reliability engineering, ensuring the stability, availability, and performance of our SaaS services. You'll collaborate with global SRE leaders, refine processes, and foster a culture of accountability and continuous improvement.
As a Site Reliability Engineering Manager, you will:
Lead, mentor, and develop a high-performing Israel SRE team, fostering collaboration, innovation, and accountability
Ensure SaaS reliability, performance, and availability, meeting or exceeding service-level objectives
Drive SRE best practices, including capacity planning, incident management, chaos engineering, and disaster recovery
Implement proactive monitoring, alerting, and anomaly detection aligned with SaaS standards
Collaborate with P&E and Cloud engineering teams to embed reliability into the SDLC
Oversee incident management, ensuring swift identification, escalation, and resolution
Maintain comprehensive SRE documentation, including processes, incident reports, and system architecture
Evaluate and adopt tools, technologies, and methodologies to enhance uptime and reliability
Requirements:
3+ years of management experience leading an SRE, DevOps, or similar team in a SaaS environment
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
Strong expertise in cloud platforms (AWS, GCP, or Azure), containers (Kubernetes, Docker), and infrastructure as code and configuration management (Terraform, Ansible)
Proficiency in Python or Go for automation and system optimization, as well as GitOps experience with SCM tools (e.g., Git, Bitbucket)
Strong leadership, communication, and collaboration skills, with experience working across globally distributed teams
Familiarity with Agile methodologies, CI/CD pipelines, and orchestration tools (Jenkins, ArgoCD, StackStorm)
Familiarity with Chaos Engineering (e.g., Gremlin, Litmus, Chaos Toolkit)
Hands-on experience with alerting and observability tools (e.g., PagerDuty, OpsGenie, New Relic, Coralogix)
Strong understanding of scalability, high availability, and security best practices in cloud and Kubernetes environments
This position is open to all candidates.