We're seeking a Staff Site Reliability Engineer to join our team and take ownership of the automation, reliability, and scalability of our platform. You will play a key role in refining our service architecture, CI/CD pipelines, and multi-cloud deployments, ensuring that our systems remain secure, efficient, and resilient. This role is ideal for a hands-on problem-solver who thrives in a fast-paced environment and is comfortable with ambiguity. We're looking for someone who can take ownership of a problem and drive it to a solution.
This is a hybrid role based in Tel Aviv, offering a chance to work closely with our founders and a team of highly talented individuals. Together, you will help shape the vision of our product and company while solving challenging infrastructure problems at scale.
What you'll be doing
Design, build, and operate Okta's global production infrastructure
Support a highly available, large-scale multi-cloud environment as part of an on-call rotation
Automate workflows, deployments, and infrastructure processes
Design, implement, and optimize CI/CD pipelines for faster and more reliable delivery
Refine and maintain multi-tenant service architecture with strong security and scalability
Deploy and manage infrastructure using Kubernetes, Terraform, and other modern IaC tools
Build robust monitoring, logging, and alerting systems to ensure platform reliability
Write automation and maintenance scripts, primarily in Python and Golang
Collaborate with engineering teams to improve developer experience and delivery speed
Lead by example: respond swiftly and efficiently to production incidents, and drive team learning and process improvements
Requirements
7+ years of experience as a DevOps or Site Reliability Engineer on high-scale distributed systems in Linux environments
Extensive experience with operating customer-facing production services
Strong experience in application and infrastructure monitoring in multi-tenant setups
Hands-on expertise in CI/CD pipeline design and implementation
Deep understanding of Kubernetes, its components, and operational flows
Proficiency in Golang and/or Python scripting for automation and maintenance
Extensive experience with Terraform (or equivalent Infrastructure as Code tools)
Highly motivated, self-learning, and passionate about improving infrastructure and delivery processes
A proven track record of successful SRE engagements, working closely with engineering teams
A proven ability to operate with a high degree of autonomy, taking ownership of complex, open-ended problems and driving solutions from ambiguous requirements to a robust implementation
A strong history of leading and mentoring other engineers, elevating the technical capabilities and operational excellence of the team
Experience collaborating effectively with geographically distributed teams, navigating different time zones and communication challenges to ensure project success
This position is open to all candidates.