We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational maturity of this platform.
As a Senior Site Reliability Engineer Lead, you will be responsible for:
Leading and mentoring an experienced SRE team while defining and driving the reliability strategy of a production-grade cybersecurity SaaS platform.
Designing and evolving multi-cluster Kubernetes environments across cloud providers while owning availability, performance, and incident management processes.
Establishing and enforcing SLOs/SLAs, error budgets, and production standards.
Driving infrastructure as code and automation standards (Terraform, CI/CD) while improving observability, monitoring, and operational visibility across the system.
Performing and leading root cause analysis for complex production incidents.
Partnering with R&D, Security, and Product to align reliability with rapid delivery.
Shaping architectural decisions at the platform level.
Requirements:
Have 2+ years of experience leading or managing infrastructure/SRE teams.
Have solid hands-on Kubernetes production experience.
Have experience operating cloud environments (GCP, AWS, Azure, or similar) with a good understanding of reliability engineering principles (SLOs, SLAs, error budgets).
Have experience with infrastructure as code and automation (Terraform, Ansible).
Have software engineering experience with exceptional Linux and networking troubleshooting skills.
Have proven experience handling production incidents and conducting root cause analysis.
Have the ability to drive technical standards across teams.
Demonstrate clear and structured communication skills.
This position is open to all candidates.