We're looking for a Site Reliability Engineer (SRE) to enhance the reliability, performance, and scalability of our production infrastructure. This role goes beyond keeping systems running: you'll be a key player in shaping the culture of reliability, driving self-healing mechanisms, proactive alerting strategies, and automation to reduce toil and improve operational efficiency. You'll work closely with engineering teams to ensure high availability, observability, and smooth incident management processes.
Responsibilities
Ensure reliability & scalability of our production environment across multiple cloud providers.
Define and implement SRE best practices, fostering a culture of ownership, continuous improvement, and automation.
Automate everything, from infrastructure deployment to self-healing mechanisms that eliminate manual intervention.
Design and improve observability solutions (monitoring, logging, tracing) to enable faster detection and resolution of issues.
Optimize alerting strategies to ensure actionable, high-quality alerts while minimizing noise and fatigue.
Improve system resilience, driving chaos engineering, failover strategies, and automatic recovery processes.
Enhance incident response processes, including on-call strategies, root cause analysis, and post-mortems to drive long-term stability.
Collaborate with development teams to build reliable, scalable, and efficient architectures, ensuring seamless deployment and rollback processes.
Promote a culture of reliability, educating teams on best practices, service ownership, and production-readiness.
Requirements
3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Strong expertise in Kubernetes and container orchestration in production.
Hands-on experience with cloud platforms (AWS, Azure, or GCP).
Proven experience with monitoring & observability tools (Prometheus, ELK, Grafana, Coralogix, etc.).
Strong scripting/programming skills (Python, Go, Bash, or similar).
Experience with Infrastructure as Code (IaC): Terraform, Helm, or similar tools.
Track record of improving system reliability, scalability, and performance.
Experience designing and implementing self-healing mechanisms to minimize human intervention.
Ability to foster a strong reliability culture across engineering teams, leading by example.
Excellent problem-solving skills, with a proactive and ownership-driven mindset.
This position is open to all candidates.