We are looking for an experienced SRE Team Lead to drive reliability, observability, and automation practices across our private cloud infrastructure and operations. In this role, you will lead a team of site reliability engineers, own the engineering roadmap for monitoring and automation, and act as a key liaison between development, operations, and platform teams. You bring 3-4+ years of hands-on people management experience and a deep technical background in SRE or DevOps disciplines.
What will you do?
Leadership & Team Management
Lead, mentor, and grow a team of SREs, providing technical direction, career development guidance, and day-to-day management.
Own the team roadmap for reliability, observability, and automation initiatives - prioritizing work, removing blockers, and driving delivery.
Conduct regular 1:1s, performance reviews, and hiring processes to build and sustain a high-performing team.
Foster a culture of operational excellence, blameless post-mortems, and continuous improvement.
Act as an escalation point for complex incidents and reliability issues, leading post-incident reviews and ensuring follow-through on action items.
Automation & Infrastructure
Design, develop, and maintain automation tools to support infrastructure and operations teams at scale.
Manage pipelines and infrastructure workflows using Jenkins, Ansible, Python, and Bash.
Drive the adoption of infrastructure-as-code practices across the organization.
Collaborate with system engineers to improve scalability, performance, and fault tolerance of critical systems.
Monitoring & Observability
Build and extend monitoring and alerting systems using Grafana, the ELK (Elastic) stack, Zabbix, and custom scripts.
Implement and enforce observability best practices to ensure full visibility into systems, applications, and infrastructure.
Define and track SLIs, SLOs, and error budgets across key services.
Partner with development teams to embed observability earlier in the software development lifecycle.
Database & Platform Support
Support monitoring and infrastructure integration for databases including MongoDB and PostgreSQL.
Maintain documentation and champion knowledge sharing around automation, monitoring, and reliability practices.
What do you need?
Experience & Leadership
3-4+ years of experience in a people management or team lead capacity within SRE, DevOps, or infrastructure engineering.
5-8+ years of overall experience in SRE, DevOps, or infrastructure automation roles.
Proven track record of building, coaching, and retaining high-performing engineering teams.
Experience owning an engineering roadmap and driving cross-functional reliability initiatives.
Technical Skills
Strong scripting skills in Python and Bash; comfortable building and maintaining production-grade automation.
Hands-on experience with infrastructure automation tools, particularly Ansible.
Solid experience with monitoring and observability platforms - ELK stack, Grafana, and Zabbix.
Good understanding of CI/CD pipelines and related tooling, including Jenkins.
Familiarity with managing and monitoring MongoDB and PostgreSQL in a production environment.
Comfortable working in Linux-based environments.
Excellent problem-solving skills and strong written and verbal communication.
Ability to support the following:
Experience with cloud providers - AWS, GCP, or Azure.
Exposure to containerization technologies such as Docker and Kubernetes.
Familiarity with infrastructure provisioning using Terraform.
Experience introducing SRE practices (SLOs, error budgets, chaos engineering) at an organizational level.
Exposure to and experience with migrating or building AI tools to improve processes.
This position is open to all candidates.