We're growing and looking to hire an SRE Team Leader who embodies our core values: People First, Customer Obsession, Strive for Excellence, and Integrity.
Responsibilities
As an SRE Team Leader, your impact will be:
Site Reliability Engineering (SRE)
Production Gatekeeper: Design and enforce the rollout strategy for new technologies and oversee their execution to ensure minimal disruption to existing systems.
Production On-Call: Act as the first line of response for critical incidents, assessing and triaging issues and coordinating with the team to contain the impact and swiftly restore services.
Monitor Production Performance and Degradation: Keep a close eye on system performance metrics and detect any degradation early to prevent outages and disruptions.
Production Maintenance: Conduct regular infrastructure upgrades to accommodate changes, developments, and advancements in the technological landscape.
Manage Release Flow: Oversee the release of updates and new functionalities, ensuring a seamless transition while handling any potential negative impacts on production.
Staging Management: Oversee the management of the staging environment, ensuring that it accurately represents the production environment for effective testing and simulation.
Network Operations Center (NOC)
Build Playbooks: Develop and maintain comprehensive playbooks for managing system issues and incidents, setting guidelines for troubleshooting, escalation, and resolution processes.
Build Monitoring Dashboards: Design, set up, and maintain monitoring dashboards to visualize and track system performance and incidents in real time.
Alerts and Incident Management: Establish protocols for issuing alerts in the event of system issues or anomalies and lead the team in incident resolution.
Requirements: What do you need to succeed in this role?
Proven experience in SRE/DevOps roles (NOC experience - advantage), including team management experience
Strong leadership qualities and team management skills
Tech stack: Jenkins, Terraform, Ansible, Bash, Python, AWS, Argo
Expertise in system monitoring and incident management tools
Exceptional problem-solving and analytical skills
Excellent written and verbal communication abilities
A Bachelor's degree in Computer Science, Information Technology, or a related field - advantage
Familiarity with Agile methodologies
This position is open to all candidates.