We are looking for a NOC Team Lead.
The NOC Team Lead is responsible for ensuring the high availability, reliability, and performance of the company's production systems, infrastructure, and applications. This hybrid role bridges traditional reactive monitoring (Network Operations Center - NOC) with proactive, automated, and software-centric engineering practices (Site Reliability Engineering - SRE).
Key Responsibilities:
Operational Leadership: Directs 24/7 NOC and SRE teams, managing incident response, service uptime, and operational excellence.
Proactive Monitoring & Observability: Implements monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, Datadog) to track system health via golden signals (latency, traffic, errors, saturation).
Automation and Toil Reduction: Leads efforts to automate repetitive operational tasks and manual troubleshooting to improve system reliability and reduce human error.
Incident Management & Root Cause Analysis (RCA): Oversees the management of critical incidents, ensures timely communication, and performs post-mortem analysis to prevent recurrence.
Team Management & Development: Coaches, mentors, and develops NOC engineers and SREs, fostering a culture of high performance and continuous improvement.
Stakeholder Collaboration: Partners with engineering, development, and IT teams to align system performance with business goals and SLA requirements.
Requirements:
Experience: 5+ years managing technical teams within NOC, SRE, or infrastructure domains.
Technical Skills: Proficiency in cloud platforms (AWS, Azure, GCP), Kubernetes, Linux/Unix, and scripting languages (Python, Bash, Golang).
Tools: Experience with monitoring platforms (e.g., Datadog) and CI/CD tools (e.g., Jenkins, GitLab).
Soft Skills: Strong communication, leadership, and crisis management skills.
This position is open to all candidates.