We're looking for a NOC / Incident Manager to join our Production Operations team and play a key role in ensuring the stability and reliability of our systems. In this role, you will monitor production environments, detect and respond to incidents, and work closely with SREs and engineering teams to improve system resilience.
This is a hands-on role for someone who thrives in fast-paced environments, enjoys troubleshooting complex issues, and is passionate about reducing downtime and improving incident response processes.
Responsibilities:
Monitor production systems in real time to detect and respond to incidents.
Analyze and triage alerts, identifying root causes and escalating when necessary.
Manage live incidents, ensuring clear communication and timely resolution.
Document and improve incident response processes, including updating runbooks and playbooks.
Collaborate with SREs and developers to drive post-mortem analysis and implement long-term reliability improvements.
Reduce alert fatigue by tuning monitoring systems and ensuring alerts are actionable.
Participate in on-call rotations, ensuring 24/7 incident response coverage.
Proactively suggest improvements to monitoring, alerting, and automation strategies.
Requirements:
2+ years of experience in a NOC, Incident Management, or technical support role.
Experience with monitoring tools (Grafana, Prometheus, ELK, Datadog, New Relic, etc.).
Strong troubleshooting skills, with a structured approach to problem resolution.
Ability to analyze logs and metrics to identify root causes of incidents.
Excellent communication skills, with the ability to coordinate across teams.
Familiarity with cloud environments (AWS, Azure, GCP) and modern infrastructure concepts.
Ability to work under pressure, responding to incidents in a high-scale production environment.
Bonus Points:
Experience with incident automation tools and self-healing mechanisms.
Scripting skills (Bash, Python) to automate tasks and improve monitoring.
Familiarity with on-call management tools like PagerDuty or Opsgenie.
Understanding of SRE principles and site reliability best practices.
This position is open to all candidates.