Key Responsibilities:
Ensure critical systems meet uptime and performance SLAs (Service Level Agreements) and SLOs (Service Level Objectives).
Participate in on-call rotations, lead post-mortems, and drive root cause analysis.
Implement redundancy, failover, and high availability strategies to keep services running smoothly.
Build and maintain robust monitoring, alerting, and observability systems (e.g., Prometheus, Grafana, Datadog).
Ensure the security of infrastructure and pipelines by implementing best practices for access control, encryption, and vulnerability management.
Collaborate with DevOps/Dev teams to build, maintain, and improve CI/CD pipelines.
Have fun with a great team while tackling hard challenges.
Requirements:
5 years of experience designing, deploying, maintaining, and troubleshooting large-scale distributed systems.
Hands-on experience with infrastructure services such as caching systems, message queues, distributed storage, and load balancers.
Proven experience in building and maintaining monitoring solutions using tools like Prometheus, Grafana, or equivalent platforms.
5 years of hands-on experience with containerization technologies like Docker and orchestration tools like Kubernetes.
At least 3 years of experience working with cloud platforms.
Understanding of network security principles (e.g., segmentation, firewalls, VPNs, zero trust).
Familiarity with securing cloud resources (e.g., encryption, security groups, secrets management).
Cloud certifications - Advantage
Bachelor's degree (Computer Science, Computer Engineering, Data Science) - Advantage
This position is open to all candidates.