We are seeking a skilled Site Reliability Engineer (SRE) to join our team and help build, maintain, and improve the reliability, scalability, and performance of our systems. As an SRE, you will be responsible for owning observability tools, driving incident management processes, and implementing automation to enhance our infrastructure. This role involves collaborating across teams to ensure a robust and efficient technology stack supporting mission-critical systems.
You will:
Proactively enhance system reliability, scalability, and performance through automation, monitoring, and capacity planning.
Develop and maintain observability systems, including distributed tracing, logging, and metrics platforms.
Establish and maintain organizational standards for monitoring, leveraging tools like Prometheus, Grafana, and OpenTelemetry.
Drive incident management, root cause analysis, and continuous improvement initiatives.
Partner with development teams to integrate reliability best practices into the software development lifecycle.
Manage infrastructure at scale on cloud services (AWS experience is an advantage) and on container platforms such as Kubernetes or ECS.
Optimize resource utilization to reduce costs while maintaining service quality.
Requirements:
At least 5 years of experience as an SRE.
Strong experience with observability tools: proficiency with OpenTelemetry, Grafana, Prometheus, and the ELK stack (Elasticsearch, Logstash, Kibana).
Experience with cloud platforms: in-depth knowledge of AWS services, including EC2, S3, and RDS, plus CloudFormation or Terraform for infrastructure as code.
Proficiency in scripting and/or development languages like Bash or Python.
Thorough understanding of CI/CD pipelines and automation tools.
Understanding of infrastructure as code and strong experience with automation tools such as Terraform and/or Ansible.
Solid troubleshooting and debugging skills.
A team player with a strong can-do mentality.
This position is open to all candidates.