We are seeking a skilled Site Reliability Engineer (SRE) to join our team and help build, maintain, and improve the reliability, scalability, and performance of our systems. As an SRE, you will be responsible for owning and evolving our observability tooling, using real-time insights to make data-driven decisions about system behavior and performance at runtime, and implementing automation to enhance our infrastructure. This role involves collaborating across teams to ensure a robust and efficient technology stack supporting mission-critical systems.
You will:
Proactively enhance system reliability, scalability, and performance through automation, monitoring, and capacity planning.
Develop and maintain observability systems, including distributed tracing, logging, and metrics platforms.
Establish and maintain organizational standards for monitoring, leveraging tools like Prometheus, Grafana, and OpenTelemetry.
Use observability tools to analyze runtime behavior and make data-driven decisions that improve system performance and reliability.
Partner with development teams to integrate reliability best practices into the software development lifecycle.
Manage infrastructure at scale on cloud services (AWS experience is an advantage) and on platforms like Kubernetes.
Optimize resource utilization to reduce costs while maintaining service quality.
Contribute to the development and adoption of AI-driven tools and practices for engineering and observability.
What success looks like:
You are a trusted technical leader within the organization, mentoring others and helping shape the evolution of our SRE and observability practices.
You reduce the frequency and impact of production incidents by building resilient systems and using observability insights to address issues before they escalate.
You significantly improve observability: key metrics, logs, and traces are consistently available, well instrumented, and actionable across all critical services, enabling fast, informed decisions and rapid resolution of issues.
You solve problems proactively: you identify and resolve systemic issues before they impact customers, and continuously refine SLOs and SLIs to reflect evolving business needs.
We are looking for:
At least 6 years of experience as an SRE or DevOps engineer.
Strong experience with observability tools such as OpenTelemetry, Grafana, Prometheus, and the ELK stack (Elasticsearch, Logstash, Kibana).
In-depth experience with cloud platforms, particularly AWS services such as EC2, S3, and RDS, and with CloudFormation or Terraform for infrastructure as code.
Strong experience working in Kubernetes environments, with a focus on Helm for deployment and configuration management.
Experience working with AI and LLM tools such as Cursor, Claude Code, or similar.
Proficiency in scripting and/or development languages such as Bash or Python.
Thorough understanding of CI/CD pipelines and automation tools.
Strong experience with automation tools like Terraform and/or Ansible, and a solid understanding of infrastructure as code.
Solid troubleshooting and debugging skills.
A team player with a strong can-do mentality.
This position is open to all candidates.