We're looking for a Senior Site Reliability Engineer (SRE) who combines strong infrastructure expertise with solid programming skills, and who can balance operational excellence with software development as we scale our platform.
This is an opportunity to build SRE processes from the ground up: new reliability pipelines, monitoring frameworks, and foundational practices that will scale with our rapid growth.
You'll lead our infrastructure and reliability efforts while writing code to automate, optimize, and enhance our systems. This role requires both deep technical expertise and the ability to mentor team members as we scale.
Stack: AWS, Python, EKS, K8s, Kafka, RabbitMQ, Pulumi, PostgreSQL, Databricks, GitHub Actions
Core Responsibilities:
Design and implement scalable, reliable infrastructure solutions on AWS using Infrastructure as Code (Terraform/Pulumi).
Build and maintain sophisticated CI/CD pipelines with GitOps methodologies.
Develop custom tooling and automation scripts in Python/Go/similar languages to improve operational efficiency.
Architect and implement comprehensive observability solutions (metrics, logging, tracing, alerting).
Define and track SLIs, SLOs, and error budgets to ensure system reliability.
Lead incident response, conduct thorough post-mortems, and drive systemic improvements.
Optimize cloud costs through data-driven analysis and architectural improvements.
Collaborate with development teams to improve application reliability and performance.
Mentor team members on SRE best practices and infrastructure design patterns.
Requirements:
5+ years of DevOps/SRE experience in production environments.
Solid programming skills in at least one language (Python, Go, Java, or similar), with the ability to write production-quality code.
Strong understanding of SRE principles: reliability engineering, capacity planning, chaos engineering.
Deep expertise with Kubernetes (EKS preferred) including operators, CRDs, and advanced networking.
Proven experience implementing Infrastructure as Code at scale.
Hands-on experience with observability stacks (Prometheus, Grafana, ELK, Datadog, or similar).
Experience with distributed systems concepts and troubleshooting.
Excellent problem-solving skills with a systematic approach to debugging.
Strong communication skills and the ability to work across teams.
What Sets You Apart:
You write code to solve operational problems, not just configure existing tools.
You think in systems and can identify root causes across complex architectures.
You're passionate about automation and eliminating toil.
You balance perfectionism with pragmatism to deliver reliable solutions quickly.
You stay current with cloud-native technologies and best practices.
You can translate technical concepts for various audiences.
This position is open to all candidates.