You will join a global DevOps & Platform team responsible for designing and operating distributed production systems at very high scale, with a relentless focus on quality, robustness, and reliability. We provide our company with the tools it needs to push forward product innovation and scale in the age of AI.
You will own projects end-to-end, drive new implementations forward, and act as a technical partner across the organization. You'll work closely with the Platform teams, Development teams, Architects, AI/ML Researchers, and Product, translating complex needs into resilient systems that scale.
We're looking for someone who takes ownership: someone who can see a problem or a need, shape the solution, and carry it all the way to production.
The day-to-day
Lead initiatives end-to-end, from design through implementation to production rollout, owning the outcome, not just the task
Design, build, and operate highly available, distributed systems at scale on AWS and GCP, with a strong emphasis on quality and robustness
Work across the full platform stack: Kubernetes, Docker, and Helm; CI/CD and automation; and the data, caching, and messaging layers (Mongo, MySQL, Postgres, Redis, Kafka, and more)
Modernize and consolidate different stacks and architectures into our single, coherent platform
Build and operate the infrastructure behind our AI workloads, including GPU inference and the platforms that serve them
Drive system reliability, monitoring, and observability, and troubleshoot complex issues across application, infrastructure, networking, OS, and data layers
Collaborate across Platform, Development, Architects, AI/ML Researchers, and Product, and leverage modern AI coding tools to work at high velocity and quality
Requirements: Ideally, we're looking for:
2+ years of hands-on experience in DevOps / Platform / SRE roles, operating distributed production systems at high scale, with a proven ability to lead projects independently and drive new implementations end-to-end
Deep hands-on expertise with AWS in production (EKS, ECS, S3, EC2, CloudFront, IAM, Bedrock, MSK, Lambda, CloudWatch), including Infrastructure as Code (Terraform or similar)
Deep hands-on expertise with Kubernetes and Helm, including a strong understanding of how Kubernetes internals work and how to operate them in production, alongside Docker / containerized systems
Strong knowledge of networking and operating systems (TCP/IP, DNS, load balancing, Linux internals)
Strong experience with CI/CD and automation (GitHub Actions), and with monitoring and observability (metrics, tracing, and logging, using tools like Prometheus / Grafana)
High proficiency with current AI coding tools, including a deep understanding of how they work and how to apply them to drive real productivity
These would also be nice:
Experience integrating or migrating systems between different stacks or organizations
Experience driving cross-team technical initiatives
Experience working with GPUs for inference and serving AI/ML workloads
This position is open to all candidates.