Your Career:
Own and continuously improve AWS production infrastructure for scalability, reliability, security, performance, and cost.
Run and evolve Kubernetes environments that support fast, safe product delivery.
Drive developer velocity and production safety through better CI/CD pipelines, release workflows, deployment visibility, and GitOps practices.
Improve observability and incident response by reducing alert noise and raising signal quality.
Design and ship AI-assisted operational agents that change how engineers work: triaging monitoring alerts, summarizing incidents, proposing fixes, onboarding new services, and handling questions and requests. This is a core part of the role, not a side project.
Build automation and self-service tooling that removes manual work from provisioning, monitoring, incident response, and developer workflows.
Analyze operational data across incidents, alerts, deployments, infra health, and cost to find reliability gaps, inefficiencies, and automation opportunities.
Partner with engineering, security, product, and leadership to remove bottlenecks and support safe production growth.
Evaluate and introduce new tools and AI-assisted approaches, balancing innovation with reliability, cost, and operational simplicity.
Your Impact:
You'll help scale production systems, improve deployment velocity and reliability, reduce operational overhead, and build automation and AI workflows that help engineering teams move faster and operate more efficiently.
This role is a strong fit for someone who enjoys ownership, collaboration, and operational innovation.
Your Experience:
4+ years operating production infrastructure in AWS.
Deep hands-on experience with Kubernetes, Helm, ArgoCD, Terraform, and CI/CD.
Strong experience with observability and alerting in Datadog or comparable platforms.
Solid grounding in Linux, networking, cloud security, and reliability best practices.
Strong scripting skills in Python and Bash.
Proven ability to own platform projects end-to-end, from design through production operation and ongoing improvement.
Strong troubleshooting across distributed systems, Kubernetes, CI/CD, and live incidents.
Collaborative mindset - comfortable working across engineering, security, product, and leadership.
Comfort in a fast-paced, high-ownership environment where priorities shift but production quality doesn't.
Genuine interest in applying AI, automation, and intelligent workflows to operational work.
Key Qualities:
Ownership-driven - You take responsibility for the systems you build and operate, from design through production support and continuous improvement.
Collaboration - You work effectively across engineering, security, product, and leadership to align priorities and drive shared outcomes.
Developer experience focus - You are committed to reducing friction for engineering teams through thoughtful automation, self-service workflows, and reliable internal tooling.
Innovation balanced with pragmatism - You actively explore new approaches, particularly in AI-assisted operations, while weighing them against reliability, maintainability, and operational simplicity.
Security mindset - You design and build with least privilege, auditability, and production safety as foundational principles rather than afterthoughts.
Clear communication - You articulate infrastructure, reliability, cost, and security tradeoffs precisely to both technical and non-technical stakeholders.
This position is open to all candidates.