It starts with you - an engineer driven to build resilient, automated infrastructure that enables teams to move fast with confidence. You care about operational excellence, developer experience, and reliability at scale. You'll architect and operate the compute and networking infrastructure that powers our AI platform - from CI/CD pipelines to Kubernetes clusters to observability systems - across cloud and on-prem environments.
If you want to build infrastructure that powers mission-critical AI systems at national scale, join our mission - this role is for you.
Responsibilities:
Architect and operate Kubernetes-based infrastructure across AWS and on-prem environments, ensuring high availability, security, and performance.
Design and maintain CI/CD pipelines for application and service deployments with automated testing, security scanning, and rollback capabilities.
Drive infrastructure-as-code practices for compute and networking - building reproducible, auditable, and version-controlled infrastructure.
Own reliability and incident response - establish SLOs, build alerting systems, lead incident resolution, and drive post-incident improvements.
Enable AI-native operations - support agentic deployment pipelines, self-healing infrastructure, and secure sandboxing for model experimentation.
Build and maintain observability systems - metrics, logging, tracing, and dashboards that provide visibility into system health.
Optimize infrastructure cost and performance - right-size resources, implement auto-scaling, and identify efficiency opportunities.
Collaborate with Engineering, Data Platform, Data Engineering, and Security teams to align infrastructure with platform needs.
Shape infrastructure characteristics that support data freshness, correctness, and low-latency pathways for AI training/inference, retrieval, and agentic workflows.
Contribute paved-road tooling - reusable CI/CD patterns for services, IaC modules for compute and networking, and runbooks - that streamline delivery across teams.
Partner with Product, AI/ML, Data Science, and Analytics teams to anticipate and meet cross-functional needs.
Requirements:
6+ years in DevOps, SRE, or infrastructure engineering, with hands-on experience building and operating infrastructure at scale.
Container orchestration - Kubernetes (EKS, on-prem), Helm, service mesh technologies like Istio or Linkerd
Cloud & infrastructure - AWS services (EC2, EKS, S3, IAM, VPC, Lambda), hybrid cloud architectures, on-prem infrastructure
Infrastructure-as-Code - Terraform, Pulumi, or CloudFormation; GitOps practices with ArgoCD or Flux
CI/CD - GitHub Actions, GitLab CI, Jenkins, or similar; artifact management, deployment strategies (blue-green, canary)
Observability - Prometheus, Grafana, ELK/OpenSearch, Datadog, or similar; distributed tracing, log aggregation, alerting
Security & compliance - secrets management (Vault, AWS Secrets Manager), network security, compliance automation
Scripting & automation - Python, Bash, Go; configuration management with Ansible or similar
This position is open to all candidates.