It starts with you - a technical leader who's passionate about building resilient, automated infrastructure and growing high-performing teams. You care about operational excellence, developer experience, and enabling AI-driven teams to move fast with confidence. You'll lead the DevOps team in architecting and operating the compute and networking infrastructure that powers our AI platform - from CI/CD pipelines to Kubernetes clusters to observability systems.
If you want to lead a team that builds the infrastructure foundation for mission-critical AI systems, join mission - this role is for you.
Responsibilities:
Lead and grow the DevOps team - hiring, mentoring, and developing engineers while fostering a culture of ownership and continuous improvement.
Define compute and networking infrastructure strategy across cloud and on-prem environments; drive architectural decisions that balance reliability, security, cost, and velocity.
Own the platform's deployment, scaling, and operational posture - ensuring systems meet demanding SLAs for government and national-scale customers.
Build and evolve CI/CD pipelines for application and service deployments with automated testing, security scanning, and rollback capabilities.
Drive infrastructure-as-code practices for compute, networking, and orchestration - ensuring reproducible, auditable, and version-controlled infrastructure across all environments.
Enable AI-native operations - support agentic deployment pipelines, self-healing infrastructure, and secure sandboxing for model experimentation.
Establish observability, alerting, and incident response practices that provide visibility into system health and enable fast recovery.
Partner with Engineering, Data Platform, Data Engineering, and Security teams to align infrastructure capabilities with platform needs.
Establish infrastructure characteristics (availability, latency, throughput) that enable data freshness, correctness, and low-latency pathways for AI training/inference, retrieval, and agentic workflows.
Ship paved-road developer tooling - shared templates, CI/CD workflows for services, IaC modules for compute and networking, and runbooks - to standardize best practices across engineering teams.
Collaborate with Engineering, Data Platform, Data Engineering, Security, Product, AI/ML, Data Science, and Analytics to align infrastructure capabilities with evolving platform and data product needs.
Requirements:
8+ years in DevOps, SRE, or infrastructure engineering, with 2+ years leading teams or technical functions. Hands-on experience building and operating infrastructure at scale.
Container orchestration - Kubernetes (EKS, on-prem), Helm, service mesh technologies like Istio or Linkerd
Cloud & infrastructure - AWS services (EC2, EKS, S3, IAM, VPC, Lambda), hybrid cloud architectures, on-prem infrastructure
Infrastructure-as-Code - Terraform, Pulumi, or CloudFormation; GitOps practices with ArgoCD or Flux
CI/CD - GitHub Actions, GitLab CI, Jenkins, or similar; artifact management, deployment strategies (blue-green, canary)
Observability - Prometheus, Grafana, ELK/OpenSearch, Datadog, or similar; distributed tracing, log aggregation, alerting
Security & compliance - secrets management (Vault, AWS Secrets Manager), network security, compliance automation, air-gapped environments
Scripting & automation - Python, Bash, Go; configuration management with Ansible or similar
This position is open to all candidates.