Required ML Engineering Team Lead - Applied AI Engineering Group
The Dream Job
It starts with you - a technical leader driven to build both the ML platform and the engineering team behind it. You care about reliable infrastructure, great developer experience, and growing engineers through real ownership. You'll set the technical direction for our ML platform - training pipelines, model serving, feature stores, experiment tracking, and compute orchestration - shaping how models reach production across cloud and on-prem, including air-gapped deployments. A significant part of the platform supports large language models, with unique challenges across training, evaluation, and inference in mission-critical environments. You stay close enough to the codebase to debug production issues, unblock your engineers, and make sound architecture calls.
If you want to make a meaningful impact, join our mission, and lead the team that builds the ML platform driving Sovereign AI products, this role is for you.
The Dream-Maker Responsibilities
Set technical direction for the ML platform - training pipelines, model serving, feature stores, experiment tracking, and compute orchestration - through RFCs, prototypes, design reviews, and build-vs-buy decisions
Lead and grow a team of ML Engineers - hire, mentor, pair on hard problems, and raise the bar through code and design reviews
Contribute to critical systems, debug production issues, and maintain deep context on the codebase to inform technical decisions
Own operational excellence for model serving - set and enforce SLAs, run capacity planning, and keep compute costs predictable
Establish ML engineering standards - reproducible experiments, automated evals, model packaging, CI/CD for models, and observability
Support the full lifecycle of our models - from training on domain-specific data to low-latency inference powering production systems
Work closely with Data Platform, AI, Data Science, and Product teams - translate business priorities into engineering work and manage cross-team dependencies
Measure and improve developer experience - deploy friction, onboarding time, CI turnaround - as seriously as model performance
Requirements:
6+ years in software engineering, ML engineering, or platform engineering, with hands-on experience building and operating ML infrastructure at scale
2+ years leading an engineering team - hiring, mentoring, conducting design reviews, and shipping alongside your team
Engineering craft - Strong Python, distributed systems design, testing, secure coding, API design, CI/CD discipline, and production ownership
ML platform & serving - Model serving frameworks (e.g., Triton, TorchServe, vLLM, Ray Serve); model packaging, deployment pipelines, and inference optimization
Training infrastructure - Distributed training pipelines (e.g., PyTorch, JAX); experiment orchestration and reproducibility
ML lifecycle tooling - Feature stores, model registries, experiment tracking (e.g., MLflow, Weights & Biases); dataset versioning and lineage
Data pipelines - Building training and inference data pipelines; familiarity with tools like Spark, Airflow/Dagster, and streaming ingestion
Comfortable with AI coding tools like Cursor, Claude Code, or Copilot
Nice to Have:
Experience operating in constrained environments - on-premises, private cloud, or air-gapped deployments
Hands-on experience with simulation environments, synthetic data generation, or reinforcement learning workflows
Platform & infra - Kubernetes, AWS, Terraform or similar IaC, CI/CD, observability, incident response
Hands-on data science or applied ML experience
This position is open to all candidates.