Senior ML Engineer - Applied AI Engineering Group
The Dream Job
It starts with you - an engineer driven to build the ML platform that turns research into reliable, production-grade intelligence. You care about reproducibility, low-friction experimentation, and infrastructure that earns the trust of the scientists and researchers who depend on it daily. You'll architect and ship our ML platform - training pipelines, model serving, feature stores, experiment tracking, and compute orchestration - turning models into production capabilities across cloud and on-prem, including air-gapped deployments. A significant part of the platform supports large language models, with unique challenges across training, evaluation, and inference in mission-critical environments.
If you want to make a meaningful impact - joining our mission to build the ML platform that drives Sovereign AI products - this role is for you.
The Dream-Maker Responsibilities
Build and operate ML training infrastructure - distributed training pipelines, compute scheduling, and reproducible experiment workflows that data scientists rely on daily.
Own model serving and inference systems - packaging, deployment, autoscaling, A/B testing, canary rollouts, and latency/cost optimization for production models.
Run feature stores, model registries, and dataset versioning - enabling self-serve feature engineering, model lineage, and reproducible experiments across teams.
Build experiment tracking and evaluation infrastructure - automated evals, comparison dashboards, drift detection, and monitoring that give teams visibility into model behavior and performance.
Build and maintain production pipelines for training, fine-tuning, and serving domain models - owning reliability, reproducibility, and scale.
Build and maintain the monitoring and observability layer - model performance tracking, data and prediction drift detection, data quality validation, and alerting.
Improve performance and cost across the ML stack - training throughput, inference latency, batch vs. real-time tradeoffs, and compute cost management.
Ship shared tooling - libraries, templates, CI/CD for models, IaC, and runbooks - while collaborating across Data Platform, AI, Data Science, Engineering, and DevOps. Own architecture, documentation, and operations end-to-end.
Requirements:
5+ years in software engineering, with 2+ years focused on ML infrastructure, MLOps, or data-intensive systems
Engineering craft - Strong Python, distributed systems design, testing, secure coding, API design, CI/CD discipline, and production ownership.
ML platform & serving - Model serving frameworks (e.g., Triton, TorchServe, vLLM, Ray Serve); model packaging, deployment pipelines, and inference optimization
Training infrastructure - Distributed training pipelines (e.g., PyTorch, JAX); experiment orchestration and reproducibility
ML lifecycle tooling - Feature stores, model registries, experiment tracking (e.g., MLflow, Weights & Biases); dataset versioning and lineage
Data pipelines - Building training and inference data pipelines; familiarity with tools like Spark, Airflow/Dagster, and streaming ingestion
Comfortable with AI coding tools like Cursor, Claude Code, or Copilot
Nice to Have:
Experience operating in constrained environments - on-premise, private cloud, or air-gapped deployments
Hands-on experience with simulation environments, synthetic data generation, or reinforcement learning workflows
Platform & infra - Kubernetes, AWS, Terraform or similar IaC, CI/CD, observability, incident response
Hands-on data science or applied ML experience.
This position is open to all candidates.