We are looking for a Senior DevOps Engineer - ML Platform.
The goal of AI Engineering's ML-Platform team is to deliver modern infrastructure and solutions that enhance the algorithm development life cycle and shorten our delivery times. We are an independent group of excellent and experienced engineers with diverse skills in algorithms, software, and infrastructure. We strive to implement a DevOps culture that allows our engineers to collaborate easily on large-scale products.
We develop cross-company products that enable the research and deployment of state-of-the-art algorithms.
What will your job look like?
Build and maintain infrastructure for large‑scale AI and HPC workloads across on‑prem and cloud environments
Operate and enhance our multi‑cloud, multi‑cluster scheduling platform
Develop automation, tooling, and platform services using Bash
Troubleshoot complex issues across the stack: compute, networking, storage, orchestration, and distributed systems
Improve reliability of critical systems
Collaborate with ML, data, and backend teams to support evolving platform needs
Drive best practices in CI/CD, infrastructure-as-code, and system design
Participate in on‑call rotations for critical infrastructure components
Requirements:
10+ years of hands‑on experience in DevOps, SRE, systems engineering, or similar roles
Strong Linux knowledge, including debugging, performance tuning, and system internals
Proven experience working with HPC environments, large clusters, or high‑performance compute systems
Solid experience with Kubernetes (EKS or similar managed K8s services)
Knowledge of infrastructure‑as‑code tools (Terraform, Helm, etc.)
Hands‑on experience with:
PostgreSQL or similar relational databases
Elasticsearch or similar search/indexing systems
Prometheus/Thanos/Grafana or similar observability stacks
RabbitMQ or similar messaging systems
Strong proficiency in Bash, networking fundamentals, and debugging distributed systems
Experience investigating complex issues across compute, storage, networking, and orchestration layers
Advantages:
Experience with multi‑cloud architectures
Experience with workflow orchestration tools such as Argo Workflows (or similar systems like Airflow, Prefect, Flyte)
Familiarity with GPU scheduling, AI/ML pipelines, or data‑intensive workloads
Background in large‑scale distributed systems or platform engineering
Ability to write production‑quality Go (Golang) code
This position is open to all candidates.