Required Experienced Site Reliability Engineer
The goal of AI Engineering's ML-Platform team is to deliver modern infrastructure and solutions that enhance our algorithm development life cycle and shorten our delivery times. We are an independent group of excellent, experienced engineers with diverse skills in algorithms, software, and infrastructure. We strive to implement a DevOps culture that allows our engineers to collaborate easily on large-scale products. We develop cross-company products that enable the research and deployment of state-of-the-art algorithms.
What will your job look like?
We're looking for a Site Reliability Engineer (SRE) to design, build, and operate the infrastructure powering machine learning development.
You will work on large-scale, distributed Kubernetes clusters, both on-prem and in the cloud, collaborating with IT, cloud engineering, and algorithmic teams.
Owning and evolving our infrastructure-as-code (IaC).
Driving reliability, monitoring, and observability best practices.
Pushing software and hardware integration to its limits to maximize system performance.
Automating and scaling ML workflows with a reliability-first mindset.
Partnering with researchers and developers to improve performance, resilience, and delivery speed.
Participating in incident response, root cause analysis, and postmortems to continuously raise the bar.
This role is ideal for someone who thrives on solving complex infrastructure challenges and enjoys collaborating across disciplines to deliver impact.
Requirements:
Strong knowledge of Linux internals and system administration.
Proficiency in Bash scripting and coding in Python and/or Go.
Hands-on experience with Kubernetes and containerized environments.
Experience with at least one major cloud provider (AWS, GCP, Azure, etc.).
Familiarity with Infrastructure as Code (Terraform, etc.).
Solid understanding of networking, security, and distributed systems.
Experience with observability tools (Prometheus, Grafana, ELK, OpenTelemetry).
Strong communication, problem-solving, and collaboration skills.
A calm, reliability-first mindset under pressure, with a passion for scalable systems.
Advantages:
B.Sc. in Computer Science or a related field.
Contributions to open-source projects.
Experience with backend development & performance best practices.
Familiarity with databases, storage, and data pipeline reliability.
Enjoys exploring and stretching software and hardware systems to their full potential.
This position is open to all candidates.