Our ML platform group builds and operates the core infrastructure that powers large-scale AI workloads. We manage a massive, high-performance environment consisting of both multi-cloud clusters and on-prem bare-metal nodes optimized with AI accelerators. We are looking for a highly experienced senior SRE / Linux systems engineer who thrives on managing complex, low-level infrastructure. This isn't just a cloud-configuration role: you will be responsible for the health and performance of expensive, high-density hardware. You must be an expert at troubleshooting open-source systems and "living" inside Linux environments to ensure our AI clusters run at peak efficiency.
What will your job look like?
Build and maintain infrastructure for large-scale AI and HPC workloads across on-prem and cloud environments
Operate and enhance our multi-cloud, multi-cluster scheduling platform
Troubleshoot complex issues across the stack, from kernel-level tuning and drivers to networking, storage, and distributed-system bottlenecks
Ensure the reliability of critical platform services: queuing systems, time-series databases, and logging pipelines
Develop deeply integrated automation and tooling
Collaborate with ML engineers and IT engineers to optimize hardware utilization for data-intensive workloads
Drive best practices in system design, observability, and infrastructure-as-code
Requirements:
10+ years of hands-on experience in SRE, Linux administration, or systems engineering
Expert-level Linux knowledge: deep understanding of system internals, debugging, and performance tuning, and the ability to solve failures where hardware meets software
Kubernetes expertise: proven experience managing K8s at scale (both managed EKS and bare-metal deployments)
Distributed systems mastery: hands-on experience debugging and maintaining:
Queuing systems: RabbitMQ or similar
Metrics/observability stacks: Prometheus, Thanos, and Grafana, or similar
Logging: Elasticsearch or similar
Relational databases: PostgreSQL or similar
Infrastructure-as-code: proficiency with Terraform, Helm, and configuration management
Networking & scripting: strong fundamentals in networking and proficiency in Bash
Familiarity with GPU/accelerator scheduling and AI/ML pipelines
Experience with multi-cloud architectures and hybrid environments
Experience with workflow orchestration tools (e.g., Argo Workflows)
What we offer:
Impact: support the engineering that advances our AI and global transportation safety
Cutting-edge hardware: work with high-value, AI-optimized bare-metal clusters at massive scale
Technical depth: a highly technical environment focused on solving deep systems engineering challenges
Collaboration: work alongside elite ML, software, and systems engineers
We are changing the way we drive, from preventing accidents to semi- and fully autonomous vehicles. If you are an excellent, bright, hands-on person with a passion for making a difference, come lead the revolution!
This position is open to all candidates.