We are seeking a highly skilled and motivated Team Leader to build and lead a new team dedicated to developing orchestration tools and software solutions for AI datacenters.
The main goal of this team is to design and deliver customer-focused orchestration platforms that simplify the deployment, management, and monitoring of large-scale AI workloads.
This role combines technical leadership with hands-on development, covering the entire AI datacenter ecosystem, including switches, hosts, smart NICs, GPUs, ROCm, and RCCL. The team will primarily develop in Python, complemented by modern full-stack technologies for user interfaces and control systems.
Key Responsibilities:
Lead and mentor a team of engineers building orchestration tools that manage complex AI datacenter infrastructures.
Define the team's vision, roadmap, and architecture for orchestration solutions that enhance customer experience and operational efficiency.
Design and implement distributed control and orchestration systems using Python and full-stack frameworks.
Collaborate with networking, compute, and AI acceleration teams to integrate orchestration capabilities across all datacenter components (switches, NICs, GPUs, and software stacks).
Work closely with product, QA, and DevOps teams to identify customer requirements and translate them into scalable, production-grade orchestration platforms.
Ensure software reliability, scalability, and maintainability through strong design principles, testing, and CI/CD practices.
Foster a culture of innovation, technical excellence, and cross-functional collaboration.
Requirements:
5+ years of software development experience, including 2+ years in a team leadership or technical lead role.
Strong proficiency in Python for backend, orchestration, and systems integration.
Proven experience in designing and implementing orchestration or control-plane systems for datacenter or cloud environments.
Deep understanding of datacenter infrastructure: networking, compute, storage, or GPU acceleration.
Hands-on experience with containers and orchestration frameworks (Docker, Kubernetes, etc.) and with CI/CD pipelines.
Excellent problem-solving, leadership, and communication skills.
Preferred Qualifications:
Experience with AI workloads and GPU software stacks (ROCm, RCCL, PyTorch, TensorFlow).
Familiarity with control-plane architectures, distributed systems, or cluster management frameworks.
Background in telemetry, resource scheduling, or performance optimization for large-scale systems.
Knowledge of microservices, REST/gRPC APIs, and cloud-native architectures.
Practical experience with full-stack development (React, Angular, Node.js, or similar).
Experience with testing frameworks (pytest, Robot Framework, etc.).
This position is open to all candidates.