We are looking for an outstanding Research Team Lead with deep hands-on expertise in large-scale distributed systems, AI framework infrastructure, and performance optimization on custom accelerators. If you are a visionary technical leader, a skilled communicator, and a team builder who thrives at the frontier of systems and AI research, we would love to have you on board.
What Will You Be Doing?
Lead and grow a world-class team of researchers and engineers working on distributed AI infrastructure and systems software
Architect and own the software infrastructure enabling distributed training and inference on Huawei's custom accelerator hardware (e.g., Ascend NPU)
Drive research and development on communication libraries, runtime systems, memory management, and graph execution & synchronization at scale
Optimize end-to-end performance across large-scale clusters, covering both scale-up (multi-device) and scale-out (multi-node) configurations
Design and implement high-performance communication backends and collective operations (AllReduce, AllGather, Broadcast) for distributed training workloads
Collaborate cross-functionally with hardware architects, compiler teams, and framework engineers to co-design hardware-software solutions
Publish and present research findings at top international venues (NeurIPS, EuroSys, SC, MLSys, OSDI) and represent the team externally
Mentor engineers and researchers, conduct performance and growth reviews, and shape team culture and technical direction
Partner with top academic institutions and open-source communities to advance the state of the art in distributed AI systems
Requirements
B.Sc. or higher in Computer Science, Computer Engineering, Electrical Engineering, or a closely related field
8+ years of experience in systems software, distributed computing, or AI infrastructure, with 3+ years in a leadership or team lead role
Deep expertise in large-scale communication systems: collective communication, RDMA, network topology-aware routing, and bandwidth optimization
Hands-on experience building software infrastructure for distributed training on custom accelerators or heterogeneous hardware (GPU, NPU, TPU)
Strong knowledge of runtime systems: scheduling, execution graphs, kernel dispatch, synchronization primitives, and pipeline management
Experience with memory management at scale: activation checkpointing, tensor offloading, rematerialization, KV cache management
Proficiency in C/C++ and Python, with a focus on high-performance, production-quality code in Linux environments
Proven ability to define technical vision, lead multi-person projects end-to-end, and deliver results under research and engineering timelines
Excellent communication skills in English, including presenting confidently to international audiences, writing technical reports, and driving cross-team alignment
Strong collaborative mindset and experience working in globally distributed, multicultural teams
Ways to Stand Out From the Crowd
M.Sc. or Ph.D. in a relevant field, with a strong publication record at systems or ML venues (EuroSys, OSDI, SC, NeurIPS, MLSys, ISCA)
Hands-on experience with communication frameworks such as NCCL, MPI, HCCL, or UCX
Experience with compiler and graph optimization for AI workloads (XLA, TVM, Triton, or custom operator fusion)
Background in mixed-precision training, model parallelism (Tensor Parallelism, Pipeline Parallelism, Expert Parallelism), and large model co-design
Experience profiling and debugging performance bottlenecks on heterogeneous clusters using tools like Chrome tracing, Nsight, or custom profilers
This position is open to all candidates.