Our team at the Computing Network Innovation Lab is looking for exceptional talent to join us and lead the development of next-generation data centers.
We create cutting-edge technologies that combine software and hardware to accelerate compute, storage, and networking at large scale. We aim to drive innovation and deliver software-defined infrastructure and algorithms for HPC, AI/ML, and Big Data applications.
We are looking for outstanding candidates with hands-on experience in the development and optimization of AI frameworks. If you are a team player with excellent communication skills and the motivation to revolutionize application performance, you're welcome on board!
What will you be doing?
Work as part of an innovative research team to analyze, develop, test, and deploy improvements that enhance Huawei's distributed AI framework.
Develop optimizations that leverage hardware accelerator capabilities, minimize communication overhead and improve training/inference throughput
Research state-of-the-art distributed AI training and inference algorithms (e.g. FSDP, DDP) to develop accessible model sharding capabilities
Profile different distributed AI training strategies, compare parallelization methods, and identify the main bottlenecks to be optimized on the computation and network communication levels.
Work in a distributed computing environment to optimize for both scale-up (multi-device) and scale-out (multi-node) systems
Utilize advanced concepts such as Uncertainty Quantification, Mixed Precision Computing and Model Sparsity to improve performance and enable training of very large AI models
Collaborate with partners from top universities and open-source communities to conduct state-of-the-art research.
Requirements:
B.Sc. degree in computer science, computer engineering, or a closely related field
Excellent C/C++ programming and software design skills, including debugging, performance analysis, and testing
Strong technical skills and experience with developing code in a Linux environment
Excellent teamwork and interpersonal skills
Ability to work independently, define project goals and scope, and lead your own development effort
Innovative thinking
Ways to stand out from the crowd:
M.Sc. or Ph.D. degree
Proven track record of conducting and publishing independent research
Experience in optimizing distributed deep learning pipelines with TensorFlow / PyTorch
Experience in analyzing workloads on large-scale heterogeneous clusters
Hands-on experience in developing code to target heterogeneous architectures (e.g. CPU/GPU/TPU)
Experience in developing and contributing to large open-source libraries.
This position is open to all candidates.