We are seeking a Senior AI Researcher to join our R&D group and lead the frontier of large-scale LLM optimization. You will focus on maximizing the performance, scalability, and efficiency of LLM training and inference across massive GPU clusters, bridging deep learning research, distributed systems design, and hardware-aware optimization.
At our company, we treat AI performance as a systems problem. Just as we reinvented networking through disaggregation and software-defined scale, we're applying the same philosophy to AI infrastructure. Your work will directly influence how large models are deployed, scaled, and optimized across high-density compute environments.
Responsibilities
Research, design, and implement new optimization strategies for large-scale LLM training and inference (e.g., tensor/pipeline/expert parallelism, quantization, prefill/decode disaggregation, GPU communication optimization).
Profile distributed training and inference pipelines to identify algorithmic, memory, and scheduling inefficiencies.
Collaborate closely with systems, compiler, and infrastructure teams to co-design efficient communication topologies, memory management, and runtime scheduling.
Develop internal tools for large-scale LLM benchmarking, profiling, and automatic tuning.
Validate research through measurable impact: higher throughput, better FLOPs utilization, improved convergence efficiency, or reduced compute cost.
Present research and engineering results to internal and external technical audiences.
Requirements
Deep understanding of deep learning internals: transformer architectures, distributed training paradigms, precision scaling, and optimizer behavior.
Proven hands-on experience training or deploying LLMs on multi-GPU or multi-node clusters.
Strong grasp of parallel and distributed systems principles, including communication collectives, load balancing, and scaling bottlenecks.
Proficiency with frameworks such as DeepSpeed, Megatron-LM, NeMo, vLLM, SGLang, or equivalent large-scale training ecosystems.
Demonstrated ability to translate theoretical optimization ideas into practical, production-level performance improvements.
Nice to Have
Understanding of CUDA, Triton, or low-level GPU kernel development, and experience profiling large models across multi-node GPU systems.
Experience co-designing algorithms and hardware (NVIDIA, AMD, TPU, or custom accelerators).
Research or open-source contributions in distributed ML systems, model compression, or systems for ML.
Exposure to energy or cost-aware optimization techniques in large-scale training environments.
This position is open to all candidates.