Senior Software Architect - Deep Learning and HPC Communications

8321599

27/08/2025

Senior HPC Performance Engineer

Confidential company

Location: Tel Aviv-Yafo and Yokne'am

Job Type: Full Time

We are leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.
Come work for the team that brought you NCCL, NVSHMEM & GPUDirect. Our GPU communication libraries are crucial for scaling Deep Learning and HPC applications! We are looking for a motivated Performance Engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! This is an outstanding opportunity for someone with an HPC and performance background to advance the state of the art in this space. Are you ready to contribute to the development of innovative technologies and help realize our company's vision?
What you will be doing:
Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack
Evaluate proof-of-concepts, conduct trade-off analysis when multiple solutions are available
Triage and root-cause performance issues reported by our customers
Collect a lot of performance data; build tools and infrastructure to visualize and analyze the information
Collaborate with a very dynamic team across multiple time zones.

Requirements:
M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field, with relevant performance engineering and HPC experience
3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
Experience conducting performance benchmarking and triage on large-scale HPC clusters
Good understanding of computer system architecture, HW-SW interactions and operating system principles (i.e., systems software fundamentals)
Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required
Ability to debug performance issues across the entire HW/SW stack. Proficient in a scripting language, preferably Python
Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)
Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and timezones
Ways to stand out from the crowd:
Practical experience with InfiniBand/Ethernet networks in areas like RDMA, topologies, congestion control
Experience debugging network issues in large-scale deployments
Familiarity with CUDA programming and/or GPUs
Experience with Deep Learning frameworks such as PyTorch and TensorFlow.

This position is open to all candidates.

8321604

27/08/2025

Senior System Software Engineer, NCCL - Partner Enablement

Confidential company

Location: Tel Aviv-Yafo and Yokne'am

Job Type: Full Time

We are leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.
Come work for the team that brought you NCCL, NVSHMEM & GPUDirect. Our GPU communication libraries are crucial for scaling Deep Learning and HPC applications! We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL. Most DL/HPC applications run on large clusters with high-speed networking (InfiniBand, RoCE, Ethernet). This is an outstanding opportunity to gain an end-to-end understanding of the AI networking stack. Are you ready to contribute to the development of innovative technologies and help realize our company's vision?
What you will be doing:
Engage with our partners and customers to root-cause functional and performance issues reported with NCCL
Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
Document and conduct trainings/webinars for NCCL
Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support.

Requirements:
B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience. Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
Experience working with engineering or academic research community supporting HPC or AI
Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
Expert in Linux fundamentals and a scripting language, preferably Python
Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
Adaptability and passion to learn new areas and tools
Flexibility to work and communicate effectively across different teams and timezones
Ways to stand out from the crowd:
Experience conducting performance benchmarking and developing infrastructure on HPC clusters. Prior system administration experience, especially for large clusters. Experience debugging network configuration issues in large-scale deployments
Familiarity with CUDA programming and/or GPUs. Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow
Deep understanding of technology and passion for what you do.

This position is open to all candidates.

8321595

04/09/2025

Senior Software Engineer, Fabric Networking - GPU

Confidential company

Location: More than one

Tel Aviv-Yafo

Ra'anana

Yokne'am

Job Type: Full Time

We are leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. We are looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.
We are looking for highly motivated Senior Software Engineers to work on our GPU NVLink Fabric Networking team. You'll be part of a team responsible for defining next-generation communication standards and products, building on our current NVLink and NVSwitch technology.
What you will be doing:
Design, develop, and maintain system-level software to enable high-performance GPU-to-GPU communication.
Collaborate closely with cross-functional teams, including hardware, firmware, and system software, to build and deliver next-generation GPU networking solutions.
Contribute to scalable and reliable GPU fabric architecture for large compute clusters.
Align software development with customer needs and real-world deployment environments.

Requirements:
B.S./M.S./Ph.D. in Computer Science or a related field with 5+ years of relevant experience.
Excellent C/C++ programming and debugging skills, with some familiarity with Python.
Experience writing software applications that interface with device drivers and expose associated hardware functionality.
Solid understanding of computer system architecture, operating system and kernel internals.
Experience with Linux development; familiarity with Windows is a plus.
Background in multi-core / multi-process / multi-threaded programming environment.
Strong understanding of networking fundamentals and high-performance interconnects (e.g., InfiniBand, Ethernet).
Familiarity with OS virtualization technologies like KVM/QEMU/Hyper-V, etc.
Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.
Ways to stand out from the crowd:
Understanding of the CUDA programming model and our company's GPUs.
Knowledge of memory coherence and consistency models.
Familiarity with static and dynamic code analysis, fuzzing, negative testing, and other techniques.

This position is open to all candidates.

8333391

26/08/2025

Senior Software Architect, Advanced Development

Confidential company

Location: Tel Aviv-Yafo and Yokne'am

Job Type: Full Time

Our company has been defining computer graphics, PC gaming, and accelerated computing for more than 25 years. With an outstanding legacy of innovation, driven by phenomenal technology and extraordinary people, we are looking for a strong technical senior architect to join us in shaping the future. Senior Architects are innovators who can translate business needs into workable technology solutions. Their expertise is deep and broad. They are hands-on, producing both detailed technical work and high-level architectural designs.
As a Senior Architect in the Advanced Development team, you will explore technological challenges in accelerated networking and in building AI data centers, and research new transport functions and semantics for optimizing AI workloads. You will also lead architectural and development efforts across numerous technological fields related to the modern data center, such as distributed AI and deep learning solutions, data analytics, High Performance Computing (HPC), Software Defined Networking (SDN), virtualization, storage, and more.
What you'll be doing:
Enhance our company's GPU Networking offerings for accelerating AI workloads, such as our company's Dynamo or NIXL.
Identify and evaluate new technologies, innovations and partner relationships for alignment with our technology roadmap and business value.
Lead architecture and design of such technologies.
Lead proof-of-concept development to evaluate and drive such technologies.

Requirements:
M.Sc. or Ph.D. in Computer Science, Electrical or Computer Engineering from a leading university (or equivalent experience).
12+ years of industry experience (or equivalent) in systems architecture or related fields.
Experienced in virtualization, networking and storage.
Experienced in either Windows or Linux drivers, with a very good background in the other OS.
Deep understanding of performance profiling and optimization techniques, together with defining and using HW offloads.
A teammate with a can-do attitude, high energy and excellent interpersonal skills.
Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.
Ways to stand out from the crowd:
Proven research track record.
Experience with and passion for system architecture: CPU, GPU, memory, storage, and networking.
Stellar communication skills.
Knowledge of Deep Learning frameworks and AI communication libraries (NCCL, UCX, MPI and equivalents).

This position is open to all candidates.

8319698

26/08/2025

Senior Software Architect, AI Cloud

Confidential company

Location: More than one

Job Type: Full Time

Our company seeks a highly motivated Senior AI Cloud Software Architect to join our team in Israel and work on innovative technologies shaping the future of AI and cloud. We are developing RDMA transport protocols within the Networking software architecture team at our company. Efficient and fast communication between GPUs directly impacts end-to-end application performance. This impact continues to grow with the increasing scale of next-generation systems. This is an outstanding opportunity to advance the state of the art, break performance barriers, and deliver platforms the world has never seen before. Are you ready to build the new and innovative technologies that will help realize our company's vision?
What you'll be doing:
Translate requirements into vision, architecture, and roadmap
Design infrastructure to support programmable improvements to RDMA protocols for E/W AI networking stack
Collaborate with multi-functional teams to innovate and deliver networking solutions
Explore innovative solutions in HW and SW for our next-generation platforms as part of programmable RDMA architecture
Build proofs-of-concept, conduct experiments, and perform quantitative modeling to evaluate and drive new innovations.

Requirements:
Bachelor's or Master's degree in Computer Science or equivalent experience
8+ years of proven experience in the field
Background in networking stack and protocols such as RDMA, TCP/IP, and InfiniBand
Strong articulation skills for crafting and improving technical documents and the ability to engage with a globally distributed engineering team
Eagerness to learn new technologies and constantly improve your expertise.

This position is open to all candidates.

8320151

01/09/2025

Senior AI Network System Architect

Confidential company

Location: Tel Aviv-Yafo and Yokne'am

Job Type: Full Time

Our technology has no boundaries! We are building the world's most groundbreaking and state-of-the-art accelerated computing platforms. Because of our work, scientists, researchers, and engineers can advance their ideas. We pioneered a supercharged form of computing loved by the fastest-paced computer users in the world: scientists, designers, artists, and gamers.
We seek a highly motivated Senior AI Network System Architect to join our team of experts and help shape the foundational infrastructure for the AI revolution. Our next-generation networking systems are at the forefront of connecting and powering the world's most advanced AI clusters. As a key member of our architecture team, you will be responsible for a wide range of critical activities, from deep technical analysis and performance modeling to strategic architectural studies, ensuring our company continues to innovate and lead.
What You'll Be Doing:
Define, develop, and execute cutting-edge benchmarks and workloads to analyze system performance, identify bottlenecks, and drive optimizations across our hardware and software stack.
Drive the direction of our future products by performing deep-dive analysis of system architectures and solutions to assess their performance, efficiency, and value proposition.
Develop and validate sophisticated performance and network simulation models, correlating them with real-world hardware to predict and analyze the behavior of future systems.
Analyze and optimize the entire AI stack, from communication libraries (like NCCL) and system software to the underlying network fabric, developing Proof-of-Concepts (POCs) for new features and improvements.
Conceptualize next-generation networking architectures driven by emerging DL and AI technologies.
Collaborate with multi-functional teams, including other architecture teams, logic design, system software, firmware, and DL research teams, to ensure the successful execution of our vision.

Requirements:
M.Sc. or Ph.D. degree in Computer Science, Computer Engineering, or Electrical Engineering, or equivalent experience.
6+ years of relevant industry or research experience in high-performance computing, computer architecture, or computer networks.
Excellent understanding of large-scale system behavior and the effect of distributed computing workloads on network and system performance.
Proven experience in simulation-based performance analysis or benchmarking.
Exceptional analytical, problem-solving, and systems-thinking skills, with the ability to translate complex technical data into strategic architectural insights.
Hands-on programming skills in Python and/or AI frameworks for system analysis, automation, and modeling.
Ability to thrive in a fast-paced, dynamic environment and work concurrently with multiple groups across the organization.
Ways To Stand Out From The Crowd:
Expertise in the architecture and system-level requirements of large-scale, distributed DL workloads (e.g., LLMs, Generative AI for vision).
Deep understanding of communication libraries such as NCCL, UCX, or UCC.
Expertise in network protocols (Ethernet, InfiniBand, RoCE) and large-scale network topologies.
Experience with industry-standard AI benchmarks (e.g., MLPerf) and our company's frameworks (e.g., NeMo) on large-scale clusters.

This position is open to all candidates.

8327544

04/09/2025

Principal Software Architect, GPU Networking Research

Confidential company

Location: Tel Aviv-Yafo and Yokne'am

Job Type: Full Time

Our company has been defining computer graphics, PC gaming, and accelerated computing for more than 25 years. With an outstanding legacy of innovation, driven by phenomenal technology and extraordinary people, we are looking for a strong technical principal architect to join us in shaping the future. Principal Architects are innovators who can translate business needs into workable technology solutions. Their expertise is deep and broad. They are hands-on, producing both detailed technical work and high-level architectural designs.
As a Principal Architect in the Advanced Development team, you will explore technological challenges in accelerated networking and in building AI data centers, as well as research new transport functions and semantics for optimizing AI workloads. You will also lead architectural and development efforts across numerous technological fields related to the modern data center, such as distributed AI and deep learning solutions, data analytics, High Performance Computing (HPC), Software Defined Networking (SDN), virtualization, storage, and more.
What you'll be doing:
Enhance our company's future GPU Networking offerings for accelerating AI workloads.
Lead vision, architecture and design of such technologies.
Lead proof-of-concept development to evaluate and drive such technologies.
Identify and evaluate new technologies, innovations and partner relationships for alignment with our technology roadmap and business value.
Work with the community and maintainers to drive strategic technologies.

Requirements:
M.Sc. or Ph.D. in Computer Science, Electrical or Computer Engineering from a leading university (or equivalent experience).
15+ years of industry experience (or equivalent) in systems architecture or related fields.
Experienced in virtualization, networking and storage.
Experienced in either Windows or Linux drivers, with a very good background in the other OS.
Deep understanding of performance profiling and optimization techniques, together with defining and using HW offloads.
A teammate with a can-do attitude, high energy and excellent interpersonal skills.
Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.
Ways to stand out from the crowd:
Proven research track record.
Experience with and passion for system architecture: CPU, GPU, memory, storage, and networking.
Stellar communication skills.
Knowledge of Deep Learning frameworks.

This position is open to all candidates.

8333305

01/09/2025

Senior Software Architect, Advanced Development

Confidential company

Location: Tel Aviv-Yafo and Yokne'am

Job Type: Full Time

Our company has been defining computer graphics, PC gaming, and accelerated computing for more than 25 years. With an outstanding legacy of innovation, driven by phenomenal technology and extraordinary people, we are looking for a strong technical senior architect to join us in shaping the future. Senior Architects are innovators who can translate business needs into workable technology solutions. Their expertise is deep and broad. They are hands-on, producing both detailed technical work and high-level architectural designs.
As a Senior Architect in the Advanced Development team, you will explore technological challenges in accelerated networking and in building AI data centers, and research new transport functions and semantics for optimizing AI workloads. You will also lead architectural and development efforts across numerous technological fields related to the modern data center, such as distributed AI and deep learning solutions, data analytics, High Performance Computing (HPC), Software Defined Networking (SDN), virtualization, storage, and more.
What you'll be doing:
Identify and evaluate new technologies, innovations and partner relationships for alignment with our technology roadmap and business value.
Lead architecture and design of such technologies.
Lead proof-of-concept development to evaluate and drive such technologies.

This position is open to all candidates.

8327863

26/08/2025

Senior Software Engineer, Deep Learning Inference

Confidential company

Location: Tel Aviv-Yafo

Job Type: Full Time

Our company has been at the forefront of the deep learning revolution, pioneering innovations that have transformed the entire field. As the leading provider of GPUs and AI computing platforms, our company has empowered researchers and engineers worldwide to accelerate breakthroughs in artificial intelligence.
We seek a versatile Senior Software Engineer who is passionate about performance optimization and generative AI. Our team builds software solutions that enable efficient inference on the latest and greatest generative AI models. We tackle problems at all levels of the stack, from server-level request batching to GPU kernel fusion, and collaborate with teams across diverse disciplines to push our company's hardware to its full potential.
What you'll be doing:
Cooperate with research teams to onboard new LLMs and VLMs into our company's open-source AI runtimes
Optimize inference workloads using sophisticated profiling and simulation tools
Build SOLID, extendable inference software systems, and refine robust APIs
Implement and debug low-level GPU code to harness the latest HW features
Own end-to-end inference acceleration features and work with teams around the world to deliver production-grade products.

Requirements:
B.Sc., M.Sc. or equivalent experience in Computer Science or Computer Engineering
5+ years of relevant hands-on software engineering experience
Profound knowledge of software design principles
Strong proficiency in at least one system and one scripting language
Strong grasp of machine learning concepts
People person with excellent communication skills who enjoys collaboration and teamwork.
Ways to stand out from the crowd:
Familiarity with our company's DL software stack, e.g. Triton Inference Server, TensorRT-LLM, and Model Optimizer
Proven track record of performance modeling, profiling, debugging, and development in a performance-critical setting with our company's accelerators.
Familiarity with LLM quantization, fine-tuning, and caching algorithms
Proficiency in GPU kernel programming (CUDA or OpenCL)
Prior experience working on a large software project with 50+ contributors.

This position is open to all candidates.

8320244

27/08/2025

Senior HPC DevOps Engineer

Confidential company

Location: Yokne'am

Job Type: Full Time

We are looking for an experienced HPC DevOps Engineer to help us build the supercomputers and HPC clusters of the future. As a Senior HPC DevOps Engineer, you'll be a key player in groundbreaking advancements in artificial intelligence and GPU computing. Your expertise will drive the latest breakthroughs, providing insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and Deep Learning software and hardware platforms, and with many scientific researchers, developers, and customers to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop and bring up large-scale performance platforms.
What you'll be doing:
Innovate and Implement: Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems.
Infrastructure as Code (IaC): Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments.
Streamline CI/CD Pipelines: Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes.
Automate Everything: Develop automation scripts and tools to automate deployment, configuration management, and operational monitoring.
Enhance Monitoring: Deploy advanced monitoring solutions for servers, networks, and storage to ensure seamless operations.
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
Lead and Educate: Serve as a technical resource, developing and sharing best practices with internal teams.
Drive Innovation: Support R&D activities and engage in proof of concepts (POCs) and proof of values (POVs) for future improvements.

Requirements:
B.Sc. in Computer Science, Engineering, or a related field with 5+ years of experience.
Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles.
Familiarity with Jenkins, Ansible, Puppet/Chef.
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking and OS-level security.
Deep understanding of networking protocols such as InfiniBand and Ethernet.
Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes.
Experience with multiple storage solutions like Lustre, GPFS, ZFS, and XFS.
Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix).
Familiarity with cloud platforms (AWS, Azure, Google Cloud).
Ways to stand out from the crowd:
Architectural Insight: Knowledge of CPU and/or GPU architecture.
Container Expertise: Understanding of Kubernetes and container-related microservice technologies.
GPU Focus: Experience with GPU-focused hardware/software (DGX, CUDA).
RDMA Fabrics: Background with RDMA (InfiniBand or RoCE) fabrics.

This position is open to all candidates.

8321669
