AI and HPC Cluster Group Manager

16/04/2024
This job has been marked by the employer as no longer active.
Location: Yokne'am and Ra'anana
Job Type: Full Time
Similar jobs that may interest you
Collected from a website
7 days
Location: Yokne'am
Job Type: Full Time
We are looking for a talented Senior HPC and AI Networking Performance Research and Analysis Engineer to join our Performance group.

The ideal candidate will profile and analyze AI workloads on large-scale GPU and CPU clusters for distributed deep learning LLM training, focusing on collective communication and networking.

You will work with many types of HW and platforms, such as HCAs, switches, CPUs, GPUs, and systems.

You will use and develop performance analysis tools and methodologies to dive deeply into the details and understand performance expectations, limitations, and bottlenecks.

What you'll be doing:

Experiment with and research AI workloads and DL models tailored for large-scale deep learning LLM training on our supercomputers, with a focus on high-performance networking.

Benchmark, profile, and analyze performance to find bottlenecks and identify areas for improvement and optimization, with a strong emphasis on networking aspects.

Implement performance analysis tools.

Collaborate with many teams, from HW to SW, to provide performance analysis insights.

Define performance test plans, set performance expectations for new technologies and solutions, and work to reach the performance targets.
Requirements:
What we need to see:

B.Sc. in Computer Science or Software Engineering.

5+ years of experience with high-performance Networking (RDMA, MPI).

Demonstrated Performance Analysis skills and methodologies.

Experience with our GPUs, CUDA library, deep learning frameworks like TensorFlow or PyTorch, combined with expertise in networking collective communication libraries (such as NCCL) and protocols (such as RoCE and RDMA).

Fast self-learner with strong analytical and problem-solving skills.

Programming languages: Python, Bash, and C.

Experience with Linux OS distros.

Team player with good communication and interpersonal skills.

Ways to stand out from the crowd:

In-depth knowledge and experience with AI workloads and benchmarking for distributed LLM training.

Knowledge of the CUDA and NCCL libraries.

Knowledge of congestion control algorithms.

In-depth System knowledge and understanding (Intel / AMD / ARM CPUs, our GPUs, HCA, Memory, PCI).

Strong Performance Analysis skills and methodologies using modern tools.
This position is open to all candidates.
 
7755276
27/05/2024
Confidential company
Location: Ra'anana
Job Type: Full Time
As a team member, you will be responsible for developing enterprise-grade software for monitoring and managing the world's largest supercomputers and data centers, built on our GPU and networking hardware. This role offers you an excellent opportunity to deliver production-grade solutions, get hands-on with ground-breaking technology, and work closely with technical leaders solving some of the biggest challenges in managing large-scale networking and computing infrastructure.

What you'll be doing:

The team develops infrastructure for monitoring and gathering telemetry from production environments running on the world's largest supercomputers and data centers.

The work environment is dynamic and challenging; we are innovating and inventing software products at the forefront of technology in terms of performance, scalability, and features.

Our team works closely with other engineering teams to co-design new features and software APIs.
Requirements:
What we need to see:

B.Sc. or equivalent experience in computer science / software engineering.

5 years of programming experience in Python and C/C++.

3 years of experience with Linux environments and tools.

Deep knowledge of networking protocols (InfiniBand, Ethernet).

Expert knowledge in computer architecture and operating systems.

Experience in performance optimizations.

Ways to stand out from the crowd:

You have a positive attitude and work well with others.

Demonstrated use of creative ideas to solve challenging problems.

Knowledge of RDMA technology.
This position is open to all candidates.
 
7737006
26/05/2024
Location: Tel Aviv-Yafo and Ra'anana
Job Type: Full Time
Join our education-services team as a dynamic technical trainer and content developer. We're seeking an exceptional individual with expertise in networks, server administration, containerized environments, storage and orchestration tools to lead the delivery and development of technical content related to our AI infrastructure and software solutions. As a key member of our team, you'll collaborate with internal and external experts to design and deliver lab-based training that empowers customers and partners to manage and optimize accelerated computing solutions across diverse workloads, including deep learning, data science, and high-performance computing.

What you'll be doing:

Training Delivery: Conduct engaging face-to-face and remote training sessions for our customers and partners, with flexibility for up to 25% travel.

Content Development: Craft cutting-edge training materials aligned with our innovative AI and data center technologies.

Collaboration: Work closely with domain experts to develop course proposals that meet market demands and customer needs.

Lab-Based Training: Design interactive lab exercises enriched with diagrams, videos, and hands-on experiences.

Instructor Support: Develop comprehensive instructor resources, certification criteria, and assessment materials.
Requirements:
What we need to see:

B.Sc. or M.Sc. in computer science, mathematics, engineering or equivalent experience.

Minimum 5 years of professional experience, including at least 2 years in data center-related technologies (networking, server administration, storage, virtualization, accelerated computing).

Hands-on experience administering Linux servers, scripting (Bash, csh, etc.), and managing storage arrays.

In-depth knowledge of and hands-on experience with L2 and L3 networking protocols such as MLAG, BGP, EVPN, and VXLAN.

Proficiency in cloud-based Linux environments, and relevant tools (SLURM, Kubernetes, Docker, Git, Python, CUDA, RAPIDS).

Strong presentation and written communication skills in English.

Ability to collaborate effectively across various levels of a matrixed organization.

Ways to stand out from the crowd:

Experience developing hands-on and virtual training for technical audiences.

Familiarity with our AI infrastructure solutions.

Relevant certifications: CCNP or higher, RHCSA, InfiniBand Professional, Cumulus Professional or higher, Certified Kubernetes Administrator.

Proven track record of delivering high-quality training workshops to technical audiences.
This position is open to all candidates.
 
7735192
22/05/2024
Confidential company
Location: Tel Aviv-Yafo and Yokne'am
Job Type: Full Time
We are looking to hire a deeply technical, creative, and hands-on advanced-development researcher to prototype and evaluate novel networking technologies for our Ethernet data center solutions, focusing on distributed AI training, high-performance computing, cloud, storage, network programmability, end-to-end network software stacks, and SDN. We are a world leader in high-performance computing technology, AI, and networking, with ambitious plans for future systems. This position offers the opportunity to have a real impact in a research-focused team in a dynamic company. The networking advanced-development group is chartered to research and incubate new technologies that will redefine the performance and functionality of future data centers and supercomputers. Advanced-development software researchers represent us in open-source projects, conferences, and standards bodies.

What you'll be doing:

Developing proof-of-concept implementations of new technologies, and thereafter guiding their incorporation in company products.

Evaluating and analyzing new technologies at scale.

Optimizing collective communications, distributed AI training, and HPC communications. Improving performance on our supercomputers and researching distributed solutions for next-generation data center networking.

Software development and architecture, ranging from application behavioral analysis and algorithms, network simulations, middleware, and API design, down to implementing OS subsystems, device drivers, FW, and HW modeling.

Customer engagements, academic research collaborations, publishing white papers, blogs, RFCs, and conference lectures/BOFs.
Requirements:
What we need to see:

BSc/MSc/PhD in Computer Science or Electrical Engineering, or equivalent experience.

5+ years of relevant practical experience.

Experienced in Hardware/Software/Firmware integrations.

Able to work independently.

Clear verbal and written communication with the proven ability to build consensus within a large organization.

Ways to stand out from the crowd:

Experience with distributed AI training frameworks, or collective communication libraries.

Practice with innovative network projects and collaborations like SONiC/SAI, P4.

Experience with Ethernet/IP technologies in the data center or edge.

Demonstrated ability to innovate and lead new technologies leading to product impact.
This position is open to all candidates.
 
7731364
27/05/2024
Location: More than one
Job Type: Full Time
We are spearheading the AI revolution and the creation of state-of-the-art accelerated compute platforms for global utilization. Our Network Modeling and Performance Insights group is seeking a skilled and driven Software Backend Team Lead for the backend design and infrastructure of our simulation and related services. As the infrastructure team lead, you will lead a lean and effective team developing, optimizing, and maintaining our network simulator, enabling the analysis and optimization of AI and high-performance computing workloads. You will lead the design of new service backends, empowering our networking insights. You will be responsible for integrating our services with external stakeholders, gathering their requirements, and managing technical relationships. If you thrive on unraveling intricate challenges and steering comprehensive software solutions, we want to hear from you.

What you'll be doing:

Collaborate to optimize the runtime and memory performance of our networking simulation infrastructure. This includes identifying bottlenecks and exploring innovative ideas to improve simulator performance and meet growing scale requirements.

Develop and implement algorithms, including parallel schemes and new types of simulations.

Ensure that our services remain robust and reliable under various conditions and provide a good user experience.

Integrate the network simulator with various company products and tools.

Understand the requirements of the different users of our tools and design microservice architectures to meet their use cases.
Requirements:
What we need to see:

BSc, MSc, or PhD in Computer Science (preferred), Computer Engineering, or a related field, or equivalent experience.

7+ years of overall relevant practical experience.

3+ years of team leadership or management experience.

Proficiency in C++ and in C++ code optimization techniques.

Strong computer science fundamentals.

Strong software development skills.

Experience in designing microservice architectures to accommodate varied user needs.

Familiarity with cloud computing and parallelization of computational workloads.
This position is open to all candidates.
 
7736713
26/05/2024
Location: Yokne'am
Job Type: Full Time
We are looking for a talented Performance Research Engineer to join our Performance group.

The ideal candidate will profile and analyze AI workloads on large-scale GPU and CPU clusters for distributed deep learning LLM training, focusing on collective communication and networking.

You will work with many types of HW and platforms, such as HCAs, switches, CPUs, GPUs, and systems.

You will use and develop performance analysis tools and methodologies to dive deeply into the details and understand performance expectations, limitations, and bottlenecks.

What you'll be doing:

Experiment with and research AI workloads and DL models tailored for large-scale deep learning LLM training on our supercomputers, with a focus on high-performance networking.

Benchmark, profile, and analyze performance to find bottlenecks and identify areas for improvement and optimization, with a strong emphasis on networking aspects.

Implement performance analysis tools.

Collaborate with many teams, from HW to SW, to provide performance analysis insights.

Define performance test plans, set performance expectations for new technologies and solutions, and work to reach the performance targets.
Requirements:
What we need to see:

B.Sc. in Computer Science or Software Engineering.

5+ years of experience with high-performance Networking (RDMA, MPI).

Demonstrated Performance Analysis skills and methodologies.

Experience with our GPUs, CUDA library, deep learning frameworks like TensorFlow or PyTorch, combined with expertise in networking collective communication libraries (such as NCCL) and protocols (such as RoCE and RDMA).

Fast self-learner with strong analytical and problem-solving skills.

Programming languages: Python, Bash, and C.

Experience with Linux OS distros.

Team player with good communication and interpersonal skills.

Ways to stand out from the crowd:

In-depth knowledge and experience with AI workloads and benchmarking for distributed LLM training.

Knowledge of the CUDA and NCCL libraries.

Knowledge of congestion control algorithms.

In-depth System knowledge and understanding (Intel / AMD / ARM CPUs, our GPUs, HCA, Memory, PCI).

Strong Performance Analysis skills and methodologies using modern tools.
This position is open to all candidates.
 
7735155
21/05/2024
Location: Tel Aviv-Yafo and Yokne'am
Job Type: Full Time
We are seeking a Senior Manager in Israel for our Enterprise Networking team within the IT Infrastructure organization. In this role, you will be responsible for building, designing, and supporting our global networks, ensuring that reliability, scalability, and efficiency goals are defined and met. You will lead a team of passionate network engineers to bring a data-driven approach to operations, with a focus on observability, well-defined success metrics, and continuous improvement. The successful candidate will be able to leverage their network architecture and design skills, along with an execution mindset, to translate strategic plans into incremental delivery of impact to the business. Structured thinking and problem-solving skills, along with exceptional communication abilities, will be crucial for success in this role as you build strong teams that partner with engineering and operations teams across our company.

What you will be doing:

Your main focus will be building and deploying networks to rebuild our current IL networks and scale them for the exponential growth we are going through.

Your focus will also be maturing the current support model and processes into a more data-driven, automated SRE model. Our users, and enabling them, will be your primary focus!

Build an in-house team of networking experts, covering design and architecture as well as support and operations, from the existing outsourced SMEs, providing leadership, direction, and strategy for a growing team.

Set the technical vision, strategy, and roadmap for network operations in partnership with the key infrastructure and partner teams.

Work across Network Architecture and Network Engineering, partnering to establish runbooks and regular training sessions and to ensure we build the network to be self-healing.

Understand RCAs from events and incidents, and work with our AI operations to enrich our observability tooling for a better full-stack view from the network to applications.

Influence the architecture of the company's networks, both on-prem and in the cloud.
Requirements:
What we need to see:

Bachelor's degree in Computer Science, a related technical field, or equivalent experience.

10+ years of overall experience with system design, network architecture, network engineering, and network operations, and 6+ years of leadership experience.

Experience building and growing geographically distributed teams that appreciate local operations, bring in a global perspective, and follow standards.

Ability to do technical deep-dives into code, networking, operating systems and storage, and verbally and cognitively agile enough to hold your own in a strategy discussion with our executive team and peer SMEs.

Ability to identify trends and promote solutions that solve challenges efficiently across multiple product areas.

Excellent innovative thinking, collaboration, and problem-solving skills.

Ways to stand out from the crowd:

Experience transforming network operations using software-driven methods.

Background in a Hyperscale Cloud Service Provider (public facing or not).

Knowledge of SRE principles (observability, SLOs, SLIs, logging, etc.).

Knowledge of software interface design & documentation for less technical end-users.
This position is open to all candidates.
 
7729885
22/05/2024
Location: Yokne'am
Job Type: Full Time
We are looking for a brilliant Software Engineer in the System Production Engineering group. You will be part of a team that shapes the next generation of system production solutions for NICs, Smart NICs/DPUs, and network switches. You will be working closely with team engineers, hardware architects, software architects, R&D teams, and external parties such as production lines. You will lead the development of our products' customer customization and raise the bar of software development quality in the group. We have crafted a team of extraordinary people whose mission is to push the frontiers of what is possible today and define the platform of tomorrow.

What you will be doing:
Develop test automation for our networking products, which requires an understanding of HW and SW, to provide stable, efficient, and robust production tests that enable high availability while ensuring the quality of the products shipped to customers.
Utilize test suites to find, debug, and resolve problems with the production process.
Work in a multidisciplinary environment to provide robust solutions and drive defects to resolution.
Generate statistics about code quality, complexity, and coverage.
Requirements:
What we need to see:
BA/BSc degree in Computer Science, Computer Engineering or Electrical Engineering (or related/equivalent degree).
5+ years of work experience in software development and proven programming experience in Python.
Proficient in Windows and Linux operating systems.
Ability to drive projects to full execution on time and to work under schedule pressure in a multi-project environment.
Highly motivated team player who stays up to date with new technologies and test methodologies.
Excellent verbal and written communication, both in Hebrew and English.

Ways to stand out from the crowd:
Experience with Software/Hardware products integration and HW lab measurement equipment.
Familiarity with product manufacturing flows and processes.
This position is open to all candidates.
 
7731409
23/05/2024
Location: More than one
Job Type: Full Time
We are seeking a highly innovative and ground-breaking Senior Software Architect to spearhead the advancement of AI data centers and networks. As the driving force behind this role, you will have the opportunity to craft the future of our end-to-end AI Cloud solution, revolutionizing areas such as optimized host virtualization, ground-breaking Fabric technologies, and seamless orchestration. In addition to leading these initiatives, you will also serve as an influential ambassador for NVIDIA, representing us in prominent open-source projects, influential conferences, and standards bodies. This is an outstanding chance to create a lasting impact on the world of AI and advance the forefront of technology.

What you'll be doing:

Identify and evaluate new technologies, innovations and partner relationships for alignment with our technology roadmap and business value.

Lead design of new networking applications using data plane programming to solve sophisticated networking problems in innovative ways.

Outbound work includes customer engagements, publishing white papers, blogs, RFCs, and conference lectures/BOFs.

Help define a strategic vision for our networking together with adjacent software and hardware architects.
Requirements:
What we need to see:

M.Sc. or Ph.D. in Computer Engineering, Computer Science, or Electrical Engineering, or equivalent experience.

12+ years of practical experience.

Expert level knowledge in Ethernet/IP technologies.

Experience in programming network data path.

Clear verbal and written communication, with a proven track record of building consensus within a large organization.

Ways to stand out from the crowd:

Proven track record of prototyping ideas and demonstrating their value.

Experienced in system software design and operating system fundamentals.

Background in cloud or data-center technologies.

Experienced with distributed AI technologies or HPC.
This position is open to all candidates.
 
7733194
23/05/2024
Confidential company
Location: Ra'anana
Job Type: Full Time
We are looking for a talented Software Engineer to join our Ethernet Switch Network OS team.

At NVIDIA, we have amazing GPUs that power AI applications, but they also require a high-performance network to support them. As a team member, you will have the opportunity to create innovative software that optimizes AI networks for the best performance possible. In this position you will take part in our large worldwide community, contributing new features and bug fixes, and have NVIDIA Switch products running our NOS in production in clusters around the world.

What you'll be doing:

Design and implement features as part of the company's release train on top of NVIDIA Switch products.

Be part of our R&D team and contribute code to our worldwide community.

Work in a Continuous Deployment environment of fast development/deployment cycles.

Work with experienced teams which are well known in the community.
Requirements:
What we need to see:

B.Sc. degree in Computer Science or equivalent experience.

5+ overall years of experience in technical software development.

Experience in C++ and Python programming on top of the Linux operating system.

Fast self-learner with outstanding communication and technical skills.

Motivated, responsive, and keen on process improvement.

Ways to stand out from the crowd:

Experience in software development on open-source projects.

Experience with L2 and L3 networking protocols.

Background in Linux shell scripting.

Experience with Scrum methodology, ideally as an active Scrum Master.
This position is open to all candidates.
 
7733130