Senior HPC Site Reliability Engineer

Posted 4 days ago
Location: Yokne'am and Tel Aviv-Yafo
Job Type: Full Time
We are now looking for a Senior HPC Site Reliability Engineer to join our mission and continue improving our HPC infrastructure. A meaningful part of NVIDIA's strength is our unique and advanced development tools and environments that enable our incredible pace of innovation. We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.

What you will be doing:
Provide leadership in the design and implementation of our large-scale compute cloud that enables the world's top chip modelers, designers, and deep learning experts to invent groundbreaking technology.
Identify architectural changes or completely innovative approaches in our cloud architecture and design.
Help with strategic challenges we encounter, including: effective resource utilization in a heterogeneous compute environment, evolving our private/public cloud strategy, capacity modeling, and planning for multi-year growth and scaling across our global computing environment!
Requirements:
What we need to see:
B.Sc. in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
8+ years of experience designing and operating large-scale compute infrastructure.
Experience with job schedulers such as IBM/Platform LSF, SGE, SLURM, Marathon, Chronos.
Solid understanding of cluster configuration management tools - Ansible, Puppet, Chef, Salt.
Good experience providing compute services using a public cloud (AWS, Azure, Google Cloud).
Strong script-writing skills: Python, Bash, Perl
Knowledge of and/or experience deploying PaaS microservices - Docker, Docker Swarm, Kubernetes
Understanding of fast distributed and network attached storage solutions and Linux file systems, ability to recommend and implement solutions to improve OS performance and reliability.

Ways to stand out from the crowd:
Linux certification from a well-known vendor - Red Hat, Oracle, etc.
Prior experience managing large-scale Kubernetes deployment in production.
Strong skills in modern container networking and storage architecture.
Well-known Cloud Certification(s).
This position is open to all candidates.
 
Posted 1 hour ago
Location: Yokne'am
Job Type: Full Time
We are now looking for a Senior HPC Site Reliability Engineer to join our mission and continue improving our HPC infrastructure. A meaningful part of NVIDIA's strength is our unique and advanced development tools and environments that enable our incredible pace of innovation. We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.
What you will be doing:
Provide leadership in the design and implementation of our large-scale compute cloud that enables the world's top chip modelers, designers, and deep learning experts to invent groundbreaking technology.
Identify architectural changes or completely innovative approaches in our cloud architecture and design.
Help with strategic challenges we encounter, including: effective resource utilization in a heterogeneous compute environment, evolving our private/public cloud strategy, capacity modeling, and planning for multi-year growth and scaling across our global computing environment!
Requirements:
What we need to see:
B.Sc. in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
8+ years of experience designing and operating large-scale compute infrastructure.
Experience with job schedulers such as IBM/Platform LSF, SGE, SLURM, Marathon, Chronos.
Solid understanding of cluster configuration management tools - Ansible, Puppet, Chef, Salt.
Good experience providing compute services using a public cloud (AWS, Azure, Google Cloud).
Strong script-writing skills: Python, Bash, Perl
Knowledge of and/or experience deploying PaaS microservices - Docker, Docker Swarm, Kubernetes
Understanding of fast distributed and network attached storage solutions and Linux file systems, ability to recommend and implement solutions to improve OS performance and reliability.
Ways to stand out from the crowd:
Linux certification from a well-known vendor - Red Hat, Oracle, etc.
Prior experience managing large-scale Kubernetes deployment in production.
Strong skills in modern container networking and storage architecture.
Well-known Cloud Certification(s).
This position is open to all candidates.
 
18/03/2026
Confidential company
Location: Tel Aviv-Yafo and Yokne'am
Job Type: Full Time
We are now looking for an HPC Operations Engineer to join our mission and continue improving our HPC infrastructure. A meaningful part of our strength is our unique and advanced development tools and environments that enable our incredible pace of innovation. We are looking for architects to help us evolve the way our private compute cloud is architected and optimized.

What you'll be doing:

Troubleshoot incoming support requests in a large-scale HPC environment.

Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and improve day-to-day operations through automation.

Ensure compute servers are running the correct operating system and configuration.

Troubleshoot complex issues: perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

Collaborate with specialist teams to drive issues to closure.

Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.

Directly contribute to overall quality and improve time to market for our next-generation chips.
Requirements:
What we need to see:

BS in Computer Science or a similar degree, or equivalent experience.

2+ years of experience and proficiency in administering CentOS/RHEL Linux distributions.

Understanding of container technologies like Docker.

Proficiency in Python and UNIX scripting languages such as bash.

Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.

Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.

Solid understanding of cluster configuration management tools such as Ansible.

Ways to stand out from the crowd:

Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors.

Familiarity with job scheduler administration (e.g. IBM Spectrum LSF or SLURM) and experience building/operating large-scale compute infrastructure.

Knowledge of the FlexLM license management system.

Proficiency in Perl for maintaining legacy automation scripts.

Familiarity with High-Speed Networking (InfiniBand, RDMA, RoCE etc.) and fast, distributed storage systems (Lustre, GPFS, etc.).
This position is open to all candidates.
 
Posted 4 days ago
Confidential company
Location: Yokne'am
Job Type: Full Time
Join our team as a Senior HPC Software Engineer. You'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.

What you'll be doing:

Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

Continuously improve solution provisioning and management through automation.

Detect performance issues and recommend solutions to maintain world-class service quality.

Conduct capacity management and planning to meet ongoing operational needs.

Participate in incident reviews, assist in root cause identification, and write RCA reports.

Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment - AWS, GCP, and On-prem.

Participate in the team's on-call rotation.
Requirements:
What we need to see:

B.S. degree in Computer Science or related technical field (or equivalent experience).

8+ years in building and supporting critical services.

5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).

Full-stack AI experience with deep expertise in MCP ecosystems, Carpenter, n8n orchestration, and AI-assisted development via Cursor.

Expertise with at least one major cloud service provider - AWS, GCP, Azure.

Demonstrated proficiency with end-to-end SRE capabilities and observability.

Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

Linux certification from a well-known vendor - RedHat, Oracle, etc.

Prior experience managing large-scale Kubernetes deployment in production.

Strong skills in modern container networking and storage architecture.

Hands-on background working with FlexLM and license management systems.

Hands-on experience working with Slurm/LSF environments.
This position is open to all candidates.
 
Posted 4 hours ago
Location: Yokne'am
Job Type: Full Time
We are looking for a Senior HPC and AI Cluster Administrator to join the Networking Clusters Solutions HPC/AI Infrastructure team. We are building supercomputers and AI clusters based on groundbreaking technologies. We are looking for a system administrator to be a key player, working with the most exciting computing hardware and software and contributing to the latest breakthroughs in artificial intelligence and GPU computing.
You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers to craft improved workflows and develop new, leading differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop and bring up large-scale performance platforms. Does this sound like you? If so, we would love to hear from you!
What you will be doing:
Deploy, manage and maintain large-scale HPC/AI clusters.
Manage Linux job/workload schedulers and orchestration tools.
Support and maintain continuous integration and delivery pipelines.
Troubleshoot and fix issues bottom-up, from bare metal through the operating system and software stack to the application level.
Support research & development activities and engage in PoCs/PoVs for future improvements.
Requirements:
What we need to see:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of experience.
Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software.
Experience with job scheduling workloads and orchestration tools such as Slurm and K8s.
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.) and internals, ACLs and OS-level security protection, and common protocols, e.g. TCP, DHCP, DNS, etc.
Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS. Familiarity with newer and emerging storage technologies.
Python programming and Bash scripting experience; automation and configuration management tools such as Jenkins, Ansible, GitOps.
Knowledge of networking protocols like InfiniBand and Ethernet.
Experience with virtual systems (for example VMware, Hyper-V, KVM).
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud).
Ways to stand out from the crowd:
Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA (InfiniBand or RoCE) fabrics.
Our company has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. We have a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. Our teams are composed of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, an extensive benefits package, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.
#il-hybrid
This position is open to all candidates.
 
Posted 4 days ago
Location: Yokne'am and Tel Hai
Job Type: Full Time
We are looking for a Senior Software Engineer to join the NSV (Network Solutions Validation) Tools group. As a senior team member, you will be part of a development effort of high-performing software automation systems for our Data Center environments. You will interact with NIC, OS, Switch, HCA, CPU and GPU compute as well as architects, network engineers, and developers. We drive the data growth of the world's biggest companies. With talented engineers around the globe, the work environment is dynamic, meaningful, and fast-paced. Are you ready for the challenge?

What you'll be doing:

Design and develop an automation platform used to provision, configure, and monitor HPC data centers.

Implement scalable, reliable, and maintainable services that enhance cluster visibility and improve operational efficiency.

Collaborate closely with internal and external stakeholders to understand requirements and deliver robust full-cycle solutions.

Improve stability and performance across the provisioning pipeline through architectural enhancements and code optimizations.

Troubleshoot issues in distributed environments and contribute to system observability and reliability improvements.

Work cross-functionally with architects, DevOps engineers, product managers and stakeholders to ensure high-quality releases.

Participate in code reviews, technical design discussions, and continuous improvement activities within the team.
Requirements:
What we need to see:

B.Sc. in Computer Science, Engineering, or a related field (or equivalent practical experience).

5+ years of strong hands-on experience on Linux-based platforms.

Proficient scripting and automation skills (Bash, Python, Ansible).

Background in DevOps and Network Engineering practices.

Hands-on experience with large-scale network architectures, switches/routers, OVS, SR-IOV, and network operating/management systems.

Networking expertise: Ethernet, VLANs, TCP/UDP/IP, QoS, L2/L3 protocols, BGP, EVPN/VXLAN, and common network topologies.

Practical experience with containers and cloud-native technologies (Docker, Kubernetes) and networking performance.

Experience with version control systems (Git) and CI/CD pipelines.

Independent, fast learner with strong ownership mindset, excellent debugging and problem-solving skills, and effective communication abilities.

Ways to stand out from the crowd:

Experience as a Team Lead/Scrum Master or in a similar leadership role.

Experience in planning, tracking, and delivering projects.

Familiarity with DevOps methodologies and tools (e.g., Jenkins, Ansible).

Hands-on experience with Docker and containerized environments.

Experience with agentic AI development.
This position is open to all candidates.
 
02/03/2026
Location: Yokne'am
Job Type: Full Time
The Networking Advanced Development Software team develops new groundbreaking technologies to enable new market shares for the company and tighten customer relationships. These are emerging technologies in networking and distributed computing for the booming AI factories and data centers. They span areas such as AI neural networks, Deep Learning, High Performance Computing (HPC), Storage, Cloud, SW Defined Network, Network Function Virtualization and more. We develop the solutions top-down, all the way from application behavioral analysis, to architecture definition and down to the implementation, using our world-leading devices. The development traverses any needed component - application SW, middleware SW, OS kernel subsystems, device drivers, embedded SW (Firmware) and CUDA GPU. We collaborate with partners and key customers in the analysis processes and engage with open source communities introducing our leading features.

What you'll be doing:

Design and implement solutions throughout all layers from high level application, OS and driver subsystem to firmware.

Work on impactful projects involving state-of-the-art high-performance computing hardware and software.

Provide insight and technical guidance and collaborate with peers from across the company - including software architecture, chip architecture, and engineering departments to improve our future technology.

Collaborate with our partners and customers.
Requirements:
What we need to see:

B.Sc. in Computer Science, Electrical Engineering, Computer Engineering, or a related field.

5+ overall years of industry experience in system programming or related fields.

Understanding of multi-core hardware, operating systems design, concurrency, virtual memory, caching, interrupts, device drivers, real-time

Excellent programming skills.

Ability to learn complex concepts in a fast-paced environment.

A teammate with a can-do attitude, high energy and excellent interpersonal skills.

Ways to stand out from a crowd:

Familiarity with networking protocols.

Hands-on experience with CUDA programming and GPU acceleration.

Hands-on experience with LLM serving frameworks.

Experience with open-source projects (coursework, personal, or contributions).

Working in a fast-paced and dynamic environment.
This position is open to all candidates.
 
Posted 47 minutes ago
Location: Yokne'am
Job Type: Full Time
We face the challenges of operating at scale to produce a best-in-industry solution and to continue providing unprecedented performance and reliability for our users. You will take on the challenges that come with operating and scaling our ever-growing GitLab deployment. In this role, you will design and code processes and automation tools to improve productivity in managing and administering the SCM systems and applications used by our globally distributed engineering teams.
What you'll be doing:
Take responsibility for the full SCM environment, including application, OS, and server hardware components, developing the continued automation and innovation needed for our large environment.
Create new solutions to improve the reliability and performance of our ever-growing infrastructure, and work with automated orchestration tools to deploy those improvements to hundreds of systems worldwide.
Be part of a global team: evaluate technology alternatives, work closely with other project members to specify solutions, craft schedules, and lead ongoing enhancements and support.
Lead or contribute to GitLab upgrades and migrations, architectural reviews and design docs, root cause analysis and systemic fixes.
Learn, and greatly improve the daily productivity of the world's top chip designers and software engineers.
Requirements:
What we need to see:
MS (preferred) or BS in Computer Science or a related field (or equivalent experience), with 5+ years of experience.
Deep understanding of SCM processes and large-scale, multi-site GitLab environments (experience with other SCM tools such as Perforce, Subversion, or ClearCase is a plus).
You've configured/deployed continuous integration (CI) and continuous deployment (CD) systems in your past experience.
Excellent interpreted language skills highly desired (object-oriented Perl or Python preferred), and strong software engineering process skills required.
Strong skills in scripting and object-oriented languages such as Python or Ruby, with solid software engineering practices and familiarity with design patterns.
Hands-on experience with relational databases (PostgreSQL preferred).
Experience with DevOps or system administration on Linux systems required (Rocky Linux 8, CentOS/RHEL, and Ubuntu preferred).
Strong experience with automation required (Ansible or Puppet preferred), and excellent interpersonal skills, including written and verbal communication.
You are comfortable with, and enjoy working in, dynamic and ever-evolving environments.
This position is open to all candidates.
 
Location: Tel Aviv-Yafo
Job Type: Full Time
Our company's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to our company's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.
Our company Cloud Storage provides a suite of services like the scalable our company Cloud Storage (GCS) for unstructured data, high-performance Managed Lustre for AI High-Performance Computing workloads, and the managed Network File System (NFS) service Filestore. These offerings are fundamental to our company Cloud's infrastructure, serving as the backbone for storing and accessing the vast datasets required for training and deploying AI models. In the AI era, this storage portfolio is crucial for enabling innovation, supporting compute-intensive tasks, and driving significant value for our company Cloud customers. The performance, scalability, and flexibility of these storage solutions are key to unlocking the potential of AI.
Our company Cloud accelerates every organization's ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage our company's technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to our company Cloud as their trusted partner to enable growth and solve their most critical business problems.
Responsibilities
Design and implementation of high-complexity features. Deliver robust, production-ready code for storage solutions that address the specific demands of AI/ML workloads.
Architect scalable and performant storage solutions. Make data-driven decisions to optimize system efficiency and reliability.
Provide deep-dive code reviews and technical guidance to the team. Set the standard for engineering excellence and help maintain a high quality bar for all deliverables that addresses the expectations of our company's key customers.
Identify and resolve performance bottlenecks and intricate system issues. Develop innovative, practical solutions to technical challenges that arise at the intersection of storage and ML.
Partner with product and engineering stakeholders to translate customer requirements into technical specifications, ensuring the incubation's output aligns with broader our company Cloud goals.
Requirements:
Minimum qualifications:
Bachelor's degree or equivalent practical experience.
8 years of experience in software development.
5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine tuning).
5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
Preferred qualifications:
Master's degree or PhD in Engineering, Computer Science, or a related technical field.
8 years of experience with data structures and algorithms.
3 years of experience in a technical leadership role leading project teams and setting technical direction.
3 years of experience working in a complex, matrixed organization involving cross-functional or cross-business projects.
This position is open to all candidates.
 
Posted 45 minutes ago
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a 100% hands-on Storage Services Software Engineer to join the Block Storage group. You will be a member of a team that builds next-generation block storage capabilities. You will work closely with a variety of teams and architects, including the networking team, and with external customers. You will take part in defining the software architecture and implementation of the most advanced storage services, services that will need to meet extreme performance and scalability demands! We have crafted a team of extraordinary people stretching around the globe, whose mission is to push the frontiers of what is possible today and define the platform of tomorrow.
We work, think and learn as a team. We thrive in a deeply strong environment, and we're passionate about a culture that demands innovation and the highest standards. The rewards are sweet and include collaborating with some of the smartest people in the industry, an aggressive compensation plan that rewards top performers, and the opportunity to work on products that transform the way people work and play.
What you'll be doing:
100% hands-on coding role in the C language, in both kernel and userspace.
Research, design, implement and test new and existing distributed storage services and features of NVIDIA's block storage solution, in both host and DPU environments.
Acquire an understanding of the algorithms, the technicalities and the interaction with other components across NVIDIA's block storage ecosystem.
Analyze and solve challenging bugs and customer cases in large-scale production systems, identifying issues in our own or inbox kernel modules and often in other components. Drive new solutions based on any issues that arise.
Requirements:
What we need to see:
B.Sc. or M.Sc. in Computer Science, Electrical Engineering, or a related discipline (or equivalent experience).
15+ years of experience as a senior developer, preferably in the domains of storage, networking, and/or operating systems.
Strong proficiency in C/C++ programming.
Experience with storage protocols and standards, especially NVMe.
Experience with the Linux block subsystem and I/O stack.
Proven professional experience in designing and developing distributed systems; experience in block storage and/or networking systems is an advantage.
Ability to work autonomously, with a proactive mindset and perseverance to solve day-to-day challenges.
Ability to quickly adapt to new technology and go deep into new areas.
Excellent communication skills and a collaborative mindset.
Innovative approach, identifying opportunities to improve, accelerate, and reuse existing solutions.
Knowledge of cloud computing concepts, including virtualization, scalability, and data management.
Ways to stand out from the crowd:
Linux kernel coding experience.
Linux kernel internals knowledge, including memory management, scheduling, etc.
This position is open to all candidates.
 
18/03/2026
Location: Yokne'am
Job Type: Full Time
We are looking for a Senior Networking Test Engineer with strong system‑level debugging skills to join our End‑to‑End Verification team. You will work on cutting‑edge Ethernet‑based AI clusters, owning complex issues across hardware, system software and AI workloads. NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

What you'll be doing:

Design and review test and product requirements across the Ethernet / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior.

Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.

Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.

Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.

Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.

Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.

Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data‑driven reports to stakeholders.

Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.
Requirements:
What we need to see:

B.A./B.Sc. in Computer Science, Electrical Engineering, or equivalent IT/Network/Systems experience.

5+ years of hands‑on networking or system‑level testing and debugging on Linux.

Strong Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2).

Proven production‑grade debugging experience: forming hypotheses, running experiments, and driving issues to root cause under pressure.

Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions).

Strong knowledge of AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA), including performance and correctness debugging.

Ability to read and reason about source code (C/C++/Python or similar) and collaborate closely with developers on fixes.

Solid scripting and automation skills with Bash / Python / Ansible for setup, log collection, and experiment orchestration.

Fast learner, familiar with modern AI tools and workflows, able to adapt quickly.

Excellent analytical, problem‑solving and communication skills, with strong ownership and a collaborative mindset.

Ways to stand out from the crowd:

Hands‑on debugging of collective communication libraries (for example NCCL) or large‑scale LLM training / inference clusters.

Experience with large cluster environments (tens to thousands of GPUs or nodes), including incident response and post‑mortem analysis.

Deep expertise in tuning and debugging congestion control and lossless Ethernet for AI workloads (for example DCQCN, ECN, PFC).

Familiarity with NVIDIA networking technologies (for example BlueField / BF3, ConnectX NICs) and their software stack and diagnostics.

Experience debugging issues that span multiple layers (L2/L3, transport, AI frameworks) or contributing to open‑source networking / AI systems.
This position is open to all candidates.
 