Senior SRE Engineer

8321700

Similar jobs that may interest you

28/08/2025

Senior DevOps Engineer

Confidential company

Location: Tel Aviv-Yafo and Ra'anana

Job Type: Full Time

Our company has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It's an outstanding legacy of innovation that's motivated by extraordinary technology and amazing people. We are looking for a highly motivated DevOps/SRE engineer to join the company's AIR team, working on the Digital Twin for Data Center Simulation web application. AIR enables cloud-scale efficiency by creating identical replicas of real-world data center infrastructure deployments.
What you'll be doing:
You will be part of the company's AIR team, which is building the SaaS/IaaS platform for digital twins of AI data centers.
You will be specifically responsible for the DevOps, infrastructure, and Site Reliability Engineering (SRE) requirements of AIR.
Focus on efficiency by automating repetitive workflows.
Working on microservices-based architecture.
Deploying and troubleshooting non-disruptive cloud operations with an emphasis on secure production infrastructure.
Continuous evaluation of existing systems and driving improvements.
Managing deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and/or other orchestration tools.
Day-to-day support for engineering activities with CI/CD tools like Git and Jenkins.
Multitasking efficiently across different tracks to address evolving priorities.

Requirements:
5+ years of experience in complex microservices-based architectures
Highly skilled in Kubernetes and Docker
Experience in IaaS environments: deploying, configuring, and administering Linux-based bare-metal servers
Strong networking background (VLANs, routing, VPNs)
Experience with relational databases (MySQL) and SQL
Experienced with modern deployment architecture for non-disruptive cloud operations, including blue-green and canary rollouts
Infrastructure as code (IaC) skills in frameworks like Ansible and Terraform
Expert in AWS
Knows best practices and discipline of managing and monitoring a highly available and secure production infrastructure
Ways to stand out from the crowd: 
Strong expertise in Infrastructure as a Service (IaaS)
Skills in Linux/Unix Administration
Experience with Prometheus/Grafana.
Experience with APM tools like Dynatrace, Datadog, AppDynamics, New Relic, etc.
Implemented robust metrics collection and alerting infrastructure.

This position is open to all candidates.

8324028

26/08/2025

Senior Software Engineer - Infrastructure

Confidential company

Location: Tel Aviv-Yafo and Ra'anana

Job Type: Full Time

We are searching for a highly motivated software engineer for the company's NetQ team, which is building a next-generation network management and telemetry system in the cloud, using modern design principles at internet scale. NetQ is a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of your Cumulus fabrics in real time. NetQ utilizes telemetry and delivers actionable insights about the health of your data center network, integrating the fabric into your DevOps ecosystem.
What you'll be doing:
Building and maintaining infrastructure components like NoSQL DBs (Cassandra, Mongo), TSDB, Kafka, etc.
Maintaining CI/CD pipelines that automate the build, test, and deployment process, and addressing their bottlenecks. Managing tools and automating redundant manual workflows via Jenkins, Ansible, Terraform, etc.
Enabling scanning for and handling of security CVEs in infrastructure components.
Enabling triage and handling of production issues to improve system reliability and service for customers.

Requirements:
Bachelor's degree or equivalent experience.
5+ years of experience in complex microservices-based architectures.
Highly skilled in Kubernetes and Docker/containerd.
Experienced with modern deployment architecture for non-disruptive cloud operations, including blue-green and canary rollouts.
Automation expert with hands-on skills in frameworks like Ansible and Terraform.
Strong knowledge of NoSQL DBs (preferably Cassandra), Kafka/Kafka Streams, and Nginx.
Expert in AWS, Azure, or GCP.
Good programming background in languages like Go, Java, or Python.
Knows best practices and discipline of managing a highly available and secure production infrastructure.
Ways to stand out from the crowd:
Experience with APM tools like Prometheus, Grafana, Dynatrace, Datadog, AppDynamics, New Relic, etc.
Skills in Linux/Unix Administration.
Implemented highly scalable log aggregation systems in the past using the ELK stack or similar.
Hands-on experience configuring and managing the OpenTelemetry (OTel) Collector, including OTLP protocol setup and integration with observability backends.
Implemented robust metrics collection and alerting infrastructure.

This position is open to all candidates.

8320283

20/08/2025

Senior Software Engineer

Confidential company

Location: Tel Aviv-Yafo

Job Type: Full Time

We are looking for a talented, highly motivated, experienced SW engineer to join one of our growing, inspiring development teams.
You will work on a multi-tenant, high-scale, distributed SaaS ecosystem on top of a k8s platform, which is used for managing the cloud security services infrastructure, customers' self-service configuration, monitoring and reporting, analytics, and more.

As a SW engineer, you will manage and work with the different Engineering teams and architects in order to design, develop, monitor, scale and optimize the large-scale architecture of a winning SaaS security service.

What will you do?

Implement our next-generation back-end infrastructure to help us scale our SaaS-based infrastructure.

Be part of a team building tools to make our infrastructure scalable and robust.

Leverage Generative AI tools for code generation, optimization, debugging, documentation, and prototyping.

Continuously research and integrate new AI-driven developer productivity tools.

Design and develop an always-available cloud-based SaaS platform in AWS.

Lead and design the development of robust CI/CD pipelines for containerized applications running on Kubernetes.

Design and build strong application and system monitoring and automated self-healing procedures.

Maintain and support application deployments, building new systems and upgrading existing ones.

Working closely with all the Engineering and DevOps teams, taking full responsibility and ownership from conception to post-deployment in a collaborative, fast-paced environment.

Requirements:
6+ years of experience in infrastructure and Backend SW development roles.

Experience managing infrastructure on AWS.

Experience with architecture methodologies and paradigms like micro-services, distributed systems and more.

Experience integrating and actively using GenAI tools (e.g., GitHub Copilot, Claude, ChatGPT, etc.) in daily development is a must.

Open-minded to new workflows and AI-driven innovation.

An agile/DevOps way of thinking.

Experience with CI/CD tools (Jenkins, Argo, Nexus, and similar).

Experience with the K8S platform and tools (Helm charts and similar).

Experience with the following technologies/tools/fields: Elasticsearch, ClickHouse, messaging (Kafka, NATS, Redis, etc.), monitoring and visibility (Prometheus, Grafana, Loki, etc.).

Programming languages: Golang/Java.

Functioning well under pressure.

Strong problem-solving ability and a "Can-do approach".

Working in an agile environment.

Excellent communication and interpersonal skills.

This position is open to all candidates.

8312368

27/08/2025

Senior Software Engineer, DOCA

Confidential company

Location: Ra'anana and Yokne`am

Job Type: Full Time

We are looking for a Senior Software Engineer. You will work with highly experienced engineers to provide outstanding SmartNIC products for cloud computing, research, medical, automotive, finance, weather, telco, and more. We are developing some of the core libraries of the company's DOCA SDK, rapidly expanding DOCA functionality and use cases. With DOCA, developers can program the data center infrastructure by creating software-defined, cloud-native, secured, HW-accelerated services.
We also play a significant part in the Linux Foundation DPDK (dpdk.org) project, and in particular expand the company's Mellanox PMD, providing the framework and common API for fast packet processing in user space. Our goal is to enable breakthrough network performance using our company's SmartNIC hardware capabilities, and to address the performance, scale, and security demands of modern software-defined enterprise data centers and public cloud infrastructure.
What you'll be doing:
You will architect, design, and develop the next-generation technology in network acceleration, as well as work with best-in-class technical leaders in this domain.
Engage with customers and architects to understand the requirements and derive the software design accordingly.
Collaborate with other engineering teams that develop upper-layer applications such as virtual switches (OVS, VPP, etc.) and lower layers such as drivers, kernel, FW, and HW.

Requirements:
B.Sc. (or equivalent experience) in computer science/software engineering
5+ years of confirmed experience programming in C/C++
5+ years of confirmed experience with the Linux environment and tools
Deep experience with networking protocols, mainly Ethernet, and with security protocols
Experience with virtualization technologies
Strong analytical, debugging, and problem-solving skills
Deep knowledge of computer architecture and operating systems.
Experience in performance optimizations
Ways to stand out from the crowd:
Knowledge and experience in DPDK
Knowledge and experience with designing SDKs
Open-source software contributor to relevant projects (OVS, DPDK, Linux kernel, etc.)
A positive demeanor, a growth mindset, and excellent interactions with colleagues.

This position is open to all candidates.

8321760

04/09/2025

Senior Software Engineer, AI Platform

Confidential company

Location: Tel Aviv-Yafo

Job Type: Full Time

Run:ai, now part of our company, has evolved AI infrastructure by merging GPU virtualization with Kubernetes-native capabilities. Our world class AI platform allows organizations to improve productivity and efficiency for data scientists and machine learning engineers. With deep Kubernetes expertise and a focus on innovation, we are dedicated to developing cutting-edge technologies, delivering the best user experience for our customers, and providing deep visibility into workload performance through rich metrics that help users optimize their AI workloads. We are looking for highly skilled software engineers to join our Platform Group and help shape the future of AI infrastructure.
The role of a Senior Software Engineer in the Platform Group is to design and develop scalable, high-performance systems that support the next generation of AI workloads. You will collaborate with experts across domains, tackle complex challenges, and drive innovations that empower our users to push the limits of AI capabilities.
What you'll be doing:
Designing and developing enterprise-grade systems with a strong focus on scalability, reliability, and performance.
Building and optimizing microservices-based architectures using Kubernetes and cloud-native technologies.
Collaborating closely with backend engineers, product managers, and other collaborators to deliver impactful solutions.
Writing clean, maintainable, and testable code in Go.
Conducting code and design reviews to uphold high-quality standards and mentor team members.

Requirements:
B.Sc. in Computer Science or a related field.
5+ years of experience in backend software development, including system design and architecture.
Proficiency in at least one backend programming language (we write in Go).
Strong understanding of microservices architecture, RESTful APIs, and relational databases.
Deep familiarity with Kubernetes and the cloud-native ecosystem.
Demonstrated ability to tackle complex technical challenges and deliver high-quality solutions.
Ways to stand out from the crowd:
Expertise in Kubernetes internals and advanced cloud-native technologies.
Hands-on experience with HPC or AI/ML platforms.
Familiarity with AI inference workloads and performance optimization.
Proficiency in Linux, with knowledge in networking, security, storage, and virtualization.

This position is open to all candidates.

8333587

02/09/2025

Senior Software Engineer

Confidential company

Location: Tel Aviv-Yafo

Job Type: Full Time

Run:ai, now part of our company, has evolved AI infrastructure by merging GPU virtualization with Kubernetes-native capabilities. Our world class AI platform allows organizations to improve productivity and efficiency for data scientists and machine learning engineers. With deep Kubernetes expertise and a team of extraordinary individuals, we are looking for a highly skilled Senior Software Engineer to join the team and help shape the future of AI technology. The role of a Senior Software Engineer in the Run:ai group is to design and develop scalable, high-performance systems that support the next generation of AI workloads and infrastructure. You will collaborate with experts across domains, tackle complex challenges, and drive innovations that empower our users to push the limits of AI capabilities.
What you'll be doing:
Designing and developing enterprise-grade systems with a strong focus on scalability, reliability, and performance.
Building and optimizing microservices-based architectures using Kubernetes and cloud-native technologies.
Collaborating closely with backend engineers, product managers, and other stakeholders to deliver impactful solutions.
Writing clean, maintainable, and testable code in Go, contributing to our CI/CD pipelines.
Conducting code and design reviews to uphold high-quality standards and mentor team members.

Requirements:
B.Sc. in Computer Science or a related field (or equivalent experience).
5+ years of experience in backend software development, including system design and architecture.
Proficiency in at least one backend programming language (Go preferred).
Strong understanding of microservices architecture, RESTful APIs, and relational databases.
Deep familiarity with Kubernetes and the cloud-native ecosystem.
Demonstrated ability to tackle complex technical challenges and deliver high-quality solutions.
Ways to stand out from the crowd:
Expertise in Kubernetes internals and advanced cloud-native technologies.
Hands-on experience with HPC, GPU virtualization, or AI/ML platforms.
Experience working in Linux environments with knowledge of networking, security, and virtualization.
Contributions to open-source projects or active participation in tech communities.
Agile approach and familiarity with standard methodologies.

This position is open to all candidates.

8329770

Staffing / recruitment agency

07/09/2025

Senior MLOps Engineer

Confidential company

Location: Tel Aviv-Yafo

Job Type: Full Time

Realize your potential by joining the leading performance-driven advertising company!
As a Senior MLOps Engineer in the Infra group, you'll play a vital role in developing, enhancing, and maintaining highly scalable machine-learning infrastructure and tools.
About Algo platform:
The objective of the algo platform group is to own the existing algo platform (including health, stability, productivity and enablement), to facilitate and be involved in new platform experimentation within the algo craft and lead the platformization of the parts which should graduate into production scale. This includes support of ongoing ML projects while ensuring smooth operations and infrastructure reliability, owning a full set of capabilities, design and planning, implementation and production care.
The group has deep ties with both the algo craft as well as the infra group. The group reports to the infra department and has a dotted line reporting to the algo craft leadership.
The group serves as the professional authority when it comes to ML engineering and ML ops, serves as a focal point in a multidisciplinary team of algorithm researchers, product managers, and engineers and works with the most senior talent within the algo craft in order to achieve ML excellence.
How you'll make an impact:
As a Senior MLOps Engineer, you'll bring value by:
Developing, enhancing, and maintaining highly scalable machine-learning infrastructure and tools, including CI/CD, monitoring, alerting, and more.
Taking end-to-end ownership: designing, developing, deploying, measuring, and maintaining our machine learning platform, ensuring high availability, high scalability, and efficient resource utilization.
Identifying and evaluating new technologies to improve the performance, maintainability, and reliability of our machine learning systems.
Working in tandem with the engineering-focused and algorithm-focused teams to improve our platform and optimize performance.
Optimizing machine learning systems to scale and utilize modern compute environments (e.g., distributed clusters, CPU and GPU) and continuously seeking optimization opportunities.
Building and maintaining tools for automation, deployment, monitoring, and operations.
Troubleshooting issues in our development, production, and test environments.
Directly influencing the way billions of people discover the internet.

Requirements:
Experience developing large-scale systems. Experience with filesystems, server architectures, distributed systems, SQL, and NoSQL. Experience with Spark and Airflow or other orchestration platforms is a big plus.
Highly skilled in software engineering methods. 5+ years of experience.
Passion for ML engineering and for creating and improving platforms
Experience with designing and supporting ML pipelines and models in a production environment
Excellent coding skills in Java and Python
Experience with TensorFlow is a big plus
Possess strong problem solving and critical thinking skills
BSc in Computer Science or related field.
Proven ability to work effectively and independently across multiple teams and beyond organizational boundaries
Deep understanding of computer science fundamentals: object-oriented design, data structures, systems, application programming, and multithreaded programming
Strong communication skills to be able to present insights and ideas, and excellent English, required to communicate with our global teams.
Bonus points if you have:
Experience in leading Algorithms projects or teams.
Experience in developing models using deep learning techniques and tools
Experience in developing software within a distributed computation framework.

This position is open to all candidates.

8335971

26/08/2025

Senior Software Development Engineer

Confidential company

Location: Ra'anana

Job Type: Full Time

Our company has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an employee, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
We are looking for a Senior Software Development Engineer to contribute to the cutting-edge Network Management System for the most powerful supercomputers in the world. Our team is growing, and we are looking for hardworking and self-motivated engineers to develop and verify advanced, high-scale SDN management solutions. You will be part of a dynamic team, working with amazing people.
What you'll be doing:
You will have a significant impact in developing the next-generation Unified Fabric Manager (UFM) product.
Help drive the underlying technology stack and implementation methodology, ensuring it competes at a world-class level.
Collaborate closely with other SW R&D teams and SW Architects to successfully implement ambitious projects.
Engage in performance tuning and automation to build a flawless operational environment.
Design and implement micro-services architecture to support our advanced, high-scale SDN management solutions.
Work in an agile environment, ensuring continuous improvement and innovation.

Requirements:
We are looking for candidates with the following proven qualifications and experience:
B.Sc. or equivalent experience in Computer Science or a related field.
10+ years of hands-on experience with system software design, development, and maintenance, particularly in C/C++ programming.
Debugging and performance analysis skills are strictly required.
Python programming experience is a significant advantage.
Proficiency with Docker, Kubernetes, and other orchestration tools.
Background with RESTful web services and experience with Continuous Integration and Continuous Delivery.
Excellent interpersonal and written communication skills to foster collaboration and inclusion.
Ways to stand out from the crowd:
Extensive knowledge and deep understanding of Linux system programming.
A track record of solving sophisticated problems with elegant solutions.
Demonstrated ability to deliver complex projects in previous roles.
Experience building infrastructures and tools to speed up development, testing, and release.
Experience in agile software development methodology.

This position is open to all candidates.

8319866

27/08/2025

Senior Software Engineer

Confidential company

Location: Ra'anana

Job Type: Full Time

We are looking for an experienced SW engineer with the desire and ability to contribute to and lead the cutting-edge Network Management System for the most powerful supercomputers in the world. Our team is growing, and we are looking for hardworking and self-motivated engineers to lead the building of advanced, high-scale SDN management solutions. You will be part of a dynamic team, working with amazing people. This crucial role will give you a rare opportunity to craft and deliver a new class of Data Center NMS product line.
What you'll be doing:
The team develops infrastructure for monitoring and gathering telemetry from production environments, running on the world's largest supercomputers and data centers.
The work environment is dynamic and challenging; we are innovating and inventing software products at the forefront of technology in terms of performance, scalability, and features.
Our team works closely with other engineering teams to co-design new features and software APIs.

Requirements:
B.Sc. or equivalent experience in computer science / software engineering.
5 years of experience programming in Python and C/C++.
3 years of experience with the Linux environment and tools.
Deep knowledge of networking protocols: InfiniBand, Ethernet.
Expert knowledge in computer architecture and operating systems.
Experience in performance optimizations.
Ways to stand out from the crowd:
You have a positive attitude and work well with others.
Demonstrated use of creative ideas, providing solutions to challenging problems.
Knowledge in RDMA technology.

This position is open to all candidates.

8321918

27/08/2025

Senior HPC Performance Engineer

Confidential company

Location: Tel Aviv-Yafo and Yokne`am

Job Type: Full Time

We are leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.
Come work for the team that brought you NCCL, NVSHMEM, and GPUDirect. Our GPU communication libraries are crucial for scaling Deep Learning and HPC applications! We are looking for a motivated performance engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected by high-speed interconnects (e.g., NVLink, PCIe) within a node and by high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! This is an outstanding opportunity for someone with an HPC and performance background to advance the state of the art in this space. Are you ready to contribute to the development of innovative technologies and help realize our company's vision?
What you will be doing:
Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack
Evaluate proof-of-concepts, conduct trade-off analysis when multiple solutions are available
Triage and root-cause performance issues reported by our customers
Collect a lot of performance data; build tools and infrastructure to visualize and analyze the information
Collaborate with a very dynamic team across multiple time zones.

Requirements:
M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field, with relevant performance engineering and HPC experience
3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
Experience conducting performance benchmarking and triage on large-scale HPC clusters
Good understanding of computer system architecture, HW-SW interactions, and operating system principles (aka systems software fundamentals)
Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required
Ability to debug performance issues across the entire HW/SW stack. Proficient in a scripting language, preferably Python
Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)
Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and timezones
Ways to stand out from the crowd:
Practical experience with InfiniBand/Ethernet networks in areas like RDMA, topologies, and congestion control
Experience debugging network issues in large-scale deployments
Familiarity with CUDA programming and/or GPUs
Experience with Deep Learning frameworks such as PyTorch and TensorFlow.

This position is open to all candidates.

8321604
