Senior HPC AI Cluster Engineer

Posted 1 day ago
Confidential company
Location: More than one
Job Type: Full Time
We are looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are focused on building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding senior HPC architect to be a key player in bringing up the most exciting computing hardware and software and to contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop and bring up large-scale performance platforms.
What you will be doing:
Design, implement and maintain large scale HPC/AI clusters with monitoring, logging and alerting
Manage Linux job/workload schedulers and orchestration tools
Develop and maintain continuous integration and delivery pipelines
Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
Deploy monitoring solutions for the servers, network and storage
Perform troubleshooting bottom up from bare metal, operating system, software stack and application level
Being a technical resource, develop, re-define and document standard methodologies to share with internal teams
Support Research & Development activities and engage in POCs/POVs for future improvements.
Requirements:
A degree in Computer Science, Engineering, or a related field
5+ years of experience
Knowledge of HPC and AI solution technologies from CPUs and GPUs to high speed interconnects and supporting software
Experience with job scheduling and workload orchestration tools such as Slurm and K8s
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols such as TCP, DHCP, and DNS.
Experience with multiple storage solutions such as Lustre, GPFS, ZFS and XFS. Familiarity with newer and emerging storage technologies.
Python programming and bash scripting experience.
Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef
Deep knowledge of Networking Protocols like InfiniBand, Ethernet
Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)
Ways to stand out from the crowd:
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)
Knowledge of CPU and/or GPU architecture
Knowledge of Kubernetes, container related microservice technologies
Experience with GPU-focused hardware/software (DGX, CUDA)
Background with RDMA (InfiniBand or RoCE) fabrics.
This position is open to all candidates.
 
Similar jobs that may interest you
Location: Tel Aviv-Yafo
Job Type: Full Time
Site Reliability Engineer - Infra
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer - Infra on our Infrastructure team at the TLV office, you will play a key role in ensuring the reliability, scalability, and performance of our critical systems. You will be responsible for managing and improving our core infrastructure, with a focus on automation, monitoring, and incident response. You will work with a wide range of technologies, including Kubernetes, monitoring and observability tools, configuration management systems, and core networking services.
How you'll make an impact:
As a Site Reliability Engineer, you will:
Ensure the reliability, availability, and performance of our infrastructure services.
Manage and maintain our Kubernetes infrastructure, including KubeVirt.
Design, implement, and maintain our monitoring and observability stack (SensuGo, VictoriaMetrics, Prometheus, ELK).
Automate infrastructure provisioning, configuration, and deployment processes using Puppet and Ansible.
Manage and maintain core services such as DNS and networking.
Troubleshoot and resolve complex infrastructure issues in a timely and efficient manner.
Participate in on-call rotations and incident response.
Develop and maintain infrastructure-as-code (IaC).
Identify and implement proactive measures to prevent incidents and improve system reliability.
Collaborate with development teams to ensure smooth and reliable deployments.
Contribute to the design and implementation of new infrastructure solutions.
Drive improvements in system architecture, processes, and tools.
Mentor and coach other team members.
Requirements:
5+ years of experience in a Site Reliability Engineering, Systems Engineering, or similar role.
Deep understanding of Site Reliability Engineering principles and practices.
Extensive experience with Kubernetes, including deployment, management, and troubleshooting.
Strong experience with monitoring and observability tools such as SensuGo, Zabbix, VictoriaMetrics, Prometheus, and ELK.
Proficiency in configuration management tools such as Puppet and Ansible.
Solid understanding of Linux internals and networking.
Experience with managing and maintaining core services such as DNS and networking.
Strong programming skills in Python and/or Go.
Experience with both on-premises and cloud environments.
Experience with KubeVirt.
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
Ability to participate in on-call rotations including weekends.
Preferred Qualifications:
Experience with large-scale, distributed systems.
Experience with other cloud providers (e.g., AWS, Azure, GCP).
Contributions to open-source projects.
This position is open to all candidates.
 
Posted 5 minutes ago
Location: Ra'anana
Job Type: Full Time
We are seeking a hands-on Software Manager to lead an engineering team developing next-generation, cloud-native infrastructure based on Kubernetes for AI and HPC workloads. You'll manage a high-impact team building scalable systems powered by DPUs and our company's advanced networking technologies. In this role, you'll collaborate closely with architecture and marketing teams to shape system design, align technical direction with business goals, and ensure strong execution. This is a unique opportunity to lead a top-tier team delivering infrastructure at scale.
What you'll be doing:
Lead and coordinate a team building K8s-based infrastructure for AI and HPC.
Oversee feature delivery with active involvement in design, development, and debugging.
Guide the team to build scalable, high-performance systems using our company's compute and networking technologies.
Collaborate with architecture and marketing to align strategy and influence design.
Drive recruitment, mentorship, and foster a culture of innovation and excellence.
Requirements:
Bachelor's degree in Computer Science or equivalent experience.
8+ years of overall software development experience, including 2+ years in leadership roles managing teams.
Deep hands-on expertise with K8s and the cloud-native ecosystem.
Strong proficiency in Go and Python programming languages.
Experience designing and operating large-scale distributed systems.
Proven ability to work effectively in remote and cross-functional teams.
Ways to stand out from the crowd:
Background in AI, HPC, or cloud infrastructure.
Familiarity with our company's hardware, including DPUs, BlueField, and ConnectX.
Active involvement in open source projects or communities.
This position is open to all candidates.
 
Location: Ra'anana
Job Type: Full Time
We are a leader in disaggregated high-scale networking solutions for service providers and AI infrastructures. Founded in December 2015, our company created a radical new way to build networks by adapting the architectural model of the cloud to telco-grade networking. This solution accelerates network deployment, improves the network's economic model, and radically simplifies network operations. Our customers include Comcast, Orange, and KDDI, and over 80% of AT&T's network traffic now runs through a disaggregated core powered by our company's software. Our company's Network Cloud-AI solution, based on the same technology, was introduced to the market in 2023, providing the highest-performance Ethernet-based AI networking solution, and is already deployed by hyperscalers, NeoClouds and enterprises. Having raised over $587 million in three funding rounds, our company continues to deploy the most innovative network infrastructure and is looking for the most talented people to be part of this journey.
The Role
We are seeking a highly motivated and skilled Software Engineer to join our Hardware Software team. In this role, you will be responsible for the development and integration of Board Support Packages (BSP) and low-level firmware components for our carrier-grade networking solutions: carrier-grade routers and switches designed for service provider or data center networks. The systems integrate ASICs and high-throughput backplanes supporting multi-terabit line rates. You will work closely with hardware, platform, and system architects to bring up new hardware platforms and support advanced network functionalities in high-performance environments.
Key Responsibilities:
Develop, integrate, and maintain BSP components, including bootloaders (e.g., U-Boot), device trees, and hardware abstraction layers.
Design and implement firmware and low-level drivers for network-centric hardware platforms (e.g., ASICs, NICs, SoCs, CPLDs, FPGAs).
Support hardware bring-up and board validation, collaborating with hardware engineers and system integrators.
Work on performance optimization, debugging, and stability improvements of system software on embedded Linux platforms.
Interface with third-party SDKs and adapt them to fit within our company's software infrastructure.
Ensure compliance with industry standards and best practices for networking and embedded systems.
Requirements:
BSc or MSc in Computer Science, Electrical Engineering, or related technical field.
8+ years of experience in embedded software development, preferably in the networking or telecommunications industry.
Proficiency in C/C++ for low-level system development.
Strong experience with embedded Linux, bootloaders, kernel configuration, and driver development.
Familiarity with SoC architectures (e.g., ARM, MIPS) and board bring-up procedures.
Hands-on experience with hardware debugging tools (oscilloscopes, JTAG, logic analyzers).
Knowledge of networking protocols and hardware (Ethernet, switching/routing, PHYs) is a strong plus.
Experience with Broadcom SDKs, ONIE, or network operating systems (NOS) is an advantage.
Nice to Have:
Background in data center or service provider environments.
Exposure to high-end router or switch platforms.
Why us?
Work on cutting-edge cloud-native networking solutions that scale to the world's largest networks.
Be part of a fast-paced, innovative team that's transforming the telecom and hyperscale networking space.
Great growth opportunities in a global, technology-driven company.
This position is open to all candidates.
 
Posted 5 days ago
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a talented, highly motivated, experienced SW engineer to join one of our growing, inspiring development teams.
You will work on a multi-tenant, high-scale, distributed SaaS ecosystem on top of a K8s platform, which is used for managing the cloud security services infrastructure, customers' self-service configuration, monitoring and reporting, analytics and more.

As a SW engineer, you will manage and work with the different Engineering teams and architects in order to design, develop, monitor, scale and optimize the large-scale architecture of a winning SaaS security service.

What will you do?

Implement our next-generation back-end infrastructure to help us scale our SaaS-based infrastructure.

Be part of a team building tools to make our infrastructure scalable and robust.

Leverage Generative AI tools for code generation, optimization, debugging, documentation, and prototyping.

Continuously research and integrate new AI-driven developer productivity tools.

Design and develop an always-available Cloud-based SaaS platform in AWS

Lead and design the development of robust CI/CD pipelines for containerized applications running on Kubernetes.

Design and build strong Application and System monitoring and automated self-healing procedures.

Maintain and support application deployments, building new systems and upgrading existing ones.

Working closely with all the Engineering and DevOps teams, taking full responsibility and ownership from conception to post-deployment in a collaborative, fast-paced environment.
Requirements:
6+ years of experience in infrastructure and Backend SW development roles.

Experience managing infrastructure on AWS.

Experience with architecture methodologies and paradigms like micro-services, distributed systems and more.

Experience integrating and actively using GenAI tools (e.g., GitHub Copilot, Claude, ChatGPT, etc.) in daily development is a must.

Open-minded to new workflows and AI-driven innovation.

An agile/DevOps way of thinking.

Experience with CI/CD tools (Jenkins, Argo, Nexus and similar).

Experience with the K8S platform and tools (Helm charts and similar).

Experience with the following technologies/tools/fields: Elasticsearch, ClickHouse, messaging (Kafka, NATS, Redis, etc.), monitoring and visibility (Prometheus, Grafana, Loki, etc.).

Programming languages: Golang/Java.

Functioning well under pressure.

Strong problem-solving ability and a "Can-do approach".

Working in an agile environment.

Excellent communication and interpersonal skills.
This position is open to all candidates.
 
Posted 22 hours ago
Location: Tel Aviv-Yafo and Yokne'am
Job Type: Full Time
We are at the forefront of AI-driven innovation in VLSI design automation. Join us to shape the future of semiconductor design with cutting-edge AI tools and make a significant impact in a collaborative, high-performance environment. Are you ready to push the boundaries of what's possible in VLSI CAD? Come be part of our pioneering team!
What you'll be doing:
You will be responsible for developing and integrating advanced CAD solutions and automation flows using AI and machine learning for VLSI design, verification, and implementation.
Work closely with design, verification, and CAD teams to identify areas for improving VLSI workflows using advanced tools and methods.
Research, prototype, and deploy AI-based algorithms.
Develop and maintain scripts and automation infrastructure to enable seamless adoption of AI tools in the VLSI design process.
Continuously review emerging AI technologies and methodologies to keep our CAD environment up-to-date.
Provide technical support and training to engineering teams on AI-enabled CAD flows and best practices.
Requirements:
B.Sc./M.Sc. in Electrical Engineering, Computer Engineering, Computer Science, or equivalent experience.
5+ years of experience in VLSI CAD tool development, with a strong focus on integrating AI/ML techniques into EDA workflows.
Proficiency in Python and at least one AI/ML framework (such as TensorFlow, PyTorch, or scikit-learn).
Hands-on experience with VLSI physical design and familiarity with industry-standard EDA tools (e.g., Synopsys, Cadence).
Knowledge of data preprocessing, feature engineering, and model deployment as applied to VLSI design challenges.
Experience developing and maintaining automation scripts (Python, Perl, Tcl, Make).
Strong analytical skills in evaluating the impact of AI solutions on design quality, performance, and productivity.
Excellent communication skills and the ability to work cross-functionally in a fast-paced environment.
Self-motivation, attention to detail, and a track record of delivering robust solutions to production.
Ways to stand out from the crowd:
Demonstrated experience deploying AI/ML models in production VLSI CAD environments.
Contributions to open-source AI/EDA projects or publications in relevant conferences/journals.
Deep understanding of VLSI design challenges (such as timing closure, power optimization, or yield enhancement) and how AI can address them.
Experience with cloud-based or distributed compute environments for large-scale AI training and inference.
Strong ownership, curiosity, and a passion for continuous learning and innovation.
This position is open to all candidates.
 
Location: Ra'anana
Job Type: Full Time
We are a leader in cloud-native networking software for hyperscalers and service providers who are building the largest infrastructures in the world for network services and AI platforms. Founded in December 2015, our company disrupted some of the most challenging high-scale markets, transforming the way networks are built, scaled, and consumed. We also built the largest network in the world, with more than half of AT&T's backbone running on our company's Network Cloud. Our company has raised $587 million in three funding rounds, which enables us to dream big and bring on the most talented people.
The Role
We are looking for a Full Stack Software Engineer to join our Network Orchestration group. Our group is responsible for developing scalable, high-performance distributed systems that support complex network infrastructures.
In this role, you will be working on both backend and frontend development, contributing to the design, implementation, and optimization of our software solutions. You will work end to end on features, from design and development to deployment and monitoring in production.
You will collaborate closely with team members, QA engineers, Product Managers, Project Managers, and UI/UX Designers to deliver high-quality, production-ready applications.
We embrace an agile mindset: you should be comfortable with context switching, handling multiple priorities, and adapting quickly to changing requirements.
Responsibilities
Develop and maintain backend services using Node.js (TypeScript, NestJS/Express) and frontend applications with Angular.
Work end to end on features, from design and development to deployment and monitoring in production.
Write clean, maintainable, and well-tested code, following best practices.
Work with relational and NoSQL databases, designing efficient data models and queries.
Collaborate with QA engineers, Product Managers, Project Managers, and UI/UX Designers to deliver high-quality features.
Participate in code reviews, providing constructive feedback and ensuring best practices.
Troubleshoot and debug application issues, ensuring smooth functionality in production.
Work with distributed systems, understanding their challenges and ensuring scalability and reliability.
Stay updated with modern development trends, frameworks, and best practices.
Requirements:
Full Stack Development: Strong hands-on experience with Node.js (TypeScript, NestJS/Express) for backend and Angular for frontend.
Technical Expertise: 3+ years of experience developing production-grade applications.
Backend Development: Knowledge of REST APIs, microservices, and working with structured data.
Frontend Development: Experience with Angular, TypeScript, RxJS, and understanding of component-based architecture.
Databases: Experience working with SQL (PostgreSQL, CockroachDB) and NoSQL (MongoDB, Redis).
Distributed Systems: Understanding of scalable architectures, with some experience working in distributed environments.
Testing & Debugging: Experience writing unit tests and automation tests, and debugging applications.
Agile & Adaptability: Ability to work in a fast-paced, dynamic environment, handling multiple priorities and adapting to changes effectively.
Collaboration & Communication: Ability to work closely with QA, Product, Project Managers, and UI/UX Designers to deliver high-quality features.
Nice to Have
Experience with Cloud providers (AWS, GCP, Azure).
Understanding of Kubernetes (k8s) and cloud-native deployments.
Experience with React.
Familiarity with Nx platform for monorepo management.
Knowledge of advanced security concepts such as TLS, encryption, and authentication mechanisms.
Experience working with event-driven architectures and messaging systems (RabbitMQ, Kafka, etc.).
Knowledge of gNMI, Netconf, gRPC, or other network management protocols.
Background in real-time telemetry and network monitoring.
Education And/or Relevant Experience.
This position is open to all candidates.
 
Posted 15 minutes ago
Location: Ra'anana
Job Type: Full Time
We are looking for an excellent Software Engineer for the Switch SDK Group. You will join the SDK group and take our product to the next level, working closely with various other design and architecture teams and gaining a deep understanding of our company's products and technologies. Our company has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people.
Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As a worker, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
What you'll be doing:
Design, develop, optimize and maintain APIs, tools and libraries for Switching, Routing, Analytics, Telemetry and many other modules
Collaborate with team members, Architects, QA teams, and customers (both external and internal)
Innovate & rapidly develop POC prototypes that can then be developed into full-fledged products/solutions.
Requirements:
B.Sc. in Software Engineering, Computer Science, or a related field; equivalent work experience will be considered as well
10+ years of experience as a Software Engineer, including experience with C programming
Experience with embedded / real-time (RT) embedded systems
Excellent C programming skills, with a keen eye for performance and writing optimized code
Strong analytical skills, deep knowledge of algorithms and proficiency with data structures
Excellent communication and documentation skills
Ways to stand out from the crowd:
Previous experience with Ethernet Switching or Routing protocols
Hands on Linux development, user-space and/or kernel-space.
This position is open to all candidates.
 
24/07/2025
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a Senior Platform Engineer, Observability to join our Observability team. This role offers the opportunity to work at the intersection of software development and platform engineering, contributing to the tools, systems, and practices that improve visibility, reliability, and operational excellence across our engineering organization.

This position is ideally suited for experienced software engineers who are passionate about building high-quality systems and are interested in expanding their expertise in observability, distributed systems, and developer experience. You will help design, build and maintain systems that empower engineers across the organization to monitor, understand, and troubleshoot their services more effectively.

Our observability team is responsible for delivering scalable and user-friendly solutions to over 150 engineers working across more than 20 teams. We're focused on enabling rapid incident detection and resolution, improving our reliability posture, and supporting a culture of continuous improvement.

What you'll be doing:
Design, build, and maintain observability tools and infrastructure that help our engineers provide actionable insights into the performance and reliability of our systems.
Collaborate with other engineers and teams to enhance the developer experience around monitoring, logging, alerting, and tracing.
Develop and evolve our internal tooling to simplify the process of instrumenting and observing services.
Partner with engineering teams to improve incident response and recovery workflows, and ensure systems meet internal SLOs/SLAs and reliability targets.
Support the migration from our legacy ELK stack to a modern observability platform using Prometheus, Mimir, Grafana, Honeycomb, Loki, Quickwit, and OpenTelemetry.
Contribute to knowledge sharing and the ongoing development of best practices in observability across the organisation.
Requirements:
What you'll need:
4+ years of professional experience as a software engineer, with a strong foundation in building and maintaining production systems.
Proficiency in one or more modern programming languages such as Python, Java, JavaScript, or Ruby.
Familiarity with Kubernetes, AWS, and infrastructure-as-code tools such as Terraform.
Experience working with observability tools and platforms (e.g. Prometheus, Grafana, ELK, Honeycomb, Loki, or similar).
A strong interest in developer experience and platform tooling, with the ability to empathise with engineering teams as internal customers.
Excellent communication skills, with the ability to collaborate effectively across teams and explain complex technical concepts clearly.
A proactive mindset focused on long-term impact, sustainable engineering practices, and continuous improvement.

Preferred Qualifications:
Experience with OpenTelemetry or distributed tracing systems.
Understanding of observability-driven development and service reliability principles (e.g. SRE, MTTR, SLIs/SLOs).
Experience optimising observability systems for cost and performance at scale.
Knowledge of microservices architectures and how to monitor and debug distributed systems.
Contributions to open-source projects in the observability or monitoring space
This position is open to all candidates.
 
Location: Tel Aviv-Yafo
Job Type: Full Time
We're looking for an experienced MLOps / DevOps Engineer to design and manage the infrastructure powering large-scale machine learning systems. You'll be responsible for deploying GPU-heavy models (including LLMs) on cost-efficient, production-grade infrastructure, supporting both ML workflows and application artifact delivery.

You'll work with cutting-edge technologies like vLLM, Triton, SageMaker, ClearML, Karpenter, KEDA, and EKS, ensuring the right balance between performance, scalability, and cost.



What You'll Do

Deploy and manage LLMs and deep learning models using vLLM, Triton Inference Server, and custom API endpoints.

Build and maintain GPU-aware autoscaling clusters using AWS EKS, Karpenter, and KEDA, optimizing for cost-efficiency and performance.

Develop CI/CD pipelines using Jenkins and GitHub Actions to automate ML model delivery and application deployments.

Orchestrate training, fine-tuning, and inference jobs on AWS SageMaker and ClearML, with support for experiment tracking, versioning, and reproducibility.

Support backend teams in deploying app artifacts and runtime environments; implement rollback and release strategies.

Integrate observability tooling (e.g., Prometheus, Grafana, ELK, or OpenTelemetry) for both infrastructure and model performance.

Collaborate with SREs to enforce high availability, disaster recovery, and incident response procedures for mission-critical AI services.
Requirements:
6+ years of experience in DevOps, MLOps, or infrastructure roles with a focus on ML model delivery.

Proven hands-on experience deploying GPU-based models (LLMs, vision, transformers) using vLLM or Triton.

Deep knowledge of AWS EKS and Kubernetes, with practical experience configuring Karpenter and KEDA for auto-scaling GPU workloads.

Experience building pipelines using Jenkins, GitHub Actions, and managing releases for ML and application codebases.

Familiarity with AWS SageMaker, ClearML, or similar platforms for ML orchestration and experimentation.

Strong scripting and automation skills in Python, Bash, and working knowledge of containerization (Docker).

Solid grasp of networking, IAM, and cloud security fundamentals.
This position is open to all candidates.
 
Confidential company
Location: Ra'anana
Job Type: Full Time
You will participate in implementing a new strategic platform offering Generative AI (Gen AI) capabilities such as inference, fine-tuning, data curation, data labeling, and more.

We are seeking a highly skilled Gen AI software engineer to design, develop, and deploy state-of-the-art, scalable, efficient Gen AI solutions. You will be part of a high-performing engineering team that values collaboration, innovation, and teamwork. This role is perfect for someone who thrives in an environment where ideas are shared, challenges are met together, and success is a collective achievement.

Responsibilities:
Design and implement Gen AI platform components using Python and modern microservices.
Drive engineering and operational excellence and best practices.
Collaborate with tech, product, program, and business peers to deliver customer value and drive product vision.
Drive the end-to-end development of AI systems, from research and experimentation to deployment and scaling
Mentor and support engineers, fostering a culture of collaboration, knowledge-sharing, and continuous learning
Analyze complex problems and identify and define the requirements to solve them.
Apply DevOps principles to maintain high-performance and scalable AI systems.
Design and implement secure APIs, ensuring data protection and safe interaction with AI models.
Stay updated on the latest Gen AI trends and technologies, bringing fresh ideas to the team
Requirements:
3+ years of experience in applying AI to practical and comprehensive technology solutions.
Strong understanding of Generative AI principles, including Large Language Models (LLMs), RAG techniques, fine-tuning, inference, and more (a big advantage).
Proficiency in Python and experience with Gen AI-related Python libraries and modules.
Experience contributing to the architecture and design (architecture, design patterns, reliability, and scale) of new and current systems.
Excellent problem-solving and analytical skills.
Expertise with vector databases (e.g., OpenSearch, Pinecone) for data storage and retrieval (an advantage).
Excellent interpersonal skills, team player with strong communication skills.
Academic degree in Computer Science or equivalent.

Certifications:
AWS Machine Learning Specialty (Strongly Preferred).
AWS Solutions Architect - Associate (Strongly Preferred).
AWS Solutions Architect - Professional (Preferred).
This position is open to all candidates.
 