MLOps Engineer - AI Infra Group

Posted: 3 days ago
Company: Confidential
Location: Tel Aviv-Yafo
Job Type: Full Time
We are on an expedition to find you, someone who is passionate about creating intuitive, out-of-this-world production-grade AI infrastructure. This group builds scalable, high-performance AI systems for internal users and external customers, designed to run seamlessly across cloud and on-premise environments using the latest hardware advancements.
Responsibilities:
Design, build, and maintain scalable Kubernetes-based infrastructure for ML workloads across on-premise and cloud environments
Architect hybrid infrastructure solutions enabling seamless model flow from on-premise training environments to cloud-based inference deployments
Implement model registry and artifact management strategies that support cross-environment synchronization, versioning, and governance
Design secure, efficient data and model transfer mechanisms between on-premise and cloud (networking, storage replication, caching strategies)
Implement and manage GPU scheduling, resource allocation, and cluster autoscaling for heterogeneous compute environments
Build and maintain CI/CD pipelines for ML systems, including model versioning, testing, and promotion across environments
Develop observability solutions (logging, monitoring, alerting) for ML infrastructure across hybrid deployments
Collaborate with ML Engineers to define infrastructure requirements and SLAs for training and serving workloads
Requirements:
5+ years of experience in infrastructure engineering, platform engineering, or DevOps, preferably supporting ML or data-intensive workloads
Experience designing and operating hybrid cloud architectures (on-premise + cloud) with focus on data/model synchronization
Familiarity with model registry solutions (MLflow or cloud-native registries) and artifact management at scale
Experience with GPU compute infrastructure, device plugins, and resource scheduling (e.g., NVIDIA GPU Operator)
Proficiency in IaC tools (Terraform) and GitOps practices (ArgoCD)
Experience with monitoring and observability stacks (Prometheus, Grafana, ELK)
Familiarity with ML workflows to understand workload characteristics and requirements
This position is open to all candidates.
 
Job ID: 8504251

Similar jobs you may be interested in
Posted: 3 days ago
Company: Confidential
Location: Tel Aviv-Yafo
Job Type: Full Time
We are on an expedition to find you, someone who is passionate about creating intuitive, out-of-this-world production-grade AI infrastructure. This group builds scalable, high-performance AI systems for internal users and external customers, designed to run seamlessly across cloud and on-premise environments using the latest hardware advancements.
Responsibilities:
Design and optimize LLM serving infrastructure using inference engines (vLLM, TensorRT-LLM, Triton Inference Server)
Implement and tune distributed inference strategies including tensor parallelism, pipeline parallelism, and multi-node serving
Develop and apply model compression techniques to optimize cost, latency, and memory footprint while maintaining model quality
Build self-service fine-tuning platforms that enable data scientists to run experiments (LoRA, QLoRA, full fine-tuning) in a standardized, reproducible, and governed manner
Optimize inference performance through batching strategies, KV-cache tuning, and speculative decoding
Develop reusable APIs, abstractions, and platform services for model deployment, scaling, and lifecycle management
Collaborate with AI researchers and product teams to productionize models and meet latency/throughput requirements
Evaluate and benchmark new model architectures, compression methods, and serving frameworks
Requirements:
5+ years of experience in software engineering or ML engineering with a significant focus on ML systems or backend infrastructure
Strong proficiency in Python and deep learning frameworks (PyTorch)
Hands-on experience with LLM inference engines (vLLM, TensorRT-LLM, Triton Inference Server)
Deep understanding of transformer architectures and LLM-specific optimizations (attention mechanisms, KV-cache, quantization techniques like GPTQ, AWQ, GGUF)
Experience with distributed training/fine-tuning frameworks (Ray, DeepSpeed, FSDP)
Ability to build developer-facing tools and platforms with clear APIs and documentation
Understanding of GPU performance profiling and optimization
Familiarity with LLM evaluation methodologies and benchmarking
This position is open to all candidates.
 
Job ID: 8504260

Posted: 3 days ago
Location: Tel Aviv-Yafo
Job Type: Full Time
We are on an expedition to find you, someone who is passionate about creating intuitive, out-of-this-world production-grade AI systems and ML pipelines, to join our AI group. You'll be responsible for designing, building, deploying, and maintaining these systems and pipelines. You'll work closely with data scientists to translate research into scalable solutions and manage model deployment in both cloud and on-prem GPU environments.
Responsibilities:
Design, build, and deploy production-grade ML models, AI agents, and end-to-end pipelines across cloud and on-prem GPU environments.
Maintain and optimize ML systems for performance, scalability and reliability, including model validation, inference speed, and resource efficiency.
Develop monitoring and observability tools such as alerts and performance metrics to ensure system stability in production.
Create and integrate APIs for ML services within microservice-based architectures.
Drive adoption of best practices for CI/CD, observability, and reproducibility in ML systems.
Requirements:
3+ years of experience delivering production-grade ML/AI systems
Strong Python skills and solid understanding of the ML lifecycle
Experience with GPU infrastructure, containerization (Docker) and cloud platforms
Familiarity with microservice architectures and API development
Hands-on experience with LLM pipelines and agent orchestration frameworks (LangGraph, LlamaIndex, etc.)
Knowledge of experiment tracking tools (Weights & Biases, MLflow, or similar)
Background in scalable ML infrastructure, distributed computing, and workflow orchestration frameworks (Ray, Kubeflow, Airflow)
Experience with multi-node training (advantage)
Collaborative mindset with startup-level ownership and pragmatism
This position is open to all candidates.
 
Job ID: 8504290

Posted: 3 days ago
Location: Tel Aviv-Yafo
Job Type: Full Time
It starts with you - an engineer driven to build resilient, automated infrastructure that enables teams to move fast with confidence. You care about operational excellence, developer experience, and reliability at scale. You'll architect and operate the compute and networking infrastructure that powers our AI platform - from CI/CD pipelines to Kubernetes clusters to observability systems - across cloud and on-prem environments.
If you want to build infrastructure that powers mission-critical AI systems at national scale, join the mission - this role is for you.
Responsibilities:
Architect and operate Kubernetes-based infrastructure across AWS and on-prem environments, ensuring high availability, security, and performance.
Design and maintain CI/CD pipelines for application and service deployments with automated testing, security scanning, and rollback capabilities.
Drive infrastructure-as-code practices for compute and networking - building reproducible, auditable, and version-controlled infrastructure.
Own reliability and incident response - establish SLOs, build alerting systems, lead incident resolution, and drive post-incident improvements.
Enable AI-native operations - support agentic deployment pipelines, self-healing infrastructure, and secure sandboxing for model experimentation.
Build and maintain observability systems - metrics, logging, tracing, and dashboards that provide visibility into system health.
Optimize infrastructure cost and performance - right-size resources, implement auto-scaling, and identify efficiency opportunities.
Collaborate with Engineering, Data Platform, Data Engineering, and Security teams to align infrastructure with platform needs.
Shape infrastructure characteristics that support data freshness, correctness, and low-latency pathways for AI training/inference, retrieval, and agentic workflows.
Contribute paved-road tooling - reusable CI/CD patterns for services, IaC modules for compute and networking, and runbooks - that streamline delivery across teams.
Collaborate with Engineering, Data Platform, Data Engineering, Security, Product, AI/ML, Data Science, and Analytics to anticipate and meet cross-functional needs.
Requirements:
6+ years in DevOps, SRE, or infrastructure engineering, with hands-on experience building and operating infrastructure at scale.
Container orchestration - Kubernetes (EKS, on-prem), Helm, service mesh technologies like Istio or Linkerd
Cloud & infrastructure - AWS services (EC2, EKS, S3, IAM, VPC, Lambda), hybrid cloud architectures, on-prem infrastructure
Infrastructure-as-Code - Terraform, Pulumi, or CloudFormation; GitOps practices with ArgoCD or Flux
CI/CD - GitHub Actions, GitLab CI, Jenkins, or similar; artifact management, deployment strategies (blue-green, canary)
Observability - Prometheus, Grafana, ELK/OpenSearch, Datadog, or similar; distributed tracing, log aggregation, alerting
Security & compliance - Secrets management (Vault, AWS Secrets Manager), network security, compliance automation
Scripting & automation - Python, Bash, Go; configuration management with Ansible or similar
This position is open to all candidates.
 
Job ID: 8504217

Posted: 11/12/2025
Location: Tel Aviv-Yafo
Job Type: Full Time
Wanted: Senior DevOps Engineer, Data Platform
The opportunity
Technical Leadership & Architecture: Drive data infrastructure strategy and establish standardized patterns for AI/ML workloads, with direct influence on architectural decisions across data and engineering teams
DataOps Excellence: Create a seamless developer experience through self-service capabilities while significantly improving data engineer productivity and pipeline reliability metrics
Cross-Functional Innovation: Lead collaboration between DevOps, Data Engineering, and ML Operations teams to unify our approach to infrastructure as code and orchestration platforms
Technology Breadth & Growth: Work across the full DataOps spectrum from pipeline orchestration to AI/ML infrastructure, with clear advancement opportunities as a senior infrastructure engineer
Strategic Business Impact: Build scalable analytics capabilities that provide a direct line of sight between your infrastructure work and business outcomes through reliable, cutting-edge data solutions
What you'll be doing
Design Data-Native Cloud Solutions - Design and implement scalable data infrastructure across multiple environments using Kubernetes, orchestration platforms, and IaC to power our AI, ML, and analytics ecosystem
Define DataOps Technical Strategy - Shape the technical vision and roadmap for our data infrastructure capabilities, aligning DevOps, Data Engineering, and ML teams around common patterns and practices
Accelerate Data Engineer Experience - Spearhead improvements to data pipeline deployment, monitoring tools, and self-service capabilities that empower data teams to deliver insights faster with higher reliability
Engineer Robust Data Platforms - Build and optimize infrastructure that supports diverse data workloads from real-time streaming to batch processing, ensuring performance and cost-effectiveness for critical analytics systems
Drive DataOps Excellence - Collaborate with engineering leaders across data teams, champion modern infrastructure practices, and mentor team members to elevate how we build, deploy, and operate data systems at scale.
Requirements:
3+ years of hands-on DevOps experience building, shipping, and operating production systems.
Coding proficiency in at least one language (e.g., Python or TypeScript); able to build production-grade automation and tools.
Cloud platforms: deep experience with AWS, GCP, or Azure (core services, networking, IAM).
Kubernetes: strong end-to-end understanding of Kubernetes as a system (routing/networking, scaling, security, observability, upgrades), with proven experience integrating data-centric components (e.g., Kafka, RDS, BigQuery, Aerospike).
Infrastructure as Code: design and implement infrastructure automation using tools such as Terraform, Pulumi, or CloudFormation (modular code, reusable patterns, pipeline integration).
GitOps & CI/CD: practical experience implementing pipelines and advanced delivery using tools such as Argo CD / Argo Rollouts, GitHub Actions, or similar.
Observability: metrics, logs, and traces; actionable alerting and SLOs using tools such as Prometheus, Grafana, ELK/EFK, OpenTelemetry, or similar.
You might also have
Data Pipeline Orchestration - Demonstrated success building and optimizing data pipeline deployment using modern tools (Airflow, Prefect, Kubernetes operators) and implementing GitOps practices for data workloads
Data Engineer Experience Focus - Track record of creating and improving self-service platforms, deployment tools, and monitoring solutions that measurably enhance data engineering team productivity
Data Infrastructure Deep Knowledge - Extensive experience designing infrastructure for data-intensive workloads including streaming platforms (Kafka, Kinesis), data processing frameworks (Spark, Flink), storage solutions, and comprehensive observability systems.
This position is open to all candidates.
 
Job ID: 8454296

Posted: 21/12/2025
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a highly skilled Senior Networking AI Platform Engineer to join our Applied Networking AI group. In this role, you will help design and develop cutting-edge AI solutions and integrate them seamlessly into a variety of products. You'll collaborate closely with multi-functional teams of data scientists, software engineers, and DevOps professionals to ensure the efficient deployment, monitoring, and optimization of machine learning (ML) models.
As a key contributor, you will drive the entire software development lifecycle, from conceptualization and architecture to implementation and production, while working closely with engineering teams to solve complex problems and help build a successful company practice.
What you'll be doing:
Lead the design, development, and deployment of robust software systems across different platforms and environments
Architect, design, and implement scalable and high-performance software solutions, handling complex requirements and integrating various subsystems
Ensure systems are maintainable, flexible, and well-documented, with an emphasis on performance and security
Adapt to new tools, technologies, and frameworks, and be capable of taking ownership of the development process from conception to deployment
Supply innovative ideas and solutions, driving continuous improvement in both code quality and system efficiency
Develop and maintain scalable infrastructure for handling and deploying security and networking ML models in production, ensuring high availability, scalability, and performance.
Design and implement data pipelines to efficiently process and transform large volumes of data for training and inference purposes.
Optimize and fine-tune ML models for performance, scalability, and resource utilization, considering factors such as latency, efficiency, and cost.
Collaborate with data scientists and software engineers to operationalize and deploy ML models, including model versioning, packaging, and integration with existing systems.
Requirements:
Bachelor's or master's degree in Computer Science, Data Science, or a closely related discipline.
Over 5 years of experience in software development and/or MLOps.
Strong proficiency in programming languages such as Python, Java, or C++.
Deep understanding of cloud services architecture and the ability to create real-world applications that include telemetry, authentication, authorization, and security standard methodologies.
Proven track record of leading complex software projects from concept to delivery.
A "can do" attitude with exceptional problem-solving skills and the ability to thrive in fast-paced environments.
Strong problem-solving skills and the ability to resolve sophisticated issues in a timely manner.
Excellent communication and collaboration skills, with the ability to work effectively in multi-functional teams.
Attention to detail and a focus on quality, ensuring robustness and reliability in production ML systems.
Experience with Kubernetes architecture and management is a plus.
Ways to stand out from the crowd:
Exude high energy and a positive attitude.
Stellar verbal and written communication skills.
Passionate about data science and implementation.
Have data science and GPU performance experience.
Want to make what was impossible possible!
This position is open to all candidates.
 
Job ID: 8465950

Posted: 3 days ago
Location: Tel Aviv-Yafo
Job Type: Full Time
It starts with you - a senior ML engineer responsible for building, training, evaluating, and operating machine learning systems in production. The role focuses on data pipelines, model training, experimentation, evaluation, and scalable deployment.
If you want to grow your skills building AI products for mission-critical AI, join the mission - this role is for you.
Responsibilities:
Design, train, and evaluate ML models for production use.
Build and maintain data pipelines for training, validation, and inference.
Own experimentation workflows: feature engineering, training runs, and comparison.
Implement model evals, monitoring, and drift detection.
Package and deploy models to production systems.
Optimize training and inference performance, cost, and reliability.
Collaborate with data, platform, and product teams.
Mentor engineers and promote ML engineering best practices.
Requirements:
4+ years of software engineering experience, with 2+ years of applied ML in production.
Strong foundations in machine learning, statistics, and data analysis.
Hands-on experience with model training frameworks (e.g., PyTorch, TensorFlow, JAX).
Experience with distributed training and large-scale datasets.
Experience building data pipelines, feature engineering, and dataset versioning.
Proven experience designing and operating ML evals, experiment tracking, and monitoring.
Familiarity with feature stores, model registries, and ML lifecycle management.
Experience with model serving patterns and production deployment.
Proficiency in Python and strong system design skills.
Experience deploying ML systems on Kubernetes or similar platforms.
Familiarity with GPU acceleration and performance optimization.
This position is open to all candidates.
 
Job ID: 8504212

Company: Confidential
Location: Tel Aviv-Yafo
Job Type: Full Time
We're hiring an ML Engineer to accelerate AI-driven innovation across Stampli's B2B SaaS platform.
You'll be at the forefront of building intelligent systems that power core product experiences and automate internal operations, driving efficiency, speed, and scale across the organization. This is a high-impact, hands-on role in a fast-growing, AI-first company where machine learning is a foundational pillar, not a bolt-on feature. You'll partner with product, engineering, and operations teams to design and implement powerful ML and LLM-based solutions that make a measurable difference.
What You Will Do:
Build Intelligent Systems: Design and develop ML/LLM-powered solutions that solve real-world challenges across Stampli's product and internal workflows.
Own Full Lifecycles: Take projects from concept all the way to production, including model training, evaluation, integration, and monitoring.
Leverage State-of-the-Art Tools: Work with leading frameworks like LangChain, Hugging Face, TensorFlow, and PyTorch to deliver cutting-edge functionality.
Collaborate Cross-Functionally: Partner with product managers, engineers, and stakeholders to embed AI capabilities into user-facing features and backend services.
Ship at Scale: Build and maintain scalable APIs and services, integrating best practices in CI/CD, observability, and cloud infrastructure.
Report with Impact: Share progress, challenges, and results clearly with technical and executive stakeholders.
Requirements:
6+ years of experience as a Backend Developer, Data Engineer, or ML Engineer
Bachelor's degree in Computer Science or a related STEM field
Strong proficiency in Python and ML tooling
Proven ability to build production-grade ML systems end-to-end
Deep experience with LLMs and ML frameworks (e.g., LangChain, LangGraph, Hugging Face, TensorFlow, PyTorch)
Solid foundation in system design, architecture, and microservice patterns
Excellent problem-solving skills and ownership mindset
Strong collaboration and communication abilities
Bonus if you have:
M.Sc. in Computer Science, Software Engineering, or similar field
Experience building and scaling LLM-powered applications
Familiarity with AWS and DevOps best practices (CI/CD, monitoring, IaC)
Exposure to NoSQL and real-time data processing pipelines
This position is open to all candidates.
 
Job ID: 8499639

Posted: 3 days ago
Company: Confidential
Location: Tel Aviv-Yafo
Job Type: Full Time
We are on an expedition to find you, someone who is passionate about creating intuitive, out-of-this-world ML services and customer deployments. In this role, you'll own reliability, velocity, and cost efficiency for products and customer deployments.
Responsibilities:
Design, develop, build and maintain the best solutions for our production platform.
Everything-as-code approach (IaC): run our infrastructure with a wide range of technologies, including Ansible, Terraform, and Kubernetes.
Work closely with our data scientists and developers to create training, inference and serving pipelines.
Build and maintain tools for automation, deployment, monitoring, and operations.
Troubleshoot issues in our development, production, and test environments
Requirements:
At least 4-5 years of experience in a DevOps or MLOps role.
Excellent communication skills and a people-oriented approach.
Experience designing, building, developing, and maintaining DevOps solutions.
Experience with one of the major cloud providers: AWS, GCP, Azure.
Experience working with cloud and on-prem environments and solutions.
Solid, expert-level Linux system skills (a must).
Extensive experience with applications and tooling, including Kubernetes, Helm, Terraform, Ansible, SQL/NoSQL/graph databases, MLflow, Jenkins, GitHub, etc.
Experience with CI/CD technologies.
Experience with bootstrapping projects, introducing new technologies, and building systems from scratch.
Good coding skills (Python, Bash, etc.).
Advantage: experience working on endpoint products (agents/sensors/collectors).
Advantage: experience working on AI components (training, inference, serving).
Tech stack:
AWS, Kubernetes, EKS, ECS, Jenkins, IaC, GitHub, Terraform, Python, Ansible, Docker + Docker Compose, ArgoCD, MongoDB, RabbitMQ, Redis, Go, Neo4j, AI, MLflow, ClickHouse, Jupyter, and more.
This position is open to all candidates.
 
Job ID: 8504090

Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a talented and motivated Software Engineer to join our newly formed team developing orchestration tools and platforms for AI datacenters.
The main goal of this team is to create customer-focused orchestration solutions that simplify the deployment, management, and optimization of large-scale AI workloads across a full datacenter stack - including switches, hosts, smart NICs, GPUs, ROCm, and RCCL.
You will work on the design and development of orchestration systems that bridge compute, networking, and AI acceleration domains, primarily using Python and modern full-stack technologies.
Key Responsibilities
* Design and develop software components for orchestration platforms managing AI datacenter infrastructure.
* Implement control and coordination mechanisms for compute, network, and AI acceleration resources.
* Develop backend services, APIs, and UI components using Python and modern full-stack frameworks.
* Collaborate with cross-functional teams - including networking, GPU, and system software - to integrate orchestration capabilities across multiple layers.
* Participate in architecture discussions, code reviews, and continuous integration processes.
* Contribute to testing, validation, and performance improvements of orchestration systems.
* Engage with product and customer teams to translate operational needs into effective software solutions.
Requirements:
3+ years of experience in software development, preferably in infrastructure, orchestration, or systems software.
Strong proficiency in Python, including experience with backend or orchestration frameworks.
Familiarity with datacenter or cloud infrastructure, including networking, compute, or storage systems.
Experience with containers and orchestration platforms (Docker, Kubernetes).
Solid understanding of software engineering principles, including design patterns, testing, and CI/CD.
Strong collaboration and communication skills, with the ability to work in a multidisciplinary environment.
Preferred Qualifications
Exposure to AI workloads and GPU ecosystems (ROCm, RCCL, PyTorch, TensorFlow).
Experience with distributed systems, control-plane software, or cluster management frameworks.
Familiarity with REST/gRPC APIs, microservices, and cloud-native architectures.
Background in monitoring, telemetry, or resource scheduling systems.
Practical experience in full-stack development (React, Angular, Node.js, or equivalent).
Experience with test automation frameworks (pytest, Robot Framework, etc.).
This position is open to all candidates.
 
Job ID: 8485588

Posted: 21/12/2025
Company: Confidential
Location: Tel Aviv-Yafo
Job Type: Full Time
Run:ai, now part of our company, has evolved AI infrastructure by merging GPU virtualization with Kubernetes-native capabilities. Our world-class AI platform allows organizations to improve productivity and efficiency for data scientists and machine learning engineers. With deep Kubernetes expertise and a focus on innovation, we are dedicated to developing groundbreaking technologies. We deliver the best user experience for our customers and provide detailed access to workload performance through rich metrics. These metrics help users optimize their AI workloads. We are looking for highly skilled DevOps engineers to join our Infrastructure Group and help shape the future of AI infrastructure.
What you'll be doing:
Take full end-to-end ownership of our cloud infrastructure, spanning development environments to production systems across various cloud platforms.
Compose, build, and develop the architecture of Run:ai cloud-native products for a variety of complex customer environments, including on-premise and cloud.
Identify and fix production issues while addressing performance challenges to ensure our systems operate flawlessly.
Collaborate closely with cross-functional groups to provide architectural and infrastructure input that shapes product direction and composition.
Partner with R&D, Customer Success, Professional Services, and Pre-sales teams.
Continuously evaluate and implement new tools and technologies to improve our release and product deployment processes.
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
At least 5 years of direct experience working in a large-scale software development setting as a DevOps Engineer.
Advanced knowledge of Kubernetes, supported by a minimum of 4 years of practical experience.
Strong knowledge of Linux, networking, storage, and security.
Extensive experience with cloud platforms such as AWS, GCP, Azure, or OCI (at least one is required).
Proven track record managing production environments, including monitoring and logging solutions.
Excellent Bash/Shell scripting skills, along with experience scripting in Python or Go.
Strong software engineering capabilities in backend systems and databases.
This position is open to all candidates.
 
Job ID: 8465214