Posted 5 hours ago
Location: Ra'anana
Job Type: Full Time and Hybrid work
Role: Principal Machine Learning Engineer, GenAI Benchmarking & Validation Infrastructure
The Principal Machine Learning Engineer GenAI is responsible for hands-on design, development, and operation of large-scale systems and tools for AI model benchmarking, optimization, and validation.
Unlike traditional ML Engineers focused mainly on training models, this role centers on building, running, and continuously improving the infrastructure, automation, and services that enable rigorous, repeatable, and production-grade model evaluation at scale.
This is a hands-on principal role that combines strategic technical leadership with active engineering execution.
You will own the architecture, implementation, and optimization of benchmarking and validation capabilities across our AI ecosystem. This includes architecting Validation-as-a-Service platforms, delivering high-performance benchmarking pipelines, integrating with leading GenAI frameworks, and setting industry standards for model evaluation quality and reproducibility.
The role demands deep GenAI domain expertise, architectural foresight, and direct coding involvement to ensure evaluation platforms are flexible, extensible, and optimized for real-world, large-scale use.
What you will do
Architect and lead scalable benchmarking pipelines for LLM performance measurement (latency, throughput, accuracy, cost) across multiple serving backends and hardware types.
Build optimization & profiling tools for inference performance, including GPU utilization, memory footprint, CUDA kernel efficiency, and parallelism strategies.
Develop Validation-as-a-Service platforms with APIs and self-service tools for standardized, on-demand model evaluation.
Integrate and optimize model serving frameworks (vLLM, TGI, LMDeploy, Triton) and API-based serving (OpenAI, Mistral, Anthropic) in production environments.
Establish dataset & scenario management workflows for reproducible, comprehensive evaluation coverage.
Implement observability & diagnostics systems (Prometheus, Grafana) for real-time benchmark and inference performance tracking.
Deploy and manage workloads in Kubernetes (Helm, Argo CD, Argo Workflows) across AWS/GCP GPU clusters.
Lead performance engineering efforts to identify bottlenecks, apply optimizations, and document best practices.
Stay ahead of the GenAI ecosystem by tracking emerging frameworks, benchmarks, and optimization techniques, and integrating them into the platform.
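To give a concrete flavor of the benchmarking pipelines described above, here is a minimal sketch that aggregates per-request latency samples into the summary metrics the role calls out (latency percentiles, throughput). All function and field names here are illustrative assumptions, not part of the posting or any specific framework.

```python
import statistics

def summarize_benchmark(latencies_s, total_tokens, wall_time_s):
    """Aggregate raw per-request latency samples (in seconds) into the
    summary metrics a benchmarking pipeline typically reports.
    Hypothetical helper for illustration only."""
    ordered = sorted(latencies_s)

    def pct(p):
        # Nearest-rank percentile over the sorted samples.
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

    return {
        "requests": len(ordered),
        "p50_latency_s": pct(50),
        "p95_latency_s": pct(95),
        "mean_latency_s": statistics.fmean(ordered),
        # Aggregate token throughput over the whole run.
        "throughput_tok_per_s": total_tokens / wall_time_s,
    }

# Example: 5 recorded request latencies, 1200 generated tokens in 10 s.
report = summarize_benchmark([0.8, 1.1, 0.9, 2.0, 1.2], 1200, 10.0)
```

In a real pipeline the latency samples would come from instrumented requests against a serving backend (vLLM, TGI, Triton, or a hosted API), and the report would be persisted for cross-run comparison.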
Requirements:
Advanced Python for ML/GenAI pipelines, backend development, and data processing.
Kubernetes (Deployments, Services, Ingress) with Helm for large-scale distributed workloads.
Deep expertise in LLM serving frameworks (vLLM, TGI, LMDeploy, Triton) and API-based serving (OpenAI, Mistral, Anthropic).
GPU optimization mastery: CUDA, mixed precision, tensor/sequence parallelism, memory optimization, kernel-level profiling.
Design and operation of benchmarking/evaluation pipelines with metrics for accuracy, latency, throughput, cost, and robustness.
Experience with Hugging Face Hub for model/dataset management and integration.
Familiarity with GenAI tools: OpenAI SDK, LangChain, LlamaIndex, Cursor, Copilot.
Argo CD and Argo Workflows for reproducible ML orchestration.
CI/CD (GitHub Actions, Jenkins) for ML workflows.
Cloud expertise (AWS/GCP) for provisioning, running, and optimizing GPU workloads (A100, H100, etc.).
Monitoring and observability (Prometheus, Grafana) and database experience (PostgreSQL, SQLAlchemy).
Nice to Have
Distributed training across multi-node, multi-GPU environments.
Advanced model evaluation: bias/fairness testing, robustness analysis, domain-specific benchmarks.
Experience with OpenShift/RHOAI for enterprise AI workloads.
Benchmarking frameworks: GuideLLM, HELM (Holistic Evaluation of Language Models), Eval Harness.
Security scanning for ML artifacts and containers (Trivy, Grype).
This position is open to all candidates.
 
Job ID: 8345088
Similar jobs that may interest you
Posted 5 hours ago
Location: Ra'anana
Job Type: Full Time and Hybrid work
The OpenShift AI team is looking for a Machine Learning Engineer with experience in building, scaling, and monitoring AI/ML systems to join our rapidly growing engineering team. Our focus is to create a platform, partner ecosystem, and community by which enterprise customers can solve problems to accelerate business success using AI. This is a very exciting opportunity to shape the deployment and reliability of GenAI workloads, contribute to the development of the RHOAI product, participate in open source communities, and be at the forefront of the exciting evolution of AI. You'll join an ecosystem that fosters continuous learning, career growth, and professional development.
As a core ML engineer for one of our OpenShift AI teams, you will have the opportunity to design and build systems that monitor, validate, and improve AI model performance in production. You will work as part of an evolving development team to rapidly design, secure, build, test, and release new capabilities. The role is primarily an individual contributor who collaborates closely with other ML engineers, software developers, and cross-functional teams. You should have a passion for observability, MLOps, and building robust systems for real-world AI.
Our commitment to open source innovation extends beyond our products: it's embedded in how we work and grow. We embrace change, especially in our fast-moving technological landscape, and have a strong growth mindset. That's why we encourage our teams to proactively, thoughtfully, and ethically use AI to simplify their workflows, cut complexity, and boost efficiency. This empowers our associates to focus on higher-impact work, creating smarter, more innovative solutions that solve our customers' most pressing challenges.
What you will do:
Design and build observability and assistance tools to help customers optimize large-scale AI initiatives running on Kubernetes
Innovate in the MLOps and AI observability and deployment optimization domains by contributing to upstream communities
Collaborate with product, engineering, and research teams to improve model trust and performance
Write unit and integration tests and work with quality engineers to ensure product quality
Use CI/CD best practices to deliver solutions into RHOAI as part of our productization efforts
Proactively utilize AI-assisted development tools (e.g., GitHub Copilot, Cursor, Claude Code) for code generation, auto-completion, and intelligent suggestions to accelerate development cycles and enhance code quality.
Contribute to a culture of continuous improvement by sharing technical knowledge and insights
Communicate effectively with stakeholders and team members to ensure visibility of ML performance
Represent RHOAI in external engagements including open source communities and customer meetings
Mentor and guide junior engineers and contribute to team growth
Explore and experiment with emerging AI technologies relevant to software development, proactively identifying opportunities to incorporate new AI capabilities.
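To make the "monitor, validate, and improve AI model performance in production" responsibility concrete, here is a minimal sketch of a sliding-window accuracy monitor of the kind an ML observability tool might build on. The class, parameter names, and threshold are hypothetical illustrations, not RHOAI APIs.

```python
from collections import deque

class AccuracyMonitor:
    """Sliding-window monitor for production model observability:
    tracks recent prediction correctness and flags degradation.
    Hypothetical sketch, not a real RHOAI component."""

    def __init__(self, window=100, threshold=0.9):
        # deque(maxlen=...) keeps only the most recent outcomes.
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, correct: bool):
        self.window.append(1 if correct else 0)

    def accuracy(self):
        # Accuracy over the current window, or None if no data yet.
        return sum(self.window) / len(self.window) if self.window else None

    def degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

monitor = AccuracyMonitor(window=10, threshold=0.8)
for outcome in [True] * 7 + [False] * 3:
    monitor.record(outcome)
```

In practice the windowed value would be exported as a gauge to a system like Prometheus and alerted on via Grafana, rather than checked in-process.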
Requirements:
Experience in machine learning engineering, with a focus on production-grade systems
Proficiency in Python with a focus on AI/ML infrastructure or tooling
Hands-on experience with source control tools such as Git
Passion for open-source technology and collaborative development
Strong troubleshooting skills and system-level thinking
Ability to work autonomously and thrive in a fast-paced environment
Excellent written and verbal communication skills
The following will be considered a plus:
Master's degree or higher in computer science, machine learning, or a related discipline
Experience working with Kubernetes, OpenShift, or other cloud-native platforms
Familiarity with ML observability tools (e.g., Prometheus, OpenTelemetry, and Grafana)
Contributions to open-source projects, especially in the MLOps or ML observability domain
Experience with public cloud services (AWS, GCP, Azure).
This position is open to all candidates.
 
Job ID: 8345125
Posted 5 hours ago
Location: Ra'anana
Job Type: Full Time and Hybrid work
The Ecosystems Engineering group is seeking a Senior Principal Software Engineer to join our rapidly growing team. This is a game-changing opportunity to join an open-source AI platform that harnesses the power of hybrid cloud to drive innovation. In this role, you will work with a diverse team of highly talented engineers on designing, implementing, and productizing new AI solutions, with a focus on deep integration of the AI stack, hardware accelerators, and leading OEMs and Cloud Computing Service Providers (CCSPs).
You'll play a critical role in shaping the next generation of hybrid cloud platforms by directly contributing to our innovative AI and Edge products. This is your chance to be at the forefront of AI's exciting evolution, joining an ecosystem that champions continuous learning, career growth, and professional development. You'll also collaborate closely with product management, other engineering teams, and key partners and lighthouse customers.
What You Will Do:
Architect and lead the implementation of new features and solutions for our AI and Edge products.
Explore deep code integration into various products, ensuring optimal integration between our portfolio, hardware accelerators, and partners.
Provide technical vision and leadership on critical and high-impact projects, ensuring non-functional requirements including security, resiliency, and maintainability are met.
Integrate software that leverages hardware accelerators (e.g., DPUs, GPUs, AIUs) and perform performance analysis and optimization of AI workloads with accelerators.
Work with major AI and hardware partners such as NVIDIA, AMD, Dell, and others on building joint integrations and products.
Collaborate closely with UX, UI, QE, and cross-functional teams to deliver a great experience to our partners and customers.
Coordinate with team leads, architects, and other engineers on the design and architecture of our offerings.
Take responsibility for the quality of our offerings, participate in peer code reviews and continuous integration (CI), and respond to security threats.
Mentor, influence, and coach a distributed team of engineers, contributing to a culture of continuous improvement by sharing recommendations and technical knowledge.
Requirements:
10+ years of relevant technical experience in software development.
Advanced experience working in a Linux environment with at least one language like Golang, Rust, Java, C, or C++.
Advanced experience with a container orchestration ecosystem like Kubernetes or OpenShift.
Strong experience with microservices architectures and concepts including APIs, versioning, monitoring, etc.
Experience with AI/ML technologies, including foundational frameworks, large language models (LLMs), Retrieval Augmented Generation (RAG) paradigms, vector databases, and LLM orchestration tools.
Ability to quickly learn and guide others on using new tools and technologies.
Proven ability to innovate and a passion for staying at the forefront of technology.
Excellent system understanding and troubleshooting capabilities.
Autonomous work ethic, thriving in a dynamic, fast-paced environment.
Technical leadership acumen in a global team environment.
Proficient written and verbal communication skills in English.
The following will be considered a plus:
Experience with cloud development for public cloud services (AWS, GCE, Azure).
Familiarity with virtualization, networking, or storage.
Background in DevOps or site reliability engineering (SRE).
Experience with hardware accelerators (e.g., GPUs, FPGAs) for AI workloads.
Recent hands-on experience with distributed computation, either at the end-user or infrastructure provider level.
Experience with performance analysis tools.
This position is open to all candidates.
 
Job ID: 8345118