Posted 13 hours ago
Location: Jerusalem
Job Type: Full Time
Wanted: Large Scale Training Engineer - LTX Model
About the Role
As a Large Scale Training Engineer, you will play a key role in improving the training throughput of our internal framework and enabling researchers to pioneer new model concepts. The role demands excellent engineering skills for designing, implementing, and optimizing cutting-edge AI models, along with the ability to write robust machine learning code and a deep understanding of supercomputer performance. Your expertise in performance optimization, distributed systems, and bug elimination will be crucial, as our framework supports extensive computations across numerous virtual machines.
This role is designed for individuals who are not only technically proficient but also deeply passionate about pushing the boundaries of AI and machine learning through innovative engineering and collaborative research.
Key Responsibilities
Profile and optimize the training process to ensure efficiency and effectiveness, including optimizing multimodal data pipelines and data storage methods.
Develop high-performance TPU/GPU/CPU kernels and integrate advanced techniques into our training framework to maximize hardware efficiency.
Utilize knowledge of hardware features to make aggressive optimizations and advise on hardware/software co-designs.
Collaboratively develop model architectures with researchers that facilitate efficient training and inference.
Design, maintain, and evolve a high-quality, shared codebase that emphasizes correctness, readability, extensibility, testing, and long-term maintainability, while balancing performance requirements.
Requirements:
Industry experience with small to large-scale ML experiments and multi-modal ML pipelines.
Strong software engineering skills, proficient in Python, and experienced with modern C++.
Deep understanding of GPU, CPU, TPU, or other AI accelerator architectures.
Enjoy diving deep into system implementations to improve performance without compromising code quality and maintainability.
Passion for running large-scale ML training workloads efficiently and optimizing compute kernels.
You are encouraged to apply if you meet 3 out of the 5 core qualifications above and are motivated to grow in the remaining areas.
Nice to have
Background in JAX/Pallas, Triton, CUDA, OpenCL, or similar technologies.
Familiarity with Kubernetes-based environments for running and scaling large-scale workloads.
This position is open to all candidates.
 
8542181
Similar jobs that may interest you
Location: Jerusalem
Job Type: Full Time
We are looking for a Senior DevOps Engineer - ML Platform.
The goal of AI Engineering's ML-Platform team is to deliver modern infrastructure and solutions that enhance the algorithm development life cycle and shorten our delivery times. We are an independent group of excellent, experienced engineers with diverse skills in algorithms, software, and infrastructure. We strive to implement a DevOps culture that allows our engineers to collaborate easily on large-scale products.
We develop cross-company products that enable the research and deployment of state-of-the-art algorithms.
What will your job look like?
Build and maintain infrastructure for large‑scale AI and HPC workloads across on‑prem and cloud environments
Operate and enhance our multi‑cloud, multi‑cluster scheduling platform
Develop automation, tooling, and platform services und Bash
Troubleshoot complex issues across the stack: compute, networking, storage, orchestration, and distributed systems
Improve reliability of critical systems
Collaborate with ML, data, and backend teams to support evolving platform needs
Drive best practices in CI/CD, infrastructure-as-code, and system design
Participate in on‑call rotations for critical infrastructure components
Requirements:
10+ years of hands‑on experience in DevOps, SRE, systems engineering, or similar roles
Linux knowledge, including debugging, performance tuning, and system internals
Proven experience working with HPC environments, large clusters, or high‑performance compute systems
Solid experience with Kubernetes (EKS or similar managed K8s services)
Knowledge of infrastructure‑as‑code tools (Terraform, Helm, etc.)
Hands‑on experience with:
PostgreSQL or similar relational databases
Elasticsearch or similar search/indexing systems
Prometheus/Thanos/Grafana or similar observability stacks
RabbitMQ or similar messaging systems
Strong proficiency in Bash, networking fundamentals, and debugging distributed systems.
Experience investigating complex issues across compute, storage, networking, and orchestration layers
Advantages:
Experience with multi‑cloud architectures
Experience with workflow orchestration tools such as Argo Workflows (or similar systems like Airflow, Prefect, Flyte)
Familiarity with GPU scheduling, AI/ML pipelines, or data‑intensive workloads
Background in large‑scale distributed systems or platform engineering
Ability to write production‑quality Go (Golang) code
This position is open to all candidates.
 
8513772
Confidential company
Location: Jerusalem
Job Type: Full Time
We are looking for a Machine Learning Software Engineer who will bridge the gap between cutting-edge machine learning research and robust production deployment.
In this position, you will combine software engineering expertise with machine learning deployment knowledge, and you will be responsible for taking our algorithms and developing robust production solutions that serve them at scale.
Work in the algorithms department is fast-paced. It requires staying ahead of the curve on the latest engineering solutions and best practices adopted across the ML community, keeping informed about emerging solutions in both the computer vision and NLP domains, and understanding the specific problems they address.
What will your job look like:
Your role will include developing production deployment systems for classical and machine learning algorithms from research and building robust, scalable inference pipelines.
You will develop primarily in Python and infrastructure tools (Kubernetes, Docker, etc.), taking part in both maintaining existing deployment systems and developing new production capabilities.
Finally, you will need to learn and implement new deployment technologies and best practices that can address emerging production challenges as they arise, while staying current with the latest MLOps and inference optimization techniques.
Requirements:
B.Sc. in Computer Science, Software Engineering, or related technical field.
2+ years of experience in production software development, preferably in ML deployment.
Strong problem-solving skills and ability to tackle complex, real-world production challenges.
Proficiency in Python and experience with containerization and orchestration technologies (Docker, Kubernetes): an advantage.
Hands-on experience with model serving frameworks and inference optimization: an advantage.
Background in distributed systems and cloud infrastructure: an advantage.
This position is open to all candidates.
 
8515942