As a Large Scale Video Understanding Research Scientist, you will play a key role in improving video generation quality and efficiency by strengthening the video and audio understanding pipelines used for both training data construction and model evaluation. This role demands hands-on work with large-scale Video Language Models (VLLMs), including fine-tuning, post-training, and control, alongside implementing classic computer vision and signal processing algorithms and applying strong research skills. Your expertise in post-training and controlling large-scale foundation models, statistics, complex system implementation, and debugging will be crucial, as our video training sets consist of petabytes of data processed across hundreds to thousands of virtual machines.
What you will be doing
Fine-tune and control VLLMs for video and audio understanding.
Design algorithms for balancing, filtering, and curating training and evaluation datasets, informed by model behavior and failure modes.
Implement classic and modern algorithms for processing, clustering, evaluating, and filtering large-scale datasets.
Work within high-performance, scalable distributed systems capable of handling petabytes of data, with attention to throughput, correctness, and reproducibility.
Collaborate with other researchers and product stakeholders to iteratively improve training sets and evaluation protocols through tight feedback loops driven by model performance.
Requirements
Experience training, fine-tuning, or post-training large-scale VLLMs or multimodal foundation models.
Strong software engineering skills and proficiency in JAX or PyTorch.
Ability to develop and implement computer vision models for data filtering and evaluation.
Understanding of relevant topics in statistics and clustering.
Enthusiasm for delving into system implementations to enhance performance and maintainability.
This position is open to all candidates.