As a Senior Data Engineer, youll collaborate with top-notch engineers and data scientists to elevate our platform to the next level and deliver exceptional user experiences. Your primary focus will be on the data engineering aspectsensuring the seamless flow of high-quality, relevant data to train and optimize content models, including GenAI foundation models, supervised fine-tuning, and more.
Youll work closely with teams across the company to ensure the availability of high-quality data from ML platforms, powering decisions across all departments. With access to petabytes of data through MySQL, Snowflake, Cassandra, S3, and other platforms, your challenge will be to ensure that this data is applied even more effectively to support business decisions, train and monitor ML models and improve our products.
Key Job Responsibilities and Duties:
Rapidly developing next-generation scalable, flexible, and high-performance data pipelines.
Dealing with massive textual sources to train GenAI foundation models.
Solving issues with data and data pipelines, prioritizing based on customer impact.
End-to-end ownership of data quality in our core datasets and data pipelines.
Experimenting with new tools and technologies to meet business requirements regarding performance, scaling, and data quality.
Providing tools that improve Data Quality company-wide, specifically for ML scientists.
Providing self-organizing tools that help the analytics community discover data, assess quality, explore usage, and find peers with relevant expertise.
Acting as an intermediary for problems, with both technical and non-technical audiences.
Promote and drive impactful and innovative engineering solutions
Technical, behavioral and interpersonal competence advancement via on-the-job opportunities, experimental projects, hackathons, conferences, and active community participation
Collaborate with multidisciplinary teams: Collaborate with product managers, data scientists, and analysts to understand business requirements and translate them into machine learning solutions. Provide technical guidance and mentorship to junior team members.
Req ID: 20718
Requirements: Bachelors or masters degree in computer science, Engineering, Statistics, or a related field.
Minimum of 6 years of experience as a Data Engineer or a similar role, with a consistent record of successfully delivering ML/Data solutions.
You have built production data pipelines in the cloud, setting up data-lake and server-less solutions; you have hands-on experience with schema design and data modeling and working with ML scientists and ML engineers to provide production level ML solutions.
You have experience designing systems E2E and knowledge of basic concepts (lb, db, caching, NoSQL, etc)
Strong programming skills in languages such as Python and Java.
Experience with big data processing frameworks such, Pyspark, Apache Flink, Snowflake or similar frameworks.
Demonstrable experience with MySQL, Cassandra, DynamoDB or similar relational/NoSQL database systems.
Experience with Data Warehousing and ETL/ELT pipelines
Experience in data processing for large-scale language models like GPT, BERT, or similar architectures - an advantage.
Proficiency in data manipulation, analysis, and visualization using tools like NumPy, pandas, and matplotlib - an advantage.
Experience with experimental design, A/B testing, and evaluation metrics for ML models - an advantage.
Experience of working on products that impact a large customer base - an advantage.
Excellent communication in English; written and spoken.
This position is open to all candidates.