We are looking for a hands-on data infrastructure and platform engineer with proven leadership skills to lead and evolve our self-service data platform. The platform powers petabyte-scale batch and streaming workloads, near-real-time analytics, experimentation, and the feature platform. You'll guide an already excellent team of senior engineers, own reliability and cost efficiency, and be the focal point for pivotal, company-wide data initiatives: the feature platform, the lakehouse, and streaming.
Own the lakehouse backbone: Mature Iceberg at high scale (partitioning, compaction, retention, metadata) and extend IcebergManager, our in-house product, to automate lakehouse management in a self-serve fashion.
Unify online/offline for features: Drive Flink adoption and patterns that keep features consistent and low-latency for experimentation and production.
Make self-serve real: Build golden paths, templates, and guardrails so product/analytics/DS engineers can move fast safely.
Run multi-tenant compute efficiently: EMR on EKS powered by Karpenter on Spot instances; right-size Trino/Spark/Druid for performance and cost.
Cross-cloud interoperability: BigQuery + BigLake/Iceberg interop where it makes sense (analytics, experimentation, partnership).
What you'll be doing:
Leading a senior Data Platform team: setting clear objectives, unblocking execution, and raising the engineering bar.
Owning SLOs, on-call, incident response, and postmortems for core data services.
Designing and operating EMR on EKS capacity profiles, autoscaling policies, and multi-tenant isolation.
Tuning Trino (memory/spill, CBO, catalogs), Spark/Structured Streaming jobs, and Druid ingestion/compaction for sub-second analytics.
Extending Flink patterns for the feature platform (state backends, checkpointing, watermarks, backfills).
Driving FinOps work: CUR-based attribution, S3 Inventory-driven retention/compaction, Reservations/Savings Plans strategy, OpenCost visibility.
Partnering with product engineering, analytics, and data science & ML engineers on roadmap, schema evolution, and data product SLAs.
Leveling up observability (Prometheus/VictoriaMetrics/Grafana), data quality checks, and platform self-service tooling.
Requirements:
2+ years leading engineers (team lead or manager) building or operating large-scale data platforms; 5+ years total in Data Infrastructure/DataOps roles.
Proven ownership of cloud-native data platforms on AWS: S3, EMR (preferably EMR on EKS), IAM, Glue/Data Catalog, Athena.
Production experience with Apache Iceberg (schema evolution, compaction, retention, metadata ops) and columnar formats (Parquet/Avro).
Hands-on depth in at least two of: Trino/Presto, Apache Spark/Structured Streaming, Apache Druid, Apache Flink.
Strong conceptual understanding of Kubernetes (EKS), including autoscaling, isolation, quotas, and observability.
Strong SQL skills, extensive performance-tuning experience, and solid proficiency in Python/Java.
Solid understanding of Kafka concepts; hands-on experience is a plus.
Experience running on-call for data platforms and driving measurable SLO-based improvements.
You might also have:
Experience building feature platforms (feature definitions, materialization, serving, and online/offline consistency).
Airflow (or similar) at scale; Argo experience is a plus.
Familiarity with BigQuery (and ideally BigLake/Iceberg interop) and operational DBs like Aurora MySQL.
Experience with ClickHouse / Snowflake / Databricks / StarRocks.
FinOps background (cost attribution/showback, Spot strategies).
Data quality, lineage, and cataloging practices in large orgs.
IaC (Terraform/CloudFormation).
This position is open to all candidates.