We're looking for a Data Platform Engineer to own and scale the Kubernetes infrastructure powering our large-scale data processing platform.
This is a hands-on role at the intersection of infrastructure and data engineering. You'll operate Kubernetes clusters running thousands of nodes, supporting workloads like Spark, Airflow, and remote shuffle services. Your focus: making distributed data workloads reliable, cost-efficient, and performant at scale.
This is not a traditional DevOps or SRE role. You won't be building CI/CD pipelines or managing web services. Instead, you'll be deep in Spark executor scaling, shuffle optimization, batch scheduler tuning, and capacity planning for clusters that process massive datasets daily.
If you've tuned Spark on Kubernetes at scale, wrestled with shuffle storage bottlenecks, or optimized batch scheduling across thousands of concurrent pods - this role is for you.
WHAT YOU'LL DO:
Operate and scale Kubernetes clusters with thousands of nodes supporting large-scale Spark and data processing workloads.
Manage and optimize Apache Spark on Kubernetes - executor autoscaling, driver scheduling, resource tuning, spot instance strategies.
Deploy and tune remote shuffle services (e.g., Apache Celeborn) to handle shuffle data at scale across multiple availability zones.
Operate and improve self-hosted Apache Airflow infrastructure on Kubernetes.
Configure and optimize batch schedulers (e.g., YuniKorn, Volcano) for gang scheduling, fair-share queuing, and resource prioritization.
Drive cost optimization across large compute fleets - spot vs. on-demand strategies, node right-sizing, autoscaling policies, local SSD utilization.
Support and collaborate with Data Engineering teams on workload performance, resource allocation, and infrastructure requirements.
Manage infrastructure-as-code (Terraform) and GitOps deployments (ArgoCD, Helm) for data platform services.
Integrate with managed data platforms (e.g., Databricks) and cloud storage for hybrid processing architectures.
REQUIREMENTS:
3+ years of experience operating Kubernetes in production at significant scale (hundreds to thousands of nodes).
Hands-on experience with Apache Spark on Kubernetes - you understand executors, drivers, dynamic allocation, shuffle behavior, and how they map to K8s primitives.
Strong understanding of Kubernetes internals - scheduling, resource management, node autoscaling, pod lifecycle, taints/tolerations, local storage.
Experience with cloud infrastructure (GCP preferred) - managed Kubernetes, spot/preemptible instances, local SSDs, networking at scale.
Comfortable with infrastructure-as-code (Terraform) and GitOps workflows.
Proficiency in Python or Go.
NICE TO HAVE:
Experience operating Apache Airflow at scale on Kubernetes.
Experience with Apache Celeborn or similar remote shuffle services.
Familiarity with YuniKorn or Volcano batch schedulers.
Experience with Databricks administration and integration.
Knowledge of data formats and storage systems (Parquet, Delta Lake, cloud object storage).
Experience with streaming or messaging systems (Kafka).
Experience with Prometheus/Grafana observability stacks for data platform monitoring.
Contributions to open-source data infrastructure projects.
This position is open to all candidates.