Behind that no-code surface sits a non-trivial backend: a distributed workflow runtime that executes customer-defined business processes, integrates with dozens of third-party systems, and serves real production transactions across 7 regional clusters. We are looking for a Senior Backend Engineer to help us evolve that runtime.
Own architectural decisions on the workflow runtime, single- vs. multi-tenancy boundaries, and integration patterns. Write the design doc, defend it in review, ship it.
End-to-end delivery: requirements → design → implementation → deployment → monitoring → post-incident learning. No throwing work over the wall.
Partner with Product, DevOps, and the other engineering teams on cross-team initiatives (Celery → Temporal.io rollout, customer onboarding to new regions, breaking changes in the workflow runtime).
Mentor mid-level engineers through code review, design review, and on-call shadowing.
Continuously raise the bar on code quality, system stability, and test coverage.
Requirements:
7+ years building production backend systems, including at least 2 years owning a distributed/async runtime (a workflow engine, task queue, event-driven system, or stream processor such as Temporal.io, Celery, Kafka Streams, AWS Step Functions, or equivalent).
Deep hands-on Python (the player and workflow runtime are Python), plus working JavaScript.
Production-scale experience and mindset. You've operated systems where throughput, tail latency, and blast radius are first-class design constraints, not afterthoughts.
Production experience operating self-hosted infrastructure on Kubernetes (EKS preferred). Comfortable reading Helm charts, debugging pod resource issues, and reasoning about autoscaling.
Comfortable with CI/CD pipelines, infra-as-code, and being on-call for systems you ship.
AI fluency on both sides of the product. You use Claude Code or an equivalent agentic coding tool for real SDLC work (design docs, refactors, on-call triage) and have shipped LLM-backed features in production (agentic workflows, retrieval, tool use, structured output), with an eval loop and a real opinion on latency, cost, and safety.
Strongly preferred (any one of these moves you up the stack):
Direct experience with Temporal.io in production: workflow versioning, replayer-based regression detection, signal/query/update patterns, codec authoring, per-namespace operational concerns.
Strong grasp of distributed-systems failure modes: idempotency, retries, timeouts, partial failures, non-determinism, exactly-once vs. at-least-once semantics. You can debug a "stuck workflow" without a runbook.
Performance optimization in Python at the deserialization / interpreter / JIT layer (Cython, mypyc, alternative JS engines, profiling tooling).
Operating modern self-hosted infrastructure at scale. Hands-on with multi-cluster EKS, a metrics + logs stack (Prometheus / Grafana / centralized log aggregation), IaC, and GitOps workflows. You've owned the dashboards and alerts you're on-call against, not just consumed someone else's.
This position is open to all candidates.