We're looking for an Engineering Team Lead who will be responsible for the foundational infrastructure framework used by all engineering teams to build, deploy, and operate AI agents safely in production. We are building the "operating system" for AI, covering agent sessions, memory management, tool orchestration, durable execution, and multi-tenant isolation. You will lead a high-impact team of 3-4 engineers to create the runtime and platform that defines the future of autonomous enterprise intelligence.
You'll Own:
Agentic Framework Architecture: Designing and building an internal agentic framework, leveraging and integrating industry-standard tools such as LangChain, LangSmith, ADK, and similar ecosystems.
Evaluation and Quality Systems: Building evaluation frameworks and workflows for AI agents, including offline and online evaluations, quality metrics, regression detection, and experimentation infrastructure.
Team Leadership & Mentorship: Leading a squad of 3-4 senior engineers, fostering a culture of technical excellence, and managing end-to-end delivery in a fast-paced environment. You will spend approximately 50% of your time hands-on, architecting core systems and reviewing code, and 50% leading the team, mentoring engineers, and aligning with cross-functional stakeholders.
Observability, Monitoring, and Guardrails: Providing the organization with robust observability capabilities for AI agents, including tracing, logging, monitoring, cost tracking, and safety guardrails to ensure reliable and responsible usage.
Developer Enablement Platforms: Creating APIs, SDKs, and abstractions that enable product teams to easily build, test, and operate agents while adhering to platform standards.
Cross-Language Integrations: Designing integrations and tooling across Python and Java to enable seamless adoption of the AI framework within the broader backend ecosystem.
You'll Solve:
Agent Lifecycle and Orchestration Complexity: Managing agent execution, tool usage, memory, workflows, and failure modes in production-grade systems.
AI System Reliability at Scale: Ensuring agents remain observable, debuggable, and safe as usage scales across teams and products.
Evaluation and Drift Challenges: Detecting quality regressions, model behavior changes, and unintended agent behaviors through robust evaluation and monitoring systems.
Platform Adoption Friction: Balancing flexibility with guardrails so teams can innovate quickly without compromising reliability, security, or cost controls.
Requirements: 8+ years of backend engineering experience, with strong system design and platform-building expertise. Tech-lead or team-leadership experience is an advantage.
Strong analytical and problem-solving skills, with the ability to debug and resolve complex technical issues efficiently.
Hands-on experience with agentic systems and frameworks such as LangChain, LangSmith, ADK, or equivalent agent orchestration platforms.
Strong understanding of AI evaluation methodologies, including agent evaluations, prompt evaluation, regression testing, and quality monitoring.
High proficiency in Python for building production-grade AI frameworks and services.
Familiarity with Java and experience integrating backend platforms or tooling into Java-based systems.
Experience building observability, monitoring, or platform tooling for distributed systems.
Ability to reason about complex, evolving AI-driven systems.
Experience with cloud platforms and scalable microservices architectures.
Excellent communication skills and a strong platform mindset, with experience enabling multiple teams.
This position is open to all candidates.