Responsibilities:
Platform & Infrastructure
* Architect, build, and run the AWS/Kubernetes platform that hosts Alice's internal AI agents and tools; drive AWS Well-Architected pillars (operational excellence, security, reliability, performance, cost, sustainability).
* Own Infrastructure-as-Code: Terraform modules, standards, and reviews for Bedrock, agent runtimes, vector DBs, and supporting services.
AI Systems
* Design and ship production-grade agents and multi-agent pipelines using the Anthropic Agent SDK, Claude Code, AWS Bedrock, and MCP - not wrapper frameworks.
* Own the full agent lifecycle: scoping -> prototyping -> eval -> deploy -> monitor -> iterate.
* Integrate agentic workflows into internal and product systems via APIs, databases, webhooks, Slack, and email.
Reliability, Observability, Cost
* Build first-class observability across apps and infra: OpenTelemetry, Prometheus, plus LLM-specific tracing (Langfuse or equivalent), token/cost metrics, and eval pipelines.
* Define SLOs/SLIs and error budgets for AI services - latency, model fallback chains, eval regression gates, agent success rates. Lead incident readiness, response, and post-mortems.
* Drive FinOps: model routing by cost, cache hit rates, batch vs. realtime tradeoffs, budget alarms, per-team chargeback visibility.
* Implement guardrails: prompt-injection defenses, PII redaction, model allowlists, human-in-the-loop checkpoints, audit trails.
Org Impact
* Identify high-leverage workflows across the organization and translate them into scalable agentic automations.
* Partner with R&D, Delivery, Security, and external vendors to deliver platform capabilities.
About Alice:
Alice is a trust, safety, and security company built for the AI era. We safeguard the communicative technologies people use to create, collaborate, and interact - whether with each other or with machines. In a world where AI has fundamentally changed the nature of risk, Alice provides end-to-end coverage across the entire AI lifecycle. We support frontier model labs, enterprises, and UGC platforms with a comprehensive suite of solutions: from model hardening evaluations and pre-deployment red-teaming to runtime guardrails and ongoing drift detection.
Requirements:
Must-have
* 3-5 years in software engineering, shipping and operating production-grade systems.
* 2+ years hands-on AWS, Kubernetes, and Terraform in production - not familiarity, ownership.
* 1-2 years hands-on building and deploying LLM-powered or agentic systems in production.
* Proficiency in Python: async patterns, REST APIs, cloud-native architectu
This position is open to all candidates.