We are seeking a Site Reliability Engineer Technical Lead who excels at bridging the gap between infrastructure and development. In this role, you will guide and mentor a team of SREs, working closely with engineering and product teams to ensure the reliability, scalability, performance, and efficiency of our systems.
A strong emphasis will be placed on observability: designing and implementing effective monitoring, logging, tracing, and alerting solutions that provide deep visibility into system behavior. You will also be responsible for supporting FinOps practices, ensuring cost optimization and financial visibility as part of the overall reliability strategy.
You should be comfortable collaborating with developers, presenting technical insights, helping shape best practices, and influencing decisions around both system health and efficiency. Your responsibilities will include incident management, automation, improvement of observability solutions, cost and performance tuning, and ensuring our platform evolves with our business needs.
Role:
Ensure production systems meet or exceed established SLAs and SLOs by actively maintaining and enhancing system performance, uptime, and efficiency.
Design and maintain end-to-end observability systems, including monitoring, logging, and distributed tracing, to detect anomalies, reduce noise, and enable proactive issue resolution.
Work closely with engineering teams to improve monitoring and alerting of their applications, ensuring accountability for the operational health of services.
Contribute to FinOps initiatives by providing visibility into cloud usage, identifying opportunities for optimization, and supporting cost-aware engineering decisions.
Optimize application performance and resource usage on Kubernetes through tuning, scaling strategies, and deep performance analysis.
Provide guidance on reliability-first design, instrumenting code for observability, and using dashboards (Grafana, Cost Explorer, or similar) to drive incident response and operational insights.
Support incident management processes and postmortems, driving continuous improvement across reliability and efficiency.
Mentor SREs and advocate for best practices in automation, observability, reliability, and financial accountability.
Requirements:
7+ years in SRE, DevOps, or Production Engineering roles.
Experience contributing to FinOps practices such as cloud cost optimization, financial observability, or cost governance in engineering teams.
Deep expertise in AWS, Kubernetes, and Linux.
Hands-on experience deploying and tuning monitoring tools (Prometheus, Thanos, time-series databases) and logging systems (ELK stack, Loki, Grafana or alternatives).
Strong background with tracing solutions (OpenTelemetry, Tempo, Jaeger).
Familiarity with cloud cost tooling (AWS Cost Explorer, Kubecost, Cloudability, or similar).
Strong incident management skills and ability to balance reliability, performance, and cost efficiency.
Experience with automation tools and practices for deployment and infrastructure management.
Excellent communication and collaboration skills, with the ability to influence both technical and business stakeholders.
An ownership mindset, with a proactive and reliable approach to work.
This position is open to all candidates.