We are committed to providing our customers with reliable and secure services, so we are building out our newly formed Site Reliability Engineering team.
As one of the first members of our Site Reliability Engineering team, you will be responsible for building and leading processes that ensure the reliability, availability, scalability, and performance of the cloud infrastructure that runs our databases.
You will collaborate with teams such as Control Plane, Dataplane, Core, Security, Support, and Operations, and guide them in designing and implementing scalable, secure, highly available, and fault-tolerant distributed systems.
You will also own incident management and response, post-mortem analysis (including running blameless post-mortems), and continuous improvement of our services.
You will apply your software engineering expertise to develop platforms and tools that improve the operational and engineering efficiency of our Cloud.
This role is a unique opportunity to make a significant impact on our elastic, limitless-scale, high-performance, serverless Cloud.
What will you do?
Collaborate with engineering teams to design and implement scalable, secure, and highly available systems.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for our Cloud.
Ensure all infrastructure components in our Cloud (including Dataplane, Control Plane, and Core) have monitoring and alerting in place so that incidents are detected and resolved promptly.
Enhance and refine incident response processes and post-mortem analysis for any outages in our Cloud, including working with the Support team to communicate with impacted customers.
Continuously improve the reliability and performance of our services.
Plan, enable, and drive chaos engineering initiatives across Engineering teams, based on internal priorities.
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.
Requirements:
Bachelor's or Master's degree in Computer Science or a related field.
At least 8 years of experience in Site Reliability Engineering or a related field.
Hands-on experience with Go and/or Python.
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
You are a strong problem solver and have solid production debugging skills.
You are passionate about efficiency, availability, scalability, and data governance.
You thrive in a fast-paced environment and see yourself as a partner to the business, sharing the goal of moving it forward.
You have a high level of responsibility, ownership, and accountability.
Excellent communication and interpersonal skills.
This position is open to all candidates.