The SRE team is responsible for ensuring the reliability and stability of our cloud systems. Our role bridges engineering and operations, combining deep technical understanding with proactive system oversight.
We maintain strong expertise in DAC components, continuously improve monitoring and alerting, build automations and self-healing mechanisms, and work to reduce dependencies on other engineering groups. During production incidents, the SRE team leads the technical and communication flow end to end, from severity assessment and cross-team coordination to investigation, resolution, and RCA ownership. Our mission is to keep the platform scalable and ready to meet evolving customer and product needs while driving innovation in observability, automation, and operational excellence.
Responsibilities:
Monitor, manage, and operate our cloud services, including incident management.
Scale our service with the required monitoring and alerting capabilities.
Develop tools and automations based on C#/.NET and Python to support our operations and growth.
Work closely with R&D to make sure new features are reliable, easily deployable, and support the requirements of the service in terms of scale and security.
Establish a regular operational feedback cycle into our engineering teams.
Manage the Service Operations team to operate with a culture of business and customer-centricity by maintaining the SLA for each service, including incident response, problem management, and service upgrades.
Develop and drive, as the primary owner, the communication strategy for internal and external stakeholders (including customers) to convey service health, tracking against SLAs, current and historical incidents, and upcoming events or upgrades.
Ensure all technical procedures are documented, reviewed, and updated, and actively contribute to the maintenance of operational standards and policies.
Collaborate with the Support team to understand and improve user experience, performance, incident response, and the serviceability of our offerings.
Collaborate with the internal R&D team to automate infrastructure services and system administration tasks wherever possible, and implement a monitoring strategy that provides rapid feedback and diagnostics in the event of a service disruption.
Create relationships with other departments, including Marketing, Product Management, Engineering, and Customer Success, to make sure we provide services with high availability and superior performance for all our customers.
Requirements:
At least 5 years of relevant industry experience in maintaining a high-availability production environment (SRE or automation).
At least 2 years of experience in developing Python applications.
In-depth understanding of the entire web development process (design, development, and deployment).
Substantial experience in operating a high-availability cloud infrastructure.
Ability to adapt quickly to new technologies.
Good interpersonal skills.
BSc in computer science or a related field.
This position is open to all candidates.