We are seeking a Senior DevOps Engineer with extensive experience to lead the design, development, deployment, and operation of large-scale software solutions. This role is a critical bridge between Software Engineering and Infrastructure and demands deep proficiency in building and operating reliable, scalable systems within a complex Big Data environment.
What you'll be doing all day:
Own Reliability and Scalability: Lead the architecture and implementation of best practices to ensure high availability, optimal performance, and horizontal scalability of our critical systems, operating within a vast Big Data landscape.
Infrastructure as Code (IaC): Develop, maintain, and evolve our infrastructure using advanced IaC tools (e.g., Terraform or Pulumi), ensuring full automation of service deployment and management across our AWS/GCP cloud environment.
Strategic Collaboration: Partner closely with application software engineering teams to design, review, and implement systems that are stable, secure, and performant.
Observability: Implement and manage robust monitoring, logging, and alerting solutions to enable proactive identification and deep Root Cause Analysis (RCA) of issues.
Automation & Efficiency: Identify and eliminate manual tasks ("Toil") by automating repetitive processes to continuously improve operational efficiency and system reliability.
Production Incident Response: Participate in an on-call rotation to quickly investigate, troubleshoot, and mitigate critical production incidents, driving post-mortems to prevent recurrence.
Performance Engineering: Analyze system performance, conduct performance tuning, and execute capacity planning to meet future demands.
Requirements:
Proven Experience: 5+ years of experience as a Production Engineer, DevOps Engineer, or SRE, running and managing large-scale operations on a major cloud provider (AWS or GCP).
Coding Proficiency: 5+ years of experience developing server-side applications or tooling using languages like Python, Java, Node.js, or Go.
Deep Infrastructure Knowledge: Strong understanding of Kubernetes and container orchestration, complemented by solid knowledge of Web Servers (e.g., Nginx), Load Balancers, Caching Systems (e.g., Redis/Memcached), Databases (Relational and NoSQL), and networking fundamentals.
CI/CD & GitOps: Practical experience with modern CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and familiarity with GitOps principles.
Communication: Excellent communication and collaboration skills to coordinate effectively across various R&D and Infrastructure groups.
Passion: Eagerness to take on complex challenges and a continuous desire to learn and implement new, cutting-edge technologies.
This position is open to all candidates.