We are looking for a highly experienced Site Reliability Engineer (SRE) or DevOps Engineer to join our passionate Engineering team. In this critical role, you will be instrumental in ensuring the reliability, performance, and scalability of our core infrastructure. You will work with the latest cloud technologies, focusing on automating and optimizing our continuous integration and continuous deployment pipelines, and managing our Kubernetes environment.
What You'll Do
Analyze and determine integration needs.
Automate infrastructure and application deployment on AWS.
Identify manual processes that can be automated.
Maintain and improve our cloud infrastructure.
Continuously maintain and improve our CI/CD pipelines.
Design, implement, and maintain scalable and highly available infrastructure systems, focusing on reliability and performance.
Develop and implement robust monitoring, alerting, and logging solutions to proactively identify and resolve potential system issues.
Conduct blameless post-mortems for critical incidents, driving continuous improvement in system resilience.
Participate in capacity planning and performance tuning to ensure the platform can handle current and future load.
Be available on call (required) to respond to and resolve critical infrastructure issues outside regular business hours, including weekends.
Requirements
5+ years of experience in a DevOps or SRE role.
Experience designing and operating large-scale distributed systems.
Deep understanding of SRE principles and practices (SLOs/SLIs, Error Budgets, Toil reduction).
Working knowledge of Kubernetes cluster administration (preferably EKS), including Helm and GitOps.
Scripting and automation skills (Shell, Python, etc.).
Experience using a broad range of AWS technologies (EC2, S3, VPC, Lambda, IAM, CloudWatch, etc.).
Proven track record with build automation and CI/CD pipelines (GitHub Actions, ArgoCD, FluxCD).
Experience working with monitoring frameworks such as Grafana, Datadog, Prometheus, and ELK.
Experience with cloud-managed database services (e.g., AWS RDS, Redis, DynamoDB).
Knowledge of DNS, load balancing, SSL, TCP/IP, networking, and security.
Experience provisioning infrastructure with IaC tools such as Terraform, Serverless Framework, CloudFormation, and Pulumi.
Experience with database administration and maintenance.
Outstanding interpersonal communication skills.
This position is open to all candidates.