We have been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It's an outstanding legacy of innovation that's motivated by extraordinary technology and amazing people. We are looking for a highly motivated DevOps/SRE engineer to join the AIR team, working on the Digital Twin for Data Center Simulation web application. AIR enables cloud-scale efficiency by creating identical replicas of real-world data center infrastructure deployments.
What you'll be doing:
Being part of the AIR team that is building the SaaS/IaaS platform for digital twins of AI data centers.
Owning the DevOps, infrastructure, and Site Reliability Engineering (SRE) requirements for AIR.
Focusing on efficiency by automating repetitive workflows.
Working on a microservices-based architecture.
Deploying and troubleshooting non-disruptive cloud operations with an emphasis on secure production infrastructure.
Continuously evaluating existing systems and driving improvements.
Managing deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and/or other orchestration tools.
Providing day-to-day support for engineering activities with CI/CD tools such as Git and Jenkins.
Multitasking across different tracks to efficiently address evolving priorities.
What we need to see:
BSc in Engineering, relevant certifications, or equivalent experience.
5+ years of experience with complex microservices-based architectures.
Highly skilled in Kubernetes and Docker.
Experience in IaaS environments: deploying, configuring, and administering Linux-based bare-metal servers.
Strong networking background (VLANs, routing, VPNs).
Experience with relational databases (MySQL) and SQL.
Experience with modern deployment architectures for non-disruptive cloud operations, including blue-green and canary rollouts.
Infrastructure as code (IaC) skills with tools like Ansible and Terraform.
Expert in AWS.
Knowledge of the best practices and discipline of managing and monitoring a highly available and secure production infrastructure.
Ways to stand out from the crowd:
Strong expertise in Infrastructure as a Service (IaaS).
Skills in Linux/Unix Administration.
Experience with Prometheus/Grafana.
Experience with APM tools like Dynatrace, Datadog, AppDynamics, New Relic, etc.
Experience implementing robust metrics collection and alerting infrastructure.
This position is open to all candidates.