What You'll Do:
Be part of a newly formed Site Reliability Engineering group focused on keeping our production environments alive and healthy.
Implement, provision, and improve the production services.
Take an active part in the SDLC, from inception and design through deployment, operation, and retirement.
Productionize services with a focus on service resilience, transparency, autoscaling, self-healing, cost-effectiveness, and security.
Work closely with DevOps, NOC, service delivery, data architects, security, product, and project managers.
Requirements:
What You've Done:
5+ years of experience in a similar role supporting and managing a large, distributed, scalable production environment, preferably in a Big Data analytics company.
5+ years of experience with and deep understanding of Linux systems, including performance tuning, scalability, and redundancy solutions (such as Red Hat Pacemaker).
Proven experience in provisioning and developing monitoring systems, such as InfluxDB, Grafana, Telegraf, and Prometheus.
Proven experience in provisioning, auditing, and logging applications and systems with the ELK stack.
Elite problem-solving and troubleshooting skills at the system, platform, application, and database levels.
Proven experience programming and scripting with Python and Bash.
Strong understanding of Infrastructure as Code and immutable-environment concepts.
Experience with configuration management tools (Ansible, Terraform, etc.).