We are seeking an Observability & SRE Expert with hands-on experience in On-Premise and Cloud environments to join us part-time (100 hours per month).
The role involves the following practical tasks:
Implement monitoring, logging, tracing, and metrics for On-Premise and Cloud. Manage SLOs, SLAs, SLIs, and Error Budgets.
Monitor, troubleshoot, and resolve performance issues. Handle incidents and perform root cause analysis.
Develop automation for operational processes. Manage infrastructure and write maintenance scripts.
Build intuitive dashboards and optimize alert management.
Cross-team collaboration: Work with Development, Infrastructure, Security, and DevOps teams.
Apply SRE principles to enhance system reliability and performance.
Place of employment: Center.
Scope of hours: Part-time.
Requirements:
Experience: 7+ years of hands-on work in Observability, Monitoring, and SRE across On-Premise & Cloud environments.
Tools: Prometheus, Grafana, ELK, Splunk, Datadog, Terraform, Ansible, etc.
Programming: Python, Java
Big Data: Hadoop, Spark, Kafka
Cloud: AWS/GCP/Azure, CloudWatch, Stackdriver.
SLO/SLA/SLI/Error Budgets: Practical experience.
Incident Management: RCA, incident handling.
Automation: Scripting and process automation.
Nice to have:
Degree in Computer Science or a related field.
Kubernetes, Docker experience.
AI & Machine Learning knowledge.
This position is open to all candidates.