Site Reliability Engineer (SRE), Development Department

10/08/2025
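Location: Tel Aviv-Yafo
Seeking a Site Reliability Engineer for a role that covers maintaining the systems, their availability, and their day-to-day operation, as well as ensuring the completeness of monitoring, backup, protection, business continuity, and maintenance of the development environments in coordination with the infrastructure teams, while continuously improving work processes and automation.
Requirements:
Minimum requirements
Previous experience in SRE / DevOps / infrastructure
Deep understanding of Linux, scripting ability, monitoring and logging tools
Familiarity with K8s / OpenShift / containers
Familiarity with VMs, operating systems, and networking
FinOps orientation
Familiarity with KPIs, SLA, SLO, SLI
Experience working with monitoring tools
Experience working with critical environments and production systems
Additional criteria
Excellent communication skills and a proactive, take-initiative attitude
Organizational culture: commitment to continuous improvement of the developer experience, technological innovation, systems thinking, and technological curiosity. This position is open to women and men alike.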
 
Similar jobs that may interest you
Exclusive job
2 days ago
Sela
Job Type: More than one
We are looking for a highly motivated DevOps Engineer with at least 2 years of hands-on experience to join our growing team. You will play a key role in automating, monitoring, and optimizing our cloud infrastructure, CI/CD pipelines, and deployment processes. You will work closely with development, QA, and IT teams to ensure seamless integration and delivery.

Responsibilities:

Design, build, and maintain scalable and secure infrastructure using IaC tools.
Implement and manage CI/CD pipelines.
Automate deployment and monitoring processes.
Monitor system performance and ensure high availability.
Collaborate with development and QA teams to streamline workflows.
Maintain and optimize cloud-based environments (e.g., AWS, GCP, Azure).
Ensure security and compliance best practices.
Requirements:
2+ years of experience in a DevOps or related engineering role.
Strong experience with Linux system administration.
Proficiency in cloud platforms: AWS (preferred), GCP, or Azure.
Experience with CI/CD tools: Jenkins, GitLab CI, CircleCI, or similar.
Proficient in Infrastructure as Code (IaC) tools: Terraform, CloudFormation, or Pulumi.
Strong scripting skills in Bash, Python, or similar.
Familiarity with Docker and container orchestration using Kubernetes.
Experience with monitoring/logging tools: Prometheus, Grafana, ELK/EFK, Datadog, etc.
This position is open to all candidates.
 
21/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a Site Reliability Engineer who excels at bridging the gap between infrastructure and development. In this role, you will work closely with engineering teams to ensure the reliability, scalability, and performance of our systems. A strong emphasis will be placed on observability: designing and implementing effective monitoring, logging, tracing, and alerting solutions to provide deep visibility into system behavior. You should be comfortable collaborating with developers, presenting technical insights, and helping shape best practices. Your responsibilities will include incident management, automation and improvement of our observability solutions, and continuous performance tuning to ensure our platform can scale and evolve with our business needs.

Role:
Ensure production systems meet or exceed established SLAs and SLOs by actively maintaining and enhancing system performance and uptime.
Design and maintain end-to-end observability systems (including monitoring, logging, and distributed tracing) to detect anomalies and enable proactive issue resolution.
Work closely with engineering teams to improve how their applications are monitored and alerted on. Help define meaningful alerts, reduce noise, and ensure developers are accountable for the operational health of their services.
Optimize application performance on Kubernetes through resource tuning, scaling strategies, and deep performance analysis.
Provide guidance on reliability-first design, instrumenting code for observability, and using Grafana dashboards to drive decision-making and incident response.
Requirements:
5+ years in SRE, DevOps, or Production Engineering roles
Deep expertise in AWS, Kubernetes, Linux
Experience deploying and tuning monitoring tools such as Prometheus and Thanos, and time-series databases for storing metrics.
Experience with logging using the ELK stack, Loki, Grafana, or alternatives.
Experience with tracing tools such as OpenTelemetry, Tempo, and Jaeger.
Strong understanding of incident management processes and best practices.
Experience with automation tools and practices for deployment and infrastructure management.
Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
Ownership mindset, proactive and reliable
This position is open to all candidates.
 
05/08/2025
Location: Tel Aviv-Yafo
Job Type: Full Time
As a Principal DevOps Engineer in our Platform Engineering team, you will lead the design and implementation of cutting-edge CI/CD pipelines and cloud architecture that powers our development environment. You'll drive initiatives to enhance developer productivity through automation, tooling, and infrastructure improvements, working with a modern tech stack including Kubernetes, Python, cloud-native and high-scale technologies.
Your Impact
Architect and implement scalable, resilient CI/CD pipelines and cloud infrastructure that supports our engineering organization's evolving needs
Design and develop internal developer tools and platforms that significantly improve developer experience and productivity
Drive the evolution of our Kubernetes-based deployment infrastructure in Google Cloud Platform, ensuring security, reliability and performance
Optimize and scale our CI/CD infrastructure including Jenkins, GitLab, TeamCity, and artifact management systems
Mentor and guide other engineers on DevOps best practices, infrastructure design, and implementation strategies
Drive adoption of infrastructure-as-code, automated testing, and deployment methodologies
Collaborate with development teams to understand their needs and implement solutions that accelerate their workflow
Establish standards and best practices for infrastructure reliability, observability, and performance.
Requirements:
7+ years of experience in DevOps, Site Reliability Engineering, or Platform Engineering roles
Extensive experience with CI/CD pipeline design and implementation in complex environments
Advanced knowledge of Kubernetes administration, deployment patterns, and ecosystem tools
Strong programming skills in Python with solid understanding of OOP principles and design patterns
Deep understanding of cloud architecture, specifically with Google Cloud Platform services
Proven track record designing and implementing developer tooling and automation
Experience managing containerized applications and services in production environments
Strong system design skills with focus on scalability, reliability, and security
Knowledge of GitOps workflows and infrastructure-as-code using tools like Terraform, Pulumi, or equivalent
Familiarity with GitLab CI administration and pipeline development
Willingness to participate in an on-call rotation during working and non-working hours
Nice-to-Have
Knowledge of observability platforms and practices (Prometheus, Grafana, distributed tracing)
Familiarity with TeamCity administration and pipeline development
Experience implementing security best practices in CI/CD pipelines
Understanding of compliance requirements in software delivery pipelines
Experience with Infrastructure as Code testing frameworks
Knowledge of software architecture patterns and microservices design.
This position is open to all candidates.
 
23/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
Team up with the DevOps team to design and implement scalable systems that will keep Fiverr running smoothly and support our significant business growth.

You will join an innovative, high-performance team and work with cutting-edge technologies in a dynamic and agile environment.

What am I going to do?
Maintain and build a large-scale, highly available cloud infrastructure focusing on K8S.
Improve resiliency and cost efficiency of our cloud infrastructure.
Automate tasks and error-handling scenarios.
Develop and adopt new tools to make Development and Operations processes at Fiverr more efficient.
Collaborate with developers to optimize service performance, reliability, and scale.
Evolve and maintain Fiverr's AWS infrastructure by improving and adopting new services.
Maintain Fiverr availability by participating in DevOps on-call shifts.
Mentor DevOps engineers.
Requirements:
5+ years of experience as a DevOps engineer
Working in a Linux environment
Writing scripts in Python
Production experience with AWS & Kubernetes.
2+ years of experience with CI/CD processes.
Good knowledge of networking concepts (Load Balancers, DNS, VPC)
Experience in designing and maintaining high-availability solutions for large-scale
Experience with monitoring tools and log analytics (Grafana, Prometheus, Graphite)
Experience with IaC tools (Terraform; Terragrunt is an advantage)
Development experience - Advantage
This position is open to all candidates.
 
15/07/2025
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We are looking for a Site Reliability Engineering (SRE) & Production Team Leader to join our Engineering team. Someone who has a passion for observability, monitoring, automation, and high-availability systems, and who has a desire to solve complex technological challenges with a proactive approach to continuous improvement.
We use an interesting and mixed technology stack: Kubernetes, Terraform, CI/CD pipelines, Datadog, Prometheus, and cloud-native architectures.
In this position, you will use your expertise in building and scaling SRE operations, and will design, implement, and operate a world-class reliability strategy.
About Us
We are a key player in the network security field, striving to provide the leading SASE platform in the market. Our innovative approach, merging cloud and on-device protection, redefines how businesses connect in the era of cloud and remote work.
Key Responsibilities
Design, build, and manage our SRE framework to ensure observability, resilience, and high availability.
Develop and automate solutions for proactive monitoring, incident response, and performance optimization.
Improve and maintain our alerting and monitoring stack, leveraging tools like Datadog, Prometheus, and Grafana.
Lead post-mortem analysis and implement continuous improvement initiatives.
Collaborate with DevOps, Engineering, and Product teams to ensure smooth and efficient delivery of reliable services.
Requirements:
SRE & Production Manager with 5+ years of experience in SRE, Production Engineering, or DevOps, including 2+ years in a leadership role.
Experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
A problem solver, capable of finding creative solutions and getting things done.
Fluent with incident management, RCA processes, and operational best practices.
Experience with AWS (EKS, EC2, RDS, S3, networking configurations).
It would be great if you also have:
Experience in high-scale distributed systems.
Background in security and compliance for cloud infrastructure.
Understanding of cost optimization and resource management in cloud environments.
Familiarity with machine learning or predictive analytics for proactive reliability management.
Proficiency in Python, Go, or Bash for automation and scripting.
This position is open to all candidates.
 
13/07/2025
Location: Tel Aviv-Yafo and Netanya
Job Type: Full Time
At our company, we're reinventing DevOps to help the world's greatest companies innovate -- and we want you along for the ride. This is a special place with a unique combination of brilliance, spirit and just all-around great people. Here, if you're willing to do more, your career can take off. And since software plays a central role in everyone's lives, you'll be part of an important mission. Thousands of customers, including the majority of the Fortune 100, trust our company to manage, accelerate, and secure their software delivery from code to production -- a concept we call liquid software. Wouldn't it be amazing if you could join us in our journey?
We are looking for a Site Reliability Engineering Manager to lead our Israel SRE team. In this role, you'll drive best practices in reliability engineering, ensuring the stability, availability, and performance of our company's SaaS services. You'll collaborate with global SRE leaders, refine processes, and foster a culture of accountability and continuous improvement.
As a Site Reliability Engineering Manager at our company you will
Lead, mentor, and develop a high-performing SRE Israel team, fostering collaboration, innovation, and accountability
Ensure SaaS reliability, performance, and availability, meeting or exceeding service-level objectives
Drive SRE best practices, including capacity planning, incident management, chaos engineering, and disaster recovery
Implement proactive monitoring, alerting, and anomaly detection aligned with SaaS standards
Collaborate with P&E and Cloud engineering teams to embed reliability into the SDLC
Oversee incident management, ensuring swift identification, escalation, and resolution
Maintain comprehensive SRE documentation, including processes, incident reports, and system architecture
Evaluate and adopt tools, technologies, and methodologies to enhance uptime and reliability.
Requirements:
3+ years of management experience leading a team of SRE, DevOps, or a similar SaaS role
Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
Strong expertise in cloud platforms (AWS, GCP, or Azure), containers (Kubernetes, Docker), and configuration management (Terraform, Ansible)
Proficiency in Python or Go for automation and system optimization, as well as GitOps experience with SCM tools (e.g., Git, Bitbucket)
Strong leadership, communication, and collaboration skills, working across globally distributed teams
Familiarity with Agile methodologies, CI/CD pipelines, and orchestration tools (Jenkins, ArgoCD, StackStorm)
Familiarity with Chaos Engineering (e.g., Gremlin, Litmus, Chaos Toolkit)
Hands-on with alerting & observability tools (e.g., PagerDuty, OpsGenie, New Relic, Coralogix)
Strong understanding of scalability, high availability, and security best practices in cloud & Kubernetes environments.
This position is open to all candidates.
 
14/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a DevOps Engineer (Platform oriented). This role is perfect for a highly experienced and proactive DevOps Engineer with outstanding coding skills, who is passionate about building flawless products in a high-scale production.
This job is located in Tel Aviv (hybrid).
About Us
We are a key player in the network security field, striving to provide the leading SASE platform in the market. Our innovative approach, merging cloud and on-device protection, redefines how businesses connect in the era of cloud and remote work.
Key Responsibilities
Designing & developing DevOps processes, methodologies, and tools.
Contribute to the design requirements of new products based on infrastructure as code.
Work with cross-functional teams to develop high-scale production environment and enable solutions for our business.
Requirements:
4+ years of DevOps experience
2+ years in solutions development for a high-scale production environment, preferably in SaaS.
3+ years of expertise with IaC development (Terraform/Terragrunt/Pulumi).
2+ years of experience in object-oriented development (Python/Go is an advantage).
3+ years of experience developing CI/CD pipelines (GitHub Actions is an advantage).
Experience with Cloud infrastructure, particularly AWS (API Gateway, ECS, Lambda).
Proficiency in microservices architecture and Linux-based systems.
Familiarity with queues (RabbitMQ, ActiveMQ, Kafka)
Experience with orchestration tools for containerized environments
Advantage: proficiency with HashiCorp tools (Consul, Vault, Nomad).
Advantage: knowledge of Configuration management tools (Ansible, Chef)
Networking understanding: network topologies, common protocols and services (DNS, HTTP(S), SSH, CDN, Proxy)
Experience with monitoring and logging tools (Prometheus, Grafana, Loki, DataDog).
This position is open to all candidates.
 
10/08/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Staff DevOps Engineer.
As a Staff DevOps Engineer, you will not be assigned to a specific R&D group, but will serve as a focal point for the DevOps engineers, helping and supporting them with any issue.
You'll lead cross-DevOps projects, push technical discussions forward, and work with each DevOps engineer as needed to solve diverse, complex, high-scale problems.
You'll support multi-region environments, build and maintain tools for automation, deployment, monitoring, and operations.
You'll troubleshoot and resolve issues in our various environments.
You'll play a key role in designing and enforcing infrastructure patterns that support zero-downtime deployments, high resilience, and compliance standards.
You'll collaborate with teams across the company to define and drive forward scalable, production-grade architecture.
You'll conduct periodic on-call duties and emergency response.
Requirements:
10+ years of experience in the industry, including 6+ years of hands-on experience in high-scale SaaS companies or zero-downtime/disaster recovery enterprise environments (e.g., banking, cybersecurity, healthcare, or large-scale cloud platform providers).
5+ years of experience in DevOps roles across a minimum of 2 different companies, with strong hands-on experience in Kubernetes and AWS. Experience with hybrid or multi-cloud architectures is a strong plus.
Experience with on-call duties to manage critical infrastructure and application issues outside business hours, ensuring high availability and reliability.
3+ years of experience with CI/CD tools such as GitLab, GitHub Actions, CircleCI, or similar.
2+ years of experience with programming languages such as Python or TypeScript. Strong Linux administration skills, including debugging and Bash scripting.
2+ years of experience with Terraform (experience with Terragrunt is a plus), as well as GitOps systems such as ArgoCD.
2+ years of experience with configuration management tools such as Ansible, Chef, or Puppet, and monitoring and alerting systems such as Datadog, Splunk, New Relic, or Grafana.
Strong understanding of networking concepts, including VPC, service meshes, routing, DNS, TLS, and firewalls.
Production-oriented mindset with a strong sense of ownership over reliability, scalability, and incident response.
This position is open to all candidates.
 
05/08/2025
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a highly talented technical individual to join the Cortex XDR infrastructure team.
The team is responsible for developing automation infrastructure and various cloud based tools and platforms that are used across the research, development and QA departments to ensure the functionality, stability and quality of the XDR product, alongside the efficiency of the infrastructure and process used to build, test and deploy on various clouds and distributions. We believe that the platforms & infrastructure that the team provides are a critical & crucial part of our department's progress to the modern future and one of our key growth factors.
As a Platform Engineer, you will play a pivotal role in enhancing our development and automation experience by pushing forward modern automation approaches, eliminating manual efforts, and introducing new development operations for continuous integration, scale, and durability using advanced cloud services. Your expertise will be used in areas such as infrastructure development, cloud-based automation, serverless infrastructure, automation as a service, providing technical guidance, and pushing an infrastructure/configuration-as-code and GitOps approach across the development departments.
To succeed in this role, you should have a strong foundation in modern cloud based automation methodologies and a comprehensive understanding of industry best practices, especially in redundancy and scalability of large systems and the ability to control them via SCM based declarative configs. You should be familiar with modern public clouds approaches and serverless based architectures, including virtualization containers and container based orchestration including multiple Kubernetes based deployments. You should be comfortable engaging in complex technical discussions and advocating for optimal solutions in a fast-paced growing environment as part of our quest for continuous improvement.
Your Impact
Utilize modern technologies including serverless cloud services, Kubernetes, and Terraform, among others, in an infrastructure/configuration-as-code GitOps approach to manage everything via source code and continuous integration processes
Design and implement (hands on) the next generation of platforms, automation frameworks, SDKs, and tools to be used across our entire R&D group, and be part of our infrastructure transition to the cloud
Develop and maintain a cloud-based test execution system that supports parallel executions on multiple operating systems and multiple cloud providers at a very large scale, thereby helping reduce the effort required for automated and manual testing and reducing time to market
Provide tools, systems and simulators for scaling up all lifecycle phases of our products and services including cross company and third party integrations and frameworks to be used in high scale
Introduce progress and help revolutionize our operations and lay the foundation for innovation and growth.
Requirements:
At least 4 years of hands-on experience as one of the following: Platform/InfraOps Engineer, DevOps, Cloud Infrastructure Engineer, or equivalent
Hands-on experience working with cloud services in big public Clouds (Azure, AWS, GCP)
Experience with designing and implementing cloud based infrastructure (especially serverless components), alongside using infrastructure as Code tools such as Terraform and Pulumi to automatically build and maintain the provisioned cloud infrastructure
Strong programming skills in Python (or another high level language), with vast experience in Object-Oriented Programming, including Design Patterns, Algorithms and Data Structures
Strong experience with containerization technologies (Docker, containerd) and orchestration, especially with various Kubernetes deployments, both self-managed and cloud-managed
Strong experience with SCM methodologies (especially Git) and SCM integration workflow.
This position is open to women and men alike.
 
Location: Tel Aviv-Yafo
Job Type: Full Time
Required: Site Reliability Engineer - Infra
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer - Infra on our Infrastructure team at the TLV office, you will play a key role in ensuring the reliability, scalability, and performance of our critical systems. You will be responsible for managing and improving our core infrastructure, with a focus on automation, monitoring, and incident response. You will work with a wide range of technologies, including Kubernetes, monitoring and observability tools, configuration management systems, and core networking services.
How you'll make an impact:
As a Site Reliability Engineer, you'll bring value by:
Ensure the reliability, availability, and performance of our infrastructure services.
Manage and maintain our Kubernetes infrastructure, including KubeVirt.
Design, implement, and maintain our monitoring and observability stack (SensuGo, VictoriaMetrics, Prometheus, ELK).
Automate infrastructure provisioning, configuration, and deployment processes using Puppet and Ansible.
Manage and maintain core services such as DNS and networking.
Troubleshoot and resolve complex infrastructure issues in a timely and efficient manner.
Participate in on-call rotations and incident response.
Develop and maintain infrastructure-as-code (IaC).
Identify and implement proactive measures to prevent incidents and improve system reliability.
Collaborate with development teams to ensure smooth and reliable deployments.
Contribute to the design and implementation of new infrastructure solutions.
Drive improvements in system architecture, processes, and tools.
Mentor and coach other team members.
Requirements:
5+ years of experience in a Site Reliability Engineering, Systems Engineering, or similar role.
Deep understanding of Site Reliability Engineering principles and practices.
Extensive experience with Kubernetes, including deployment, management, and troubleshooting.
Strong experience with monitoring and observability tools such as SensuGo, Zabbix, VictoriaMetrics, Prometheus, and ELK.
Proficiency in configuration management tools such as Puppet and Ansible.
Solid understanding of Linux internals and networking.
Experience with managing and maintaining core services such as DNS and networking.
Strong programming skills in Python and/or Go.
Experience with both on-premises and cloud environments.
Experience with KubeVirt.
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
Ability to participate in on-call rotations including weekends.
Preferred Qualifications:
Experience with large-scale, distributed systems.
Experience with other cloud providers (e.g., AWS, Azure, GCP).
Contributions to open-source projects.
This position is open to all candidates.
 