Location: Tel Aviv-Yafo
Job Type: Full Time
Required: Site Reliability Engineer - Infra
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer - Infra on our Infrastructure team at the TLV office, you will play a key role in ensuring the reliability, scalability, and performance of our critical systems. You will be responsible for managing and improving our core infrastructure, with a focus on automation, monitoring, and incident response. You will work with a wide range of technologies, including Kubernetes, monitoring and observability tools, configuration management systems, and core networking services.
How you'll make an impact:
As a Site Reliability Engineer, you will:
Ensure the reliability, availability, and performance of our infrastructure services.
Manage and maintain our Kubernetes infrastructure, including KubeVirt.
Design, implement, and maintain our monitoring and observability stack (SensuGo, VictoriaMetrics, Prometheus, ELK).
Automate infrastructure provisioning, configuration, and deployment processes using Puppet and Ansible.
Manage and maintain core services such as DNS and networking.
Troubleshoot and resolve complex infrastructure issues in a timely and efficient manner.
Participate in on-call rotations and incident response.
Develop and maintain infrastructure-as-code (IaC).
Identify and implement proactive measures to prevent incidents and improve system reliability.
Collaborate with development teams to ensure smooth and reliable deployments.
Contribute to the design and implementation of new infrastructure solutions.
Drive improvements in system architecture, processes, and tools.
Mentor and coach other team members.
Requirements:
5+ years of experience in a Site Reliability Engineering, Systems Engineering, or similar role.
Deep understanding of Site Reliability Engineering principles and practices.
Extensive experience with Kubernetes, including deployment, management, and troubleshooting.
Strong experience with monitoring and observability tools such as SensuGo, Zabbix, VictoriaMetrics, Prometheus, and ELK.
Proficiency in configuration management tools such as Puppet and Ansible.
Solid understanding of Linux internals and networking.
Experience with managing and maintaining core services such as DNS and networking.
Strong programming skills in Python and/or Go.
Experience with both on-premises and cloud environments.
Experience with KubeVirt.
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
Ability to participate in on-call rotations including weekends.
Preferred Qualifications:
Experience with large-scale, distributed systems.
Experience with other cloud providers (e.g., AWS, Azure, GCP).
Contributions to open-source projects.
This position is open to all candidates.
 
8272676
Similar jobs that may interest you
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
Required: Site Reliability Engineer
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer on the IT Production team in our TLV office, you'll play a vital role in building robust services and solving infrastructure challenges through automation, working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.
How you'll make an impact:
As a Site Reliability Engineer, you will:
Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.
Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).
Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.
Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.
Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.
Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.
Our Tech Stack:
Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
Requirements:
To thrive in this role, you'll need:
7 years of experience as an SRE, DevOps Engineer, or System Administrator in a large distributed environment, with a focus on Linux operating systems.
Experience supporting, troubleshooting and scaling large distributed systems in production.
Deep understanding of HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.
Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).
Deep understanding of Linux system internals and system performance tuning.
Experience with Configuration Management Tools (Puppet, Ansible, Chef, Terraform).
Experience programming in at least one of the following languages (Python, Golang, Rust, Ruby, C++, Java).
Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).
Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).
Experience with containerization technologies (Kubernetes, Docker).
Deep understanding of networking principles (TCP/IP, DNS, load balancing).
This position is open to all candidates.
 
8273985
14/07/2025
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
Cyberint, a market leader in External Risk Management, empowers global organizations to detect, respond, and remediate external threats efficiently. Now part of our company, Cyberint continues to grow and innovate at the intersection of cybersecurity and cloud-native SaaS technologies.
Join Our Operations Team
We are seeking a proactive, experienced Site Reliability Engineer (SRE) to join our dynamic Operations team. You'll be working on a cutting-edge SaaS solution that runs on AWS (EKS-based Kubernetes infrastructure), supporting an architecture with many moving parts. If you're driven by reliability engineering, love automation, and want to make an impact on mission-critical platforms, this role is for you.
What You'll Do
As an SRE at Cyberint, you will be instrumental in ensuring the observability, stability, and scalability of our platform. You will develop automated solutions and monitoring tools to proactively detect and respond to incidents, improve system resilience, and collaborate with engineering teams across the company to embed operational excellence into our product lifecycle.
Additionally, you will help evolve our AI-driven operational and monitoring tooling, including our on-call assistant bot, which leverages AI technologies to streamline incident resolution, automate repetitive tasks, and support real-time decision-making for engineers.
Key Responsibilities
Design, implement, and maintain monitoring and alerting systems (e.g., Prometheus, Grafana) to detect and prevent reliability issues.
Develop tools and automation (Python, Bash, etc.) for improving infrastructure reliability and operational efficiency.
Collaborate with R&D and Product teams to embed reliability-first principles into every stage of the development process.
Participate in and improve incident response processes, including running blameless postmortems and implementing preventive measures.
Enhance our Infrastructure-as-Code (IaC) and CI/CD practices to streamline deployments and reduce risk.
Maintain and extend internal AI-driven tools, such as bots that support SRE workflows (on-call management, triaging, etc.).
Document infrastructure, playbooks, and operational procedures to facilitate onboarding and knowledge sharing.
Requirements:
3+ years of experience in an SRE, DevOps, or similar role in a SaaS/cloud-native environment.
Strong experience with Kubernetes, AWS, and cloud-based distributed systems.
Hands-on experience building or maintaining monitoring stacks such as Prometheus, Grafana, ELK, etc.
Proficiency in Python, Bash, or similar scripting languages.
Experience with Infrastructure as Code tools (Terraform, Helm, etc.).
Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, ArgoCD).
Solid analytical and problem-solving skills with a passion for operational excellence.
Exposure to AI-based tooling (e.g., OpenAI API, LLM-based bots) to automate operations or enhance incident response processes.
Nice to Have
Experience with incident management platforms (e.g., PagerDuty).
Security-oriented mindset and experience in the cybersecurity industry.
Experience with service mesh, zero-downtime deployments, or chaos engineering.
Contributions to AI-assisted SRE initiatives or platform operations & monitoring automation.
This position is open to all candidates.
 
8257631
14/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We are looking for a Site Reliability Engineer (SRE) to join our Engineering team. Someone who has a passion for observability, monitoring, automation, and high-availability systems, and who has a desire to solve complex technological challenges with a proactive approach to continuous improvement.
We use an interesting and mixed technology stack: Kubernetes, Terraform, CI/CD pipelines, Datadog, Prometheus, and cloud-native architectures.
In this position, you will use your expertise in building and scaling SRE operations, and will design, implement, and operate a world-class reliability strategy.
About Us
We are a key player in the network security field, striving to provide the leading SASE platform in the market. Our innovative approach, merging cloud and on-device protection, redefines how businesses connect in the era of cloud and remote work.
Key Responsibilities
Develop and maintain our monitoring, alerting, and logging systems, ensuring high visibility into production environments.
Implement automation to improve system reliability, scalability, and efficiency.
Troubleshoot and resolve production incidents, leading root cause analyses and implementing permanent fixes.
Collaborate with software engineers and DevOps teams to enhance application performance and resilience.
Continuously improve operational processes, focusing on reducing toil and improving reliability.
Requirements:
3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Hands-on experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
Strong understanding of Linux systems, networking, and cloud-native architectures.
Experience with Kubernetes, Terraform, and CI/CD pipelines.
A problem solver, capable of finding creative solutions and getting things done.
Fluent with incident management, RCA processes, and operational best practices.
It would be great if you also have:
Experience in high-scale distributed systems.
Background in security and compliance for cloud infrastructure.
Familiarity with AWS (EKS, EC2, RDS, S3, networking configurations).
Proficiency in Python, Go, or Bash for automation and scripting.
Understanding of cost optimization and resource management in cloud environments.
Familiarity with machine learning or predictive analytics for proactive reliability management.
This position is open to all candidates.
 
8258448
Location: Tel Aviv-Yafo
Job Type: Full Time
We're looking for a Site Reliability Engineer (SRE) to enhance the reliability, performance, and scalability of our production infrastructure. This role goes beyond keeping systems running: you'll be a key player in shaping the culture of reliability, driving self-healing mechanisms, proactive alerting strategies, and automation to reduce toil and improve operational efficiency. You'll work closely with engineering teams to ensure high availability, observability, and smooth incident management processes.
Responsibilities
Ensure reliability & scalability of our production environment across multiple cloud providers.
Define and implement SRE best practices, fostering a culture of ownership, continuous improvement, and automation.
Automate everything, from infrastructure deployment to self-healing mechanisms that eliminate manual intervention.
Design and improve observability solutions (monitoring, logging, tracing) to enable faster detection and resolution of issues.
Optimize alerting strategies to ensure actionable, high-quality alerts while minimizing noise and fatigue.
Improve system resilience, driving chaos engineering, failover strategies, and automatic recovery processes.
Enhance incident response processes, including on-call strategies, root cause analysis, and post-mortems to drive long-term stability.
Collaborate with development teams to build reliable, scalable, and efficient architectures, ensuring seamless deployment and rollback processes.
Promote a culture of reliability, educating teams on best practices, service ownership, and production-readiness.
Requirements:
3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Strong expertise in Kubernetes and container orchestration in production.
Hands-on experience with cloud platforms (AWS, Azure, or GCP).
Proven experience with monitoring & observability tools (Prometheus, ELK, Grafana, Coralogix, etc.).
Strong scripting/programming skills (Python, Go, Bash, or similar).
Experience with Infrastructure as Code (IaC): Terraform, Helm, or similar tools.
Track record of improving system reliability, scalability, and performance.
Experience designing and implementing self-healing mechanisms to minimize human intervention.
Ability to foster a strong reliability culture across engineering teams, leading by example.
Excellent problem-solving skills, with a proactive and ownership-driven mindset.
This position is open to all candidates.
 
8228722
13/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Site Reliability Engineer to join our DevOps team. You will ensure the reliability, performance, and scalability of our back-office solutions, which serve as the foundation for the entire purchasing process. This role will lead the development of SRE capabilities, meeting SLI/SLO/SLA targets, and establishing effective monitoring systems. You will enhance our Software Development Lifecycle by integrating reliability and scalability, working with cross-functional teams, and supporting production environments. Additionally, you will implement incident management processes and conduct post-mortem analyses to drive continuous improvement. If you have a strong engineering and automation background and are passionate about the E-commerce field, then we would love to hear from you.
Roles and Responsibilities:
Develop and implement SRE capabilities to enhance the reliability, availability, and performance of Admin solutions.
Design and maintain proactive monitoring and alerting systems for deep visibility into critical business flows, beyond simple statuses, to identify functional issues.
Drive improvements in the Software Development Lifecycle (SDLC) for reliability and scalability from design to deployment.
Collaborate with development and operations teams to troubleshoot production incidents affecting the purchase flow through root cause analysis.
Lead SRE initiatives to boost system resilience and operational efficiency.
Implement best practices for incident management and conduct blameless post-mortems, contributing to capacity planning and performance testing to ensure scalability.
Requirements:
5+ years of experience as a Site Reliability/DevOps Engineer
Deep understanding of E-commerce flows, specifically with back-office operations and order processing - must
Experience as an Automation/Software Engineer with a strong understanding of software development principles and in building, testing, and deploying distributed systems - must
Experience in designing, implementing, and utilizing monitoring and observability platforms such as DataDog, NewRelic, Prometheus/Grafana, or ELK stack - must
Proficiency in scripting and automation using languages such as Python, Java, etc. - must
Ability to create dashboards, alerts, and insightful queries - must
Experience with AWS services to build and operate scalable and resilient applications (e.g., EC2, ECS/EKS, RDS, S3, Lambda, CloudWatch) - plus
Experience in automating infrastructure provisioning, application deployments, and repetitive operational tasks - plus
Proactive approach with excellent problem-solving skills
Strong collaborator, with an ability to work with cross-functional teams
Proficient in English.
This position is open to all candidates.
 
8255386
13/07/2025
Location: Tel Aviv-Yafo and Netanya
Job Type: Full Time
At our company, we're reinventing DevOps to help the world's greatest companies innovate -- and we want you along for the ride. This is a special place with a unique combination of brilliance, spirit and just all-around great people. Here, if you're willing to do more, your career can take off. And since software plays a central role in everyone's lives, you'll be part of an important mission. Thousands of customers, including the majority of the Fortune 100, trust our company to manage, accelerate, and secure their software delivery from code to production -- a concept we call liquid software. Wouldn't it be amazing if you could join us on our journey?
Our company seeks a highly skilled Senior Site Reliability Engineer to join our team! In this role, you will drive best practices, optimize operational workflows, and mentor junior engineers, fostering a culture of collaboration and innovation. This is an exciting opportunity for someone passionate about building and integrating services and systems that ensure the availability, performance, and reliability of our company's SaaS environments. You will lead large-scale, cross-functional initiatives and work closely with P&E engineering and Cloud teams to design, build, and maintain scalable, resilient infrastructure while championing best practices for automation, monitoring, and incident response. If you're eager to make a significant impact in a fast-paced, high-growth environment, we encourage you to apply.
As a Senior Site Reliability Engineer in our company, you will:
Lead and groom the team towards technical solutions guided by a strong understanding of the latest and greatest technologies like Kubernetes, Helm, Terraform, and more
Advocate, build, and manage scalable and reliable services and infrastructure to support our company SaaS services
Apply SRE best practices, including incident management, performance and capacity planning, and disaster recovery flows
Drive the reliability, performance, and availability of our SaaS products, ensuring service-level objectives are met or exceeded
Design, develop, and manage large-scale systems with CI/CD in mind, to support multiple production environments and use cases
Tackle large-scale production issues and bring out-of-the-box thinking to the table
Evaluate new cloud-native technologies and vendor products to continuously improve our SaaS offering
Requirements:
5+ years of relevant DevOps or SRE experience in large-scale production environments
2+ years of infrastructure automation, configuration management, or container orchestration using Kubernetes, Docker, Terraform, and Ansible
2+ years in Python or any other advanced programming language
Strong ability to lead, design, and execute cross-organization projects
Experience in managing container and infrastructure orchestration tools (e.g. Kubernetes, Terraform)
Hands-on experience administering public clouds (AWS, GCP, or Azure)
Experience with building CI/CD pipelines for applications and microservices (Jenkins/ArgoCD)
Experience with chaos, alerting & observability tools (Gremlin, PagerDuty, Opsgenie, New Relic, Coralogix).
This position is open to all candidates.
 
8255520
15/07/2025
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We are looking for a Site Reliability Engineering (SRE) & Production Team Leader to join our Engineering team. Someone who has a passion for observability, monitoring, automation, and high-availability systems, and who has a desire to solve complex technological challenges with a proactive approach to continuous improvement.
We use an interesting and mixed technology stack: Kubernetes, Terraform, CI/CD pipelines, Datadog, Prometheus, and cloud-native architectures.
In this position, you will use your expertise in building and scaling SRE operations, and will design, implement, and operate a world-class reliability strategy.
About Us
We are a key player in the network security field, striving to provide the leading SASE platform in the market. Our innovative approach, merging cloud and on-device protection, redefines how businesses connect in the era of cloud and remote work.
Key Responsibilities
Design, build, and manage our SRE framework to ensure observability, resilience, and high availability.
Develop and automate solutions for proactive monitoring, incident response, and performance optimization.
Improve and maintain our alerting and monitoring stack, leveraging tools like Datadog, Prometheus, and Grafana.
Lead post-mortem analysis and implement continuous improvement initiatives.
Collaborate with DevOps, Engineering, and Product teams to ensure smooth and efficient delivery of reliable services.
Requirements:
SRE & Production Manager with 5+ years of experience in SRE, Production Engineering, or DevOps, including 2+ years in a leadership role.
Experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
A problem solver, capable of finding creative solutions and getting things done.
Fluent with incident management, RCA processes, and operational best practices.
Experience with AWS (EKS, EC2, RDS, S3, networking configurations).
It would be great if you also have:
Experience in high-scale distributed systems.
Background in security and compliance for cloud infrastructure.
Understanding of cost optimization and resource management in cloud environments.
Familiarity with machine learning or predictive analytics for proactive reliability management.
Proficiency in Python, Go, or Bash for automation and scripting.
This position is open to all candidates.
 
8259881
15/07/2025
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We are a global leader in cybersecurity, dedicated to protecting organizations from cyber threats. Our team is at the forefront of developing innovative cloud solutions, and we are looking for a Senior DevOps Engineer to join our Cloud Network Security group.
Key Responsibilities
As a DevOps Engineer at our company, you will design, implement, and manage CI/CD pipelines, collaborate with cross-functional teams, and ensure the high availability and reliability of our cloud-based services and solutions.
Responsibilities:
Design, implement, and manage CI/CD pipelines to automate the deployment of SaaS
Collaborate with development, QA, and operations teams to ensure smooth and reliable software releases.
Monitor system performance and troubleshoot issues to ensure high availability and reliability of our services.
Implement and manage infrastructure as code (IaC) using tools like Terraform, CloudFormation and ARM.
Optimize system performance, scalability, and security.
Develop and maintain documentation for infrastructure and deployment processes.
Requirements:
Your Knowledge & Skills:
2-4 years of experience in DevOps or a related role, working with distributed systems and SaaS applications.
Proficiency with CI/CD tools such as Gerrit, GitLab CI, GitHub
Experience with cloud providers such as AWS, Azure, and GCP
Solid foundation in cloud account user management and cost optimization (FinOps principles)
Solid understanding of networking, security, and system administration.
Familiarity with logging and monitoring stacks (e.g., Elasticsearch, CloudWatch, Grafana, Prometheus).
Proficiency in scripting (Python, Bash) for automation and tooling.
Solid grasp of IaC & GitOps principles and best practices (Terraform, Helm, ArgoCD, Crossplane).
Knowledge of agile methodologies and practices
Strong knowledge of distributed systems, microservices, and orchestration technologies
Expertise in containerization and orchestration tools like Docker and Kubernetes
Mindset & Traits:
An innovative approach, with strong communication and collaboration skills
Independent, autodidact, and passionate about new DevOps challenges
Passion for automation, self-service, and continuous improvement
Comfortable working in fast-paced SaaS environments with cross-functional teams
Excellent problem-solving skills and attention to detail
Advantages:
Network Security background
Knowledge of our company's products.
Bachelor's degree in Computer Science or a related technical field
Certifications in AWS, Azure, or other relevant technologies.
This position is open to all candidates.
 
8259831
21/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a Site Reliability Engineer who excels at bridging the gap between infrastructure and development. In this role, you will work closely with engineering teams to ensure the reliability, scalability, and performance of our systems. A strong emphasis will be placed on observability: designing and implementing effective monitoring, logging, tracing, and alerting solutions to provide deep visibility into system behavior. You should be comfortable collaborating with developers, presenting technical insights, and helping shape best practices. Your responsibilities will include incident management, automation and improvement of our observability solutions, and continuous performance tuning to ensure our platform can scale and evolve with our business needs.

Role:
Ensure production systems meet or exceed established SLAs and SLOs by actively maintaining and enhancing system performance and uptime.
Design and maintain end-to-end observability systems (including monitoring, logging, and distributed tracing) to detect anomalies and enable proactive issue resolution.
Work closely with engineering teams to improve how their applications are monitored and alerted on. Help define meaningful alerts, reduce noise, and ensure developers are accountable for the operational health of their services.
Optimize application performance on Kubernetes through resource tuning, scaling strategies, and deep performance analysis.
Provide guidance on reliability-first design, instrumenting code for observability, and using Grafana dashboards to drive decision-making and incident response.
Requirements:
5+ years in SRE, DevOps, or Production Engineering roles
Deep expertise in AWS, Kubernetes, Linux
Experience deploying and tuning monitoring tools such as Prometheus, Thanos, and time-series databases for storing metrics.
Experience with logging stacks such as ELK, Loki, Grafana, or alternatives.
Experience with tracing tools (OpenTelemetry, Tempo, Jaeger).
Strong understanding of incident management processes and best practices.
Experience with automation tools and practices for deployment and infrastructure management.
Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
Ownership mindset, proactive and reliable
This position is open to all candidates.
 
8268431
15/07/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
Our company's Infinity External Risk Management, otherwise known as Cyberint, continuously reduces external cyber risk by managing and mitigating an array of digital threats with a unified solution.
At Cyberint, we help organizations protect their digital presence by delivering cutting-edge Attack Surface Management (ASM) and Threat Intelligence (TI) solutions. As a member of our R&D organization, you'll play a key role in ensuring the scalability, reliability, and performance of our cloud-native SaaS platform operating at scale.
Key Responsibilities
As a DevOps Engineer, you will be a core member of our DevOps & Infrastructure team, focused on building and maintaining distributed, scalable, and highly available systems in a dynamic SaaS environment. You will collaborate closely with development, QA, and support teams to enhance automation, improve CI/CD pipelines, and drive operational excellence across the board.
Key Responsibilities:
Design, build, and maintain infrastructure in a modern cloud-native SaaS ecosystem (primarily AWS).
Contribute to the scalability and reliability of distributed systems supporting high-volume data processing and real-time operations.
Develop and enhance CI/CD pipelines to support rapid and reliable deployments across multiple environments.
Implement and manage Infrastructure as Code (IaC) using Terraform for consistent, scalable infrastructure.
Operate and optimize Kubernetes (EKS) clusters to support distributed microservices architectures.
Monitor and respond to system alerts, troubleshoot issues, and contribute to incident prevention and response strategies.
Build self-service tools and automation frameworks to empower R&D teams and enhance delivery velocity.
Work cross-functionally with developers, QA, and support to ensure infrastructure meets evolving product needs.
Write and maintain scripts (Python, Bash) to automate recurring tasks and streamline operations.
Continuously identify and execute improvements in system performance, availability, and cost-efficiency.
Requirements:
Experience:
2-5 years of experience in DevOps, SRE, or infrastructure engineering roles, working with distributed systems and SaaS applications.
Hands-on experience with public cloud providers (AWS strongly preferred).
Production experience with tools such as Kubernetes, Terraform, CI/CD platforms (Jenkins, ArgoCD), and monitoring systems (Prometheus, Grafana).
Skills:
Solid grasp of Infrastructure as Code principles and best practices.
Strong knowledge of distributed systems, microservices, and orchestration technologies.
Proficiency in scripting (Python, Bash) for automation and tooling.
Familiarity with logging and monitoring stacks (e.g., Elasticsearch, Redis, CloudWatch, Grafana, Prometheus).
Awareness of DevOps security practices and cloud cost optimization strategies.
Mindset & Traits:
A strong sense of ownership and accountability for system health and performance.
Passion for automation, self-service, and continuous improvement.
Excellent communication and collaboration skills.
Comfortable working in fast-paced SaaS environments with cross-functional teams.
This position is open to all candidates.
 
8259928