Location: Tel Aviv-Yafo
Job Type: Full Time
Required: Site Reliability Engineer - Infra
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer - Infra on our Infrastructure team at the TLV office, you will play a key role in ensuring the reliability, scalability, and performance of our critical systems. You will be responsible for managing and improving our core infrastructure, with a focus on automation, monitoring, and incident response. You will work with a wide range of technologies, including Kubernetes, monitoring and observability tools, configuration management systems, and core networking services.
How you'll make an impact:
As a Site Reliability Engineer, you will:
Ensure the reliability, availability, and performance of our infrastructure services.
Manage and maintain our Kubernetes infrastructure, including KubeVirt.
Design, implement, and maintain our monitoring and observability stack (SensuGo, VictoriaMetrics, Prometheus, ELK).
Automate infrastructure provisioning, configuration, and deployment processes using Puppet and Ansible.
Manage and maintain core services such as DNS and networking.
Troubleshoot and resolve complex infrastructure issues in a timely and efficient manner.
Participate in on-call rotations and incident response.
Develop and maintain infrastructure-as-code (IaC).
Identify and implement proactive measures to prevent incidents and improve system reliability.
Collaborate with development teams to ensure smooth and reliable deployments.
Contribute to the design and implementation of new infrastructure solutions.
Drive improvements in system architecture, processes, and tools.
Mentor and coach other team members.
Requirements:
To thrive in this role, you'll need:
5+ years of experience in a Site Reliability Engineering, Systems Engineering, or similar role.
Deep understanding of Site Reliability Engineering principles and practices.
Extensive experience with Kubernetes, including deployment, management, and troubleshooting.
Strong experience with monitoring and observability tools such as SensuGo, Zabbix, VictoriaMetrics, Prometheus, and ELK.
Proficiency in configuration management tools such as Puppet and Ansible.
Solid understanding of Linux internals and networking.
Experience with managing and maintaining core services such as DNS and networking.
Strong programming skills in Python and/or Go.
Experience with both on-premises and cloud environments.
Experience with KubeVirt.
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
Ability to participate in on-call rotations including weekends.
Preferred Qualifications:
Experience with large-scale, distributed systems.
Experience with other cloud providers (e.g., AWS, Azure, GCP).
Contributions to open-source projects.
This position is open to all candidates.
 
Similar jobs that may interest you
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
Required: Site Reliability Engineer
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer on the IT Production team in our TLV office, you'll play a vital role in building robust services and solving infrastructure challenges through automation, while working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.
How you'll make an impact:
As a Site Reliability Engineer, you will:
Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.
Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).
Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.
Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.
Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.
Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.
Our Tech Stack:
Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
Requirements:
7 years of experience as an SRE, DevOps Engineer, or System Administrator in a large distributed environment, with a focus on Linux operating systems.
Experience supporting, troubleshooting and scaling large distributed systems in production.
Deep understanding of HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.
Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).
Deep understanding of Linux system internals and system performance tuning.
Experience with Configuration Management Tools (Puppet, Ansible, Chef, Terraform).
Experience programming in at least one of the following languages (Python, Golang, Rust, Ruby, C++, Java).
Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).
Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).
Experience with containerization technologies (Kubernetes, Docker).
Deep understanding of networking principles (TCP/IP, DNS, load balancing).
This position is open to all candidates.
 
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We are seeking a skilled Site Reliability Engineer (SRE) to join our team and help build, maintain, and improve the reliability, scalability, and performance of our systems. As an SRE, you will be responsible for owning observability tools, driving incident management processes, and implementing automation to enhance our infrastructure. This role involves collaborating across teams to ensure a robust and efficient technology stack supporting mission-critical systems.

You will:
Proactively enhance system reliability, scalability, and performance through automation, monitoring, and capacity planning.
Develop and maintain observability systems, including distributed tracing, logging, and metrics platforms.
Establish and maintain organizational standards for monitoring, leveraging tools like Prometheus, Grafana, and OpenTelemetry.
Drive incident management, root cause analysis, and continuous improvement initiatives.
Partner with development teams to integrate reliability best practices into the software development lifecycle.
Manage infrastructure at scale in cloud services (AWS is an advantage) and on platforms like Kubernetes or ECS.
Optimize resource utilization to reduce costs while maintaining service quality.
Requirements:
At least 5 years of experience as an SRE.
Strong experience with Observability Tools: Proficiency with OpenTelemetry, Grafana, Prometheus, and ELK stack (Elasticsearch, Logstash, Kibana).
Experience with Cloud Platforms: In-depth knowledge of AWS services, including EC2, S3, RDS, and CloudFormation/Terraform for infrastructure-as-code.
Proficiency in scripting and/or development languages like Bash or Python.
Thorough understanding of CI/CD pipelines and automation tools.
Understanding of Infrastructure as Code, and strong experience with automation tools like Terraform and/or Ansible.
Solid troubleshooting and debugging skills.
A team player with a strong can-do mentality.
This position is open to all candidates.
 
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We're growing and looking to hire a Site Reliability Engineer (SRE) who embodies our core values: People First, Customer Obsession, Strive for Excellence, and Integrity.
We are looking for a skilled and motivated Site Reliability Engineer (SRE) to join our team and help ensure our production cloud environment's reliability, performance, and scalability. As an SRE, you will work at the intersection of software engineering and operations, taking ownership of system stability, incident response, automation, and continuous improvement of our infrastructure.
This role is ideal for engineers who thrive in dynamic environments, value reliability, and enjoy building resilient and scalable systems.
As an SRE, your impact will be:
Production Reliability: Ensure system uptime and performance by identifying and addressing potential issues before they affect end users.
Incident Response: Serve as part of the on-call rotation, rapidly diagnosing and resolving incidents, and conducting root cause analysis and postmortems.
Monitoring and Alerting: Build and maintain monitoring dashboards and alerting systems to detect and respond to anomalies in real time.
Automation and Tooling: Develop and maintain automation tools for deployments, scaling, and operational efficiency using Terraform, Ansible, Bash, or Python.
Infrastructure Maintenance: Perform regular maintenance and upgrades of production infrastructure to ensure security, stability, and performance.
Release Engineering: Support and optimize the rollout of new features and updates, minimizing risk and impact on production environments.
Staging Environment Management: Ensure staging environments accurately reflect production for robust testing and validation of changes.
Requirements:
Experience in SRE, DevOps, or production engineering roles
Strong skills in system troubleshooting, incident response, and root cause analysis
Proficiency with tools such as:
Jenkins, Terraform, Ansible, Git, GitHub
Bash, Python
AWS, ArgoCD, or similar CI/CD and cloud platforms
Familiarity with observability tools and practices (metrics, logging, tracing)
Ability to work effectively in cross-functional teams
Strong communication and documentation skills
Bachelor's degree in Computer Science, Information Technology, or a related field (preferred)
Familiarity with Agile development methodologies
This position is open to all candidates.
 
01/06/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Site Reliability Engineer (SRE) to join our Engineering team. Someone who has a passion for observability, monitoring, automation, and high-availability systems, and who has a desire to solve complex technological challenges with a proactive approach to continuous improvement.

We use an interesting and mixed technology stack: Kubernetes, Terraform, CI/CD pipelines, Datadog, Prometheus, and cloud-native architectures.

In this position, you will use your expertise in building and scaling SRE operations, and will design, implement, and operate a world-class reliability strategy.


Key Responsibilities
Develop and maintain our monitoring, alerting, and logging systems, ensuring high visibility into production environments.
Implement automation to improve system reliability, scalability, and efficiency.
Troubleshoot and resolve production incidents, leading root cause analyses and implementing permanent fixes.
Collaborate with software engineers and DevOps teams to enhance application performance and resilience.
Continuously improve operational processes, focusing on reducing toil and improving reliability.
Requirements:
3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Hands-on experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
Strong understanding of Linux systems, networking, and cloud-native architectures.
Experience with Kubernetes, Terraform, and CI/CD pipelines.
A problem solver, capable of finding creative solutions and getting things done.
Fluent with incident management, RCA processes, and operational best practices.
This position is open to all candidates.
 
20/05/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Senior DevOps Engineer to join our Cloud Network Security group.

Key Responsibilities
As a DevOps Engineer at Check Point, you will design, implement, and manage CI/CD pipelines, collaborate with cross-functional teams, and ensure the high availability and reliability of our cloud-based services and solutions.

Responsibilities:

Design, implement, and manage CI/CD pipelines to automate the deployment of SaaS.
Collaborate with development, QA, and operations teams to ensure smooth and reliable software releases.
Monitor system performance and troubleshoot issues to ensure high availability and reliability of our services.
Implement and manage infrastructure as code (IaC) using tools like Terraform, CloudFormation and ARM.
Optimize system performance, scalability, and security.
Develop and maintain documentation for infrastructure and deployment processes.
Requirements:
2-4 years of experience in DevOps or a related role, working with distributed systems and SaaS applications.
Proficiency with CI/CD tools such as Gerrit, GitLab CI, GitHub.
Experience with cloud providers such as AWS, Azure, GCP.
Solid foundation in cloud account user management and cost optimization (FinOps principles).
Solid understanding of networking, security, and system administration.
Familiarity with logging and monitoring stacks (e.g., Elasticsearch, CloudWatch, Grafana, Prometheus).
Proficiency in scripting (Python, Bash) for automation and tooling.
Solid grasp of IaC & GitOps principles and best practices (Terraform, Helm, ArgoCD, Crossplane).
Knowledge of agile methodologies and practices.
Strong knowledge of distributed systems, microservices, and orchestration technologies.
Expertise in containerization and orchestration tools like Docker and Kubernetes.
This position is open to all candidates.
 
2 days ago
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We're on a mission to redefine vehicle safety and reliability on a global scale. Founded in 2016, we have pioneered the world's first fully automated suite of vehicle inspection systems. At the heart of this innovation lies our advanced AI-driven technology, representing the pinnacle of machine learning, GenAI, and computer vision within the automotive sector. With close to $400M in funding and strategic partnerships with industry giants such as Amazon, General Motors, Volvo, and CarMax, our company stands at the forefront of automotive technological advancement. Our growing global team of over 200 employees is committed to creating a workplace that celebrates diversity and encourages teamwork. Our drive for innovation and pursuit of excellence are deeply embedded in our vibrant company culture, ensuring that each individual's efforts are recognized and valued as we unite to build a safer automotive world.
We are seeking a highly motivated and skilled Release Engineer to join our AIOps group. In this role, you'll play a critical part in bridging the gap between development and operations, ensuring the seamless qualification, deployment, and monitoring of our AI algorithms and infrastructure, and be responsible for the end-to-end operationalization of our core technology.
A day in the life and how you'll make an impact:
* Manage the end-to-end release process of Machine Learning algorithms and infrastructure components, from qualification through deployment.
* Validate and test new algorithm releases to ensure they meet performance, stability, and compliance standards.
* Create and execute deployment plans across various environments (staging, production), ensuring minimal risk and downtime.
* Collaborate closely with AI researchers, MLOps, and software engineers to understand release requirements, share feedback, and resolve pre-release issues.
* Identify and drive automation opportunities within the release pipeline to improve efficiency, reliability, and traceability.
* Oversee updates to infrastructure components, ensuring compatibility and performance across systems.
* Monitor deployments, proactively identify issues related to model behavior or infrastructure anomalies, and drive resolution with relevant teams.
* Maintain clear and accurate release documentation, including version history, deployment notes, and incident reports.
Requirements:
* Bachelor's degree in Computer Science, Software Engineering, or industry equivalent.
* 2+ years of experience in QA and automation.
* Proficiency in scripting languages (e.g., Python, Bash).
* Experience with containerization technologies (e.g., Docker, Kubernetes).
* Familiarity with CI/CD pipelines (e.g., GitLab CI/CD, Jenkins).
* Experience with cloud platforms (e.g., AWS, GCP).
* Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
* Excellent problem-solving skills and attention to detail.
* Strong communication and collaboration skills, with the ability to work effectively with cross-functional teams.
Bonus if you have:
* Strong understanding of the machine learning lifecycle, from experimentation to deployment and monitoring.
* Experience with specific MLOps platforms or tools.
* Experience in a fast-paced startup environment.

Why us: Pioneer Advanced Solutions: Harness cutting-edge technologies in AI, Machine Learning, and computer vision to revolutionize vehicle inspections. Drive Global Impact: Your innovations will play a crucial role in enhancing automotive safety and reliability, impacting lives and businesses on an international scale. Career Growth Opportunities: Participate in a journey of rapid development, surrounded by groundbreaking advancements and strategic industry partnerships.
This position is open to all candidates.
 
01/06/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Site Reliability Engineering (SRE) & Production Team Leader to join our Engineering team. Someone who has a passion for observability, monitoring, automation, and high-availability systems, and who has a desire to solve complex technological challenges with a proactive approach to continuous improvement.

We use an interesting and mixed technology stack: Kubernetes, Terraform, CI/CD pipelines, Datadog, Prometheus, and cloud-native architectures.

In this position, you will use your expertise in building and scaling SRE operations, and will design, implement, and operate a world-class reliability strategy.

Key Responsibilities
Design, build, and manage our SRE framework to ensure observability, resilience, and high availability.
Develop and automate solutions for proactive monitoring, incident response, and performance optimization.
Improve and maintain our alerting and monitoring stack, leveraging tools like Datadog, Prometheus, and Grafana.
Lead post-mortem analysis and implement continuous improvement initiatives.
Collaborate with DevOps, Engineering, and Product teams to ensure smooth and efficient delivery of reliable services.
Requirements:
SRE & Production Manager with 5+ years of experience in SRE, Production Engineering, or DevOps, including 2+ years in a leadership role.
Experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
A problem solver, capable of finding creative solutions and getting things done.
Fluent with incident management, RCA processes, and operational best practices.
Experience with AWS (EKS, EC2, RDS, S3, networking configurations).
This position is open to all candidates.
 
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
As a Senior Site Reliability Engineer, you'll play a key role in shaping our new Production Reliability domain. You'll drive reliability initiatives, lead cross-team projects, and make sure our SaaS platform stays robust, scalable, and efficient. This is a high-impact, hands-on role that demands technical expertise and a proactive approach.

As a Senior SRE, you will:

Design, build, and maintain scalable, fault-tolerant systems.
Define and enforce SLOs, SLIs, and SLAs and drive improvements based on real data.
Build automation and tooling to enhance observability, testing, and deployments.
Lead complex incident responses, including on-call rotations and postmortems.
Collaborate closely with engineering, product, and support teams to embed reliability into everything we do.
Mentor engineers and promote operational excellence across the organization.
Requirements:
Have 7+ years of experience in SRE, DevOps, or Production Engineering roles, ideally in SaaS environments.
Bring deep expertise in resilience engineering, monitoring, and building fault-tolerant systems.
Are hands-on with monitoring tools like Datadog, Dynatrace, OpenSearch, Coralogix, or Sentry.
Are experienced with CI/CD tools like Jenkins or ArgoCD.
Are proficient with infrastructure-as-code tools like Terraform or Crossplane.
Have strong knowledge of Linux systems and networking fundamentals.
Have solid experience with cloud platforms (AWS preferred).
Are an advanced coder in Java (Python or Go is a plus).
Know Kubernetes and the broader CNCF ecosystem inside out.
Excel at debugging and root cause analysis.
Are fluent in Hebrew and English.
Bring a high sense of ownership and accountability to everything you do.
This position is open to all candidates.
 
25/05/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
At UVeye, we're on a mission to redefine vehicle safety and reliability on a global scale. Founded in 2016, we have pioneered the world's first fully automated suite of vehicle inspection systems. At the heart of this innovation lies our advanced AI-driven technology, representing the pinnacle of machine learning, GenAI, and computer vision within the automotive sector. With close to $400M in funding and strategic partnerships with industry giants such as Amazon, General Motors, Volvo, and CarMax, UVeye stands at the forefront of automotive technological advancement. Our growing global team of over 200 employees is committed to creating a workplace that celebrates diversity and encourages teamwork. Our drive for innovation and pursuit of excellence are deeply embedded in our vibrant company culture, ensuring that each individual's efforts are recognized and valued as we unite to build a safer automotive world.
We are looking for a DevOps Engineer to join our DevOps R&D team. In this position, you will be responsible for integrating developers and operations teams to improve collaboration and productivity by automating infrastructure, automating workflows, and continuously measuring application performance.
A day in the life and how you’ll make an impact:
* Establish, maintain, and evolve concepts in continuous integration and deployment (CI/CD) pipelines for existing and new services.
* Collaborate with Engineering and Operations teams to improve automation of workflows, infrastructure, code testing, and deployment of on-premise and cloud services.
* Remain up-to-date on industry trends, share knowledge among teams, and abide by industry best practices for configuration management and automation.
* Implement effective monitoring and increase the sophistication of our alerting and escalation mechanisms.
* Identify and resolve performance and scalability issues in products and infrastructure.
Requirements:
* 5+ years of experience in systems and production engineering, and 3+ years of DevOps experience in a Linux environment
* Experience maintaining and deploying highly available, fault-tolerant systems at scale
* Experience developing in Python and scripting in Bash
* Practical experience with Docker containerization and clustering (Kubernetes)
* Experience with configuration management tools (e.g. Ansible, Terraform)
* Experience implementing CI/CD (e.g. Jenkins, GitHub Actions, Bitbucket Pipelines)
* Experience with cloud providers (e.g. AWS, GCP)
Ideally, we’re looking for:
* Bachelor's or master’s degree in CS
* AWS Certification
* Experience working in and advocating for agile environments
* Knowledge of Linux Kernel fundamentals, including job management, memory management, file systems, networking & debugging

Why UVeye: Pioneer Advanced Solutions: Harness cutting-edge technologies in AI, machine learning, and computer vision to revolutionize vehicle inspections. Drive Global Impact: Your innovations will play a crucial role in enhancing automotive safety and reliability, impacting lives and businesses on an international scale. Career Growth Opportunities: Participate in a journey of rapid development, surrounded by groundbreaking advancements and strategic industry partnerships.
This position is open to all candidates.
 
20/05/2025
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a results-driven DevOps Team Lead to head the DevOps and Infrastructure team within our R&D organization. This role requires strategic vision, technical expertise, and a proactive approach to drive operational excellence, empower teams with robust tools and automation, and ensure high system reliability and scalability.

As the DevOps Team Lead, you will be instrumental in delivering critical KPIs, including system uptime, automation, incident management, and collaboration with development and QA teams to enable self-sufficiency. Additionally, you will serve as the leader for strategic projects, identifying opportunities to improve infrastructure and operational processes, setting long-term goals, and executing initiatives that align with Cyberint's business objectives and growth.

Key Responsibilities
Strategic Leadership

Identify and lead strategic projects to enhance Cyberint's platform scalability, reliability, and operational efficiency.
Develop and execute a roadmap for critical infrastructure and DevOps initiatives that drive business success.
Collaborate with senior stakeholders to align projects with organizational priorities and deliver measurable outcomes.
System Reliability & Uptime

Lead initiatives to ensure system reliability, minimize disruptions, and maintain high availability for Cyberint's SaaS platform.
Establish and manage proactive monitoring, alerting, and preventive maintenance strategies.
Drive incident prevention efforts, ensuring robust failover and disaster recovery mechanisms.
Develop and maintain playbooks to enable rapid diagnosis and resolution of issues.
Automation, Infrastructure as Code (IaC), & Self-Service Enablement

Champion the adoption of automation and IaC to streamline infrastructure management and deployments.
Build and enhance self-service tools and frameworks, empowering R&D teams to operate independently with minimal reliance on DevOps.
Continuously improve CI/CD pipelines to optimize deployment speed and reliability.
Collaboration & Support for Self-Sufficiency

Collaborate closely with development, QA, and support teams to deliver tools and frameworks that promote team autonomy and efficiency.
Advocate for cross-functional engagement to align operational processes with R&D objectives.
Provide training and mentorship to teams on using DevOps tools effectively.
Accountability, Ownership, & Scalability

Take ownership of all systems and infrastructure, ensuring solutions are scalable, resilient, and aligned with Cyberint's growth objectives.
Establish clear accountability frameworks for maintaining infrastructure and delivering on key projects.
Design and execute a roadmap to support self-service-oriented and scalable solutions.
Requirements:
5+ years of experience in DevOps or SRE roles, with 2+ years in a leadership capacity.
Proven expertise in building and maintaining highly available, cloud-native environments (AWS preferred).
Experience with Kubernetes, Terraform, CI/CD pipelines, and monitoring technologies and tools (Prometheus, Grafana, Jenkins, ArgoCD, Elasticsearch, Redis, EKS, etc.).
Skills & Expertise

Strong understanding of automation, Infrastructure as Code (IaC), and self-service enablement.
Expertise in incident management and a track record of delivering reliable, scalable systems.
Hands-on experience with scripting and automation tools (Python, Bash).
Deep understanding of containerization, orchestration, and cloud-native architectures.
Familiarity with cost monitoring and optimization strategies to ensure infrastructure is both efficient and cost-effective.
Knowledge of security best practices for infrastructure and DevOps environments.
This position is open to all candidates.
 