דרושים » מחשבים ורשתות » Site Reliability Engineer 23874

משרות על המפה
 
בדיקת קורות חיים
VIP
הפוך ללקוח VIP
רגע, משהו חסר!
נשאר לך להשלים רק עוד פרט אחד:
 
שירות זה פתוח ללקוחות VIP בלבד
AllJObs VIP
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
לפני 7 שעות
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
We are looking for a Site Reliability Engineer (SRE) to join our Engineering team. Someone who has a passion for observability, monitoring, automation, and high-availability systems, and who has a desire to solve complex technological challenges with a proactive approach to continuous improvement.
We use an interesting and mixed technology stack: Kubernetes, Terraform, CI/CD pipelines, Datadog, Prometheus, and cloud-native architectures.
In this position, you will use your expertise in building and scaling SRE operations, and will design, implement, and operate a world-class reliability strategy.
About Us
we are a key player the network security field, striving to provide the leading SASE platform in the market. Our innovative approach, merging cloud and on-device protection, redefines how businesses connect in the era of cloud and remote work.
Key Responsibilities
Develop and maintain our monitoring, alerting, and logging systems, ensuring high visibility into production environments.
Implement automation to improve system reliability, scalability, and efficiency.
Troubleshoot and resolve production incidents, leading root cause analyses and implementing permanent fixes.
Collaborate with software engineers and DevOps teams to enhance application performance and resilience.
Continuously improve operational processes, focusing on reducing toil and improving reliability.
Requirements:
3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Hands-on experience with monitoring and observability tools like Datadog, Prometheus, and Grafana.
Strong understanding of Linux systems, networking, and cloud-native architectures.
Experience with Kubernetes, Terraform, and CI/CD pipelines.
A problem solver, capable of finding creative solutions and getting things done.
Fluent with incident management, RCA processes, and operational best practices.
It would be great if you also have:
Experience in high-scale distributed systems.
Background in security and compliance for cloud infrastructure.
Familiarity with AWS (EKS, EC2, RDS, S3, networking configurations).
Proficiency in Python, Go, or Bash for automation and scripting.
Understanding of cost optimization and resource management in cloud environments.
Familiarity with machine learning or predictive analytics for proactive reliability management.
This position is open to all candidates.
 
Hide
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8258448
סגור
שירות זה פתוח ללקוחות VIP בלבד
משרות דומות שיכולות לעניין אותך
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
1 ימים
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Site Reliability Engineer to join our DevOps team. You will ensure the reliability, performance, and scalability of our back-office solutions, which serve as the foundation for the entire purchasing process. This role will lead the development of SRE capabilities, meeting SLI/SLO/SLA targets, and establishing effective monitoring systems. You will enhance our Software Development Lifecycle by integrating reliability and scalability, working with cross-functional teams, and supporting production environments. Additionally, you will implement incident management processes and conduct post-mortem analyses to drive continuous improvement. If you have a strong engineering and automation background and are passionate about the E-commerce field, then we would love to hear from you.
Roles and Responsibilities:
Develop and implement SRE capabilities to enhance the reliability, availability, and performance of Admin solutions.
Design and maintain proactive monitoring and alerting systems for deep visibility into critical business flows, beyond simple statuses, to identify functional issues.
Drive improvements in the Software Development Lifecycle (SDLC) for reliability and scalability from design to deployment.
Collaborate with development and operations teams to troubleshoot production incidents affecting the purchase flow through root cause analysis.
Lead SRE initiatives to boost system resilience and operational efficiency.
Implement best practices for incident management and conduct blameless post-mortems, contributing to capacity planning and performance testing to ensure scalability.
Requirements:
5+ years of experience as a Site Reliability/DevOps Engineer
Deep understanding of E-commerce flows, specifically with back-office operations and order processing - must
Experience as an Automation/Software Engineer with a strong understanding of software development principles and in building, testing, and deploying distributed systems - must
Experience in designing, implementing, and utilizing monitoring and observability platforms such as DataDog, NewRelic, Prometheus/Grafana, or ELK stack - must
Proficiency in scripting and automation using languages such as Python, Java, etc. - must
Ability to create dashboards, alerts, and insightful queries - must
Experience with AWS services to build and operate scalable and resilient applications (e.g., EC2, ECS/EKS, RDS, S3, Lambda, CloudWatch) - plus
Experience in automating infrastructure provisioning, application deployments, and repetitive operational tasks - plus
Proactive approach with excellent problem-solving skills
Strong collaborator, with an ability to work with cross-functional teams
Proficient in English.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8255386
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
לפני 11 שעות
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
Cyberint, a market leader in External Risk Management, empowers global organizations to detect, respond, and remediate external threats efficiently. Now part of our company, Cyberint continues to grow and innovate at the intersection of cybersecurity and cloud-native SaaS technologies.
Join Our Operations Team
We are seeking a proactive, experienced Site Reliability Engineer (SRE) to join our dynamic Operations team. Youll be working on a cutting-edge SaaS solution that runs on AWS (EKS-based Kubernetes infrastructure), supporting an architecture with many moving parts. If you're driven by reliability engineering, love automation, and want to make an impact on mission-critical platforms, this role is for you.
What Youll Do
As an SRE at Cyberint, you will be instrumental in ensuring the observability, stability, and scalability of our platform. You will develop automated solutions and monitoring tools to proactively detect and respond to incidents, improve system resilience, and collaborate with engineering teams across the company to embed operational excellence into our product lifecycle.
Additionally, you will help evolve our AI-driven operational and monitoring tooling, including our on-call assistant bot, which leverages AI technologies to streamline incident resolution, automate repetitive tasks, and support real-time decision-making for engineers.
Key Responsibilities
Design, implement, and maintain monitoring and alerting systems (e.g., Prometheus, Grafana) to detect and prevent reliability issues.
Develop tools and automation (Python, Bash, etc.) for improving infrastructure reliability and operational efficiency.
Collaborate with R&D and Product teams to embed reliability-first principles into every stage of the development process.
Participate in and improve incident response processes, including running blameless postmortems and implementing preventive measures.
Enhance our Infrastructure-as-Code (IaC) and CI/CD practices to streamline deployments and reduce risk.
Maintain and extend internal AI-driven tools, such as bots that support SRE workflows (on-call management, triaging, etc.).
Document infrastructure, playbooks, and operational procedures to facilitate onboarding and knowledge sharing.
Requirements:
3+ years of experience in an SRE, DevOps, or similar role in a SaaS/cloud-native environment.
Strong experience with Kubernetes, AWS, and cloud-based distributed systems.
Hands-on experience building or maintaining monitoring stacks such as Prometheus, Grafana, ELK, etc.
Proficiency in Python, Bash, or similar scripting languages.
Experience with Infrastructure as Code tools (Terraform, Helm, etc.).
Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, ArgoCD).
Solid analytical and problem-solving skills with a passion for operational excellence.
Exposure to AI-based tooling (e.g., OpenAI API, LLM-based bots) to automate operations or enhance incident response processes.
Nice to Have
Experience with incident management platforms (e.g., PagerDuty).
Security-minded mindset and experience in the cybersecurity industry.
Experience with service mesh, zero-downtime deployments, or chaos engineering.
Contributions to AI-assisted SRE initiatives or platform operations & monitoring automation.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8257631
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
Location: Tel Aviv-Yafo
Job Type: Full Time
Were looking for a Site Reliability Engineer (SRE) to enhance the reliability, performance, and scalability of our production infrastructure. This role goes beyond keeping systems runningyoull be a key player in shaping the culture of reliability, driving self-healing mechanisms, proactive alerting strategies, and automation to reduce toil and improve operational efficiency. You'll work closely with engineering teams to ensure high availability, observability, and smooth incident management processes.
Responsibilities
Ensure reliability & scalability of our production environment across multiple cloud providers.
Define and implement SRE best practicesfostering a culture of ownership, continuous improvement, and automation.
Automate everythingfrom infrastructure deployment to self-healing mechanisms that eliminate manual intervention.
Design and improve observability solutions (monitoring, logging, tracing) to enable faster detection and resolution of issues.
Optimize alerting strategies to ensure actionable, high-quality alerts while minimizing noise and fatigue.
Improve system resilience, driving chaos engineering, failover strategies, and automatic recovery processes.
Enhance incident response processes, including on-call strategies, root cause analysis, and post-mortems to drive long-term stability.
Collaborate with development teams to build reliable, scalable, and efficient architectures, ensuring seamless deployment and rollback processes.
Promote a culture of reliability, educating teams on best practices, service ownership, and production-readiness.
Requirements:
3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Strong expertise in Kubernetes and container orchestration in production.
Hands-on experience with cloud platforms (AWS, Azure, or GCP).
Proven experience with monitoring & observability tools (Prometheus, ELK, Grafana, Coralogix, etc.).
Strong scripting/programming skills (Python, Go, Bash, or similar).
Experience with Infrastructure as Code (IaC)Terraform, Helm, or similar tools.
Track record of improving system reliability, scalability, and performance.
Experience designing and implementing self-healing mechanisms to minimize human intervention.
Ability to foster a strong reliability culture across engineering teams, leading by example.
Excellent problem-solving skills, with a proactive and ownership-driven mindset.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8228722
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
1 ימים
Location: Tel Aviv-Yafo and Netanya
Job Type: Full Time
At our company, were reinventing DevOps to help the worlds greatest companies innovate -- and we want you along for the ride. This is a special place with a unique combination of brilliance, spirit and just all-around great people. Here, if youre willing to do more, your career can take off. And since software plays a central role in everyones lives, youll be part of an important mission. Thousands of customers, including the majority of the Fortune 100, trust our company to manage, accelerate, and secure their software delivery from code to production -- a concept we call liquid software. Wouldn't it be amazing if you could join us in our journey?
We are looking for a Site Reliability Engineering Manager to lead our Israel SRE team. In this role, you'll drive best practices in reliability engineering, ensuring the stability, availability, and performance of our companys SaaS services. You'll collaborate with global SRE leaders, refine processes, and foster a culture of accountability and continuous improvement.
As a Site Reliability Engineering Manager at our company you will
Lead, mentor, and develop a high-performing SRE Israel team, fostering collaboration, innovation, and accountability
Ensure SaaS reliability, performance, and availability, meeting or exceeding service-level objectives
Drive SRE best practices, including capacity planning, incident management, chaos engineering, and disaster recovery
Implement proactive monitoring, alerting, and anomaly detection aligned with SaaS standards
Collaborate with P&E and Cloud engineering teams to embed reliability into the SDLC
Oversee incident management, ensuring swift identification, escalation, and resolution
Maintain comprehensive SRE documentation, including processes, incident reports, and system architecture
Evaluate and adopt tools, technologies, and methodologies to enhance uptime and reliability.
Requirements:
3+ years of management experience leading a team of SRE, DevOps, or a similar SaaS role
Bachelors degree in Computer Science, Engineering, or related field (or equivalent experience)
Strong expertise in cloud platforms (AWS, GCP, or Azure), containers (Kubernetes, Docker), and configuration management (Terraform, Ansible)
Proficiency in Python or Go for automation and system optimization, as well as GitOps experience with SCM tools (e.g., Git, Bitbucket)
Strong leadership, communication, and collaboration skills, working across globally distributed teams
Familiarity with Agile methodologies, CI/CD pipelines, and orchestration tools (Jenkins, ArgoCD, StackStorm)
Familiarity with Chaos Engineering (e.g., Gremlin, Litmus, Chaos Toolkit)
Hands-on with alerting & observability tools (e.g., PagerDuty, OpsGenie, New Relic, Coralogix)
Strong understanding of scalability, high availability, and security best practices in cloud & Kubernetes environments.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8255508
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
16/06/2025
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a DevOps DevOps Engineer to take ownership of our Cloud Infrastructure and Platform Engineering strategy, enabling high-scale, cutting-edge GenAI products running across 40+ Kubernetes clusters on GCP and AWS.
This role is a hands-on engineering , requiring deep expertise in cloud-native technologies, Kubernetes at scale, and modern DevOps principles. You will work closely with engineering teams to design and implement scalable infrastructure solutions, optimize developer workflows, and ensure reliability and efficiency across our platform.
Role and Responsibilities:
Cloud & Kubernetes Expertise: Design and implement highly scalable multi-cluster Kubernetes environments across GCP & AWS.
Developer Experience & Enablement: Lead the development of self-service tools and automation that improve efficiency for R&D teams.
Incident & Reliability Engineering: Work with engineering teams to optimize cost, performance, and reliability of production infrastructure through monitoring, capacity planning, and scaling strategies.
Security & Governance: Contribute to best practices for RBAC, IAM, cloud security, and compliance while ensuring infrastructure reliability.
Automation & Infrastructure as Code: Drive adoption of GitOps workflows and Infrastructure as Code (Terraform, Helm, Crossplane) to enhance automation and consistency.
Mentorship & Team Growth: Provide technical mentorship within the platform engineering team and contribute to knowledge-sharing across R&D.
Cross-Team Collaboration: Work closely with engineering teams to align cloud infrastructure goals with business needs and reliability requirements.
Technology Assessment: Assess and advocate for new technologies that improve reliability, efficiency, and scalability within the platform.
Requirements:
Technical Expertise:
5+ years of DevOps, or SRE experience
3+ years working with public cloud platforms (AWS, GCP) at scale
Deep Kubernetes expertise, including managing large-scale, multi-cluster enterprise-grade Kubernetes environments
Experience designing and managing Custom Resource Definitions (CRDs) and custom controllers
Strong background in Infrastructure as Code (Terraform, Helm) and GitOps principles (ArgoCD, Crossplane, FluxCD, etc.)
Hands-on experience in observability & monitoring (Prometheus, Grafana, Datadog, OpenTelemetry, etc.)
Proficiency in scripting & automation (Python, Go, Bash) for infrastructure automation
Expertise in cloud networking (VPC, load balancers, service meshes) and security best practices (RBAC, IAM, security groups, network policies, etc.)
Experience with CI/CD pipelines, optimizing for performance, security, and developer velocity
Nice-to-Have:
Experience with self-hosted on-prem deployments and managed private VPC deployments (Bring Your Own Cloud models)
Advanced expertise in Helm and Crossplane for Kubernetes resource management.
Other cloud provider experience
Experience in GenAI or large-scale SaaS platforms
Familiarity with SQL/NoSQL databases and distributed systems
DevSecOps experience, with a strong understanding of security automation and compliance frameworks.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8218726
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
18/06/2025
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a talented, self-driven and passionate Senior Infrastructure Engineer to build and maintain the cloud infrastructure for our highly available SaaS application as well as our machine learning and data engineering stack.

As a Senior Infrastructure Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure and DevOps processes that power our products and internal tooling. You will work closely with all data and development teams and lead the companys security and compliance vectors. You will ensure a highly reliable, scalable, and secure infrastructure that supports our rapid growth and product innovation, while maintaining observability and cost-effectiveness of our cloud resources and data.

What Youll Do

Cloud Infrastructure Management: Architect, deploy, and manage our cloud infrastructure (AWS), ensuring high availability, scalability, and security.
Software Engineering: Be a top notch SW engineer, harnessing your coding and architectural skills, as well as researching skills, for our infra stack.
Infrastructure as Code (IaC): Define and maintain infrastructure using tools like Terraform, CloudFormation, or Pulumi to manage resources efficiently and reproducibly.
Monitoring & Incident Management: Build and manage monitoring and alerting systems to ensure uptime, and respond to incidents with root cause analysis and remediation.
DevOps & Automation: Implement and maintain CI/CD pipelines to streamline development workflows and automate deployment processes across development, staging, and production environments, and across different parts of our solution. While our development teams are expected to write and maintain their own CI, you will act as a supervisor and professional authority, and maintain cross team and complex automation.
Collaboration and technical leadership: Partner with software engineers, data engineers, and machine learning teams to support their infrastructure needs and guide the evolution of our infrastructure team.
Cost Optimization: Monitor cloud spend and optimize resources to ensure cost-effective infrastructure without sacrificing performance or security.
Security & Compliance: Implement security best practices, including access control, network security, monitoring and ensuring the infrastructure is compliant with relevant industry standards (e.g., SOC2, GDPR).
דרישות:
5+ years of hands-on experience in cloud infrastructure, DevOps and platform engineering in production environments.
Expertise in managing cloud infrastructure on at least one of the major providers: AWS, GCP, Azure. Proficient in Infrastructure as Code tools such as Terraform, CloudFormation, or Pulumi.
Solid experience with Docker and Kubernetes.
Monitoring & Logging: Hands-on experience with monitoring tools (Prometheus, Grafana) and logging systems (ELK, Splunk, or equivalent).
Proficient Software engineering, architecture, as well as scripting languages such as Python, Bash, or Go. Full control of version control systems such as Git.
Strong experience with CI/CD pipelines and automation using Jenkins, CircleCI, GitHub Actions, GitLab CI, or similar.
Strong understanding of cloud networking, VPNs, VPCs, DNS, and firewalls.
Experience implementing cloud security best practices, including IAM, encryption, and key management.
Previous experience in a fast-paced startup environment, where adaptability and hands-on execution are key.
Strong communication skills and ability to work cross-functionally with different teams.
Advantages:

Experience supporting machine learning pipelines and deploying ML models to production environments.
Familiarity with data engineering tools like Apache Spark, Airflow, or similar ETL tools.
Experience with serverless technologies such as AWS Lambda, GCP Functions, or Azure Functions.
AWS Certified Security Specialty, or equivalent certifications in cloud security.
Experience and knowledge with regulatory complia המשרה מיועדת לנשים ולגברים כאחד.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8222324
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
17/06/2025
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time
Join our DeviantArt team as a Senior DevOps Engineer and play a pivotal role in maintaining and architecting a robust infrastructure that powers one of the largest online art communities. You'll be at the forefront of ensuring our platform's high availability, performance, and security, handling over 1.5 billion monthly page views.
The DeviantArt DevOps Team is a very small remote team that performs all tasks normally inclusive of SRE/DevOps/Infrastructure Engineers, with a bit of networking, security, and database administration mixed in. We are responsible for the day-to-day management and implementation of large-scale, mission-critical production systems that run on a public cloud.
This role requires wearing a lot of hats, and is equal parts fun and challenging. In this role, you will:
Architect and maintain a highly available infrastructure with a focus on proactive and reactive DDOS mitigation, autoscaling, self-healing, site performance, and cost optimization
Participate in a 24/7 on-call rotation, responding swiftly to outages or performance issues, and focus on less urgent alerts during normal work hours
Maintain and develop a developer environment and CI/CD pipelines in parity with production systems, for seamless testing and release of changes
Automate infrastructure provisioning and management using configuration management tools, complete with tests and documentation
Optimize and support sharded MySQL databases for efficient and reliable data handling amidst growing data reads and writes
Regularly update system components to avoid security issues and ensure up-to-date technology
We take our work seriously, but we dont take ourselves too seriously! We enjoy designing and building systems using open source tools and industry standards, and are in the fortunate position to be able to make decisions as a team about adopting newer technologies, and redesigning our infrastructure when appropriate.
This role is on a fully remote and distributed team, and asynchronous communication within and across teams is crucial. To be successful in this role, a candidate will need to work flexibly, balancing server and service issues, needs from development teams, security needs, and shifting priorities in our own tasks in managing our infrastructure.
Requirements:
5+ years of experience managing systems at scale as a DevOps Engineer, Site Reliability Engineer, or Platform Engineer
Excellent technical analytical skills with the ability to implement DDOS mitigation, troubleshoot complex problems, analyze system bottlenecks, and implement effective solutions, from frontend through backend systems, sometimes during production degradation or outage for a high traffic site
Exceptional command line Linux skills, with proficiency in Bash and Python for investigation of server and services issues, scripting, and automation
In-depth knowledge of AWS services, infrastructure as code using Terraform, GitOps tools and methodologies, and container orchestration using Docker, Helm, and Kubernetes
Experience with setup, administration, and maintenance of sharded MySQL database clusters while maintaining no downtime or data loss
Excellent communication skills with fluent English, and the ability to collaborate effectively across teams while articulating technical concepts to non-technical stakeholders
The ability to get up to speed on systems, make decisions, be flexible, and execute independently with attention to detail for production systems.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8220324
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
Location: Tel Aviv-Yafo
Job Type: Full Time
Required Site Reliability Engineer- Infra
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer- infra, on our Infrastructure team at the TLV office, you will play a key role in ensuring the reliability, scalability, and performance of our critical systems. You will be responsible for managing and improving our core infrastructure, with a focus on automation, monitoring, and incident response. You will work with a wide range of technologies, including Kubernetes, monitoring and observability tools, configuration management systems, and core networking services.
How youll make an impact:
As a Site Reliability Engineer, youll bring value by:
Ensure the reliability, availability, and performance of our infrastructure services.
Manage and maintain our Kubernetes infrastructure, including KubeVirt.
Design, implement, and maintain our monitoring and observability stack (SensuGo, VictoriaMetrics, Prometheus, ELK).
Automate infrastructure provisioning, configuration, and deployment processes using Puppet and Ansible.
Manage and maintain core services such as DNS and networking.
Troubleshoot and resolve complex infrastructure issues in a timely and efficient manner.
Participate in on-call rotations and incident response.
Develop and maintain infrastructure-as-code (IaC).
Identify and implement proactive measures to prevent incidents and improve system reliability.
Collaborate with development teams to ensure smooth and reliable deployments.
Contribute to the design and implementation of new infrastructure solutions.
Drive improvements in system architecture, processes, and tools.
Mentor and coach other team members.
Requirements:
To thrive in this role, youll need:
5+ years of experience in a Site Reliability Engineering, Systems Engineering, or similar role.
Deep understanding of Site Reliability Engineering principles and practices.
Extensive experience with Kubernetes, including deployment, management, and troubleshooting.
Strong experience with monitoring and observability tools such as SensuGo, Zabbix, VictoriaMetrics, Prometheus, and ELK.
Proficiency in configuration management tools such as Puppet and Ansible.
Solid understanding of Linux internals and networking.
Experience with managing and maintaining core services such as DNS and networking.
Strong programming skills in Python and/or Go.
Experience with both on-premises and cloud environments.
Experience with KubeVirt.
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
Ability to participate in on-call rotations including weekends.
Preferred Qualifications:
Experience with large-scale, distributed systems.
Experience with other cloud providers (e.g., AWS, Azure, GCP).
Contributions to open-source projects.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8205377
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time
Required Site Reliability Engineer
Realize your potential by joining the leading performance-driven advertising company!
As Site Reliability Engineer on the IT Production team in our TLV Office, youll play a vital role in building robust services and solving infrastructure challenges with automations while working with cutting-edge technologies and bringing those to their limits on our mostly on-prem cloud like infrastructure.
How youll make an impact:
As a Site Reliability Engineer, youll bring value by:
Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.
Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).
Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.
Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.
Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.
Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.
Our Tech Stack:
Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
Requirements:
7 years of experience as an SRE, DevOps Engineer, System Administrator in a large distributed environment with focus on Linux operating systems.
Experience supporting, troubleshooting and scaling large distributed systems in production.
Deep understanding of HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.
Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).
Deep understanding in Linux system internals and system performance tuning.
Experience with Configuration Management Tools (Puppet, Ansible, Chef, Terraform).
Experience programming in at least one of the following languages (Python, Golang, Rust, Ruby, C++, Java).
Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).
Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).
Experience with containerization technologies (Kubernetes, Docker).
Deep understanding of networking principles (TCP/IP, DNS, load balancing).
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8205371
סגור
שירות זה פתוח ללקוחות VIP בלבד
סגור
דיווח על תוכן לא הולם או מפלה
מה השם שלך?
תיאור
שליחה
סגור
v נשלח
תודה על שיתוף הפעולה
מודים לך שלקחת חלק בשיפור התוכן שלנו :)
חברה חסויה
Location: Tel Aviv-Yafo
Job Type: Full Time
At our company, we are building an AI-First company, leveraging cutting-edge technology to revolutionize climate intelligence.
our company's engineering department is focused on building life changing software and products at scale, from infrastructure that handles massive amounts of data to outstanding customer-centric user experiences in B2B, B2C and B2D products that change billions of lives worldwide.
We're looking for a DevOps Engineer who'll bring the dev stream from build through deploy to production, to maximum efficiency, reliability, security and stability. You'll integrate AI capabilities into our DevOps workflows, facilitate team independence, and maintain adaptive infrastructure that supports our business strategy. By continuously reducing costs and increasing velocity, you'll work in a team effort to achieve our goals. You'll help us make sure we are building the biggest weather platform in the world.
As a DevOps Engineer at our company, You'll:
Develop and adopt AI-powered tools to make Development and Operations processes more efficient
Collaborate with developers and weather scientists to optimize service performance, reliability, scale, security, and cost
Evolve and maintain adaptive cloud infrastructure to support our business strategy and enable smooth growth at scale
Build self-service platforms for scientists and developers to work independently
Introduce and integrate MLOps practices for GPU-based model deployment on Kubernetes
Maintain Production availability by participating in DevOps on-call shifts.
Requirements:
At least 3 years of experience as a DevOps/SRE Engineer in a Linux environment, experienced with AWS, GCP, or Azure and IaC, such as Terraform or Crossplane
Experience with CI/CD tools and deployment methodologies in Kubernetes
Strong sense of ownership and accountability for service reliability
Comfort with AI-powered development tools and willingness to experiment with new technologies and methods
Experience implementing and customizing monitoring systems (Datadog, Prometheus, ELK Stack)
Experience working in an agile environment with high-velocity teams
Proficiency with scripting languages like Python, Node.js, and Go
Adaptable problem-solving mindset - thriving in changing environments and requirements
Data-driven decision-making mindset, with a passion for leveraging AI, automation, and analytics to solve complex challenges.
This position is open to all candidates.
 
Show more...
הגשת מועמדותהגש מועמדות
עדכון קורות החיים לפני שליחה
עדכון קורות החיים לפני שליחה
8239733
סגור
שירות זה פתוח ללקוחות VIP בלבד