Senior Site Reliability Engineer (Cortex)

18 hours ago
Location: Tel Aviv-Yafo
Job Type: Full Time
Join a team of senior engineers operating in a large-scale, multi-cloud production environment supporting tens of thousands of enterprise customers worldwide. This is not a typical SRE role - you'll work at the core of a complex, high-impact system alongside experienced DevOps professionals in a fast-paced, cybersecurity-focused organization.
Your Impact:
Own and operate large-scale, global production environments across multiple cloud providers (GCP, AWS, Azure)
Actively monitor, investigate, and resolve incidents triggered by automated alerting systems (PagerDuty / Incident Response)
Drive end-to-end troubleshooting across complex, distributed systems with high context switching
Design, deploy, and improve monitoring and observability systems (e.g., Prometheus, Grafana) - not just react to alerts
Collaborate closely with internal teams (CX, CS, Engineering) to ensure system reliability and performance
Work hands-on with modern DevOps and infrastructure tools including Kubernetes, Terraform, CI/CD pipelines, and GitOps workflows
Develop and maintain automation and tooling (primarily in Python)
Gain deep understanding of system architecture and interconnected services
Contribute to a culture of operational excellence in a high-scale, high-availability environment
On-call responsibilities:
Daytime hours (12:00-20:00)
Occasional weekends and holidays (rotation-based).
Requirements:
Your experience:
5+ years of experience in SRE roles in production environments at scale
Strong hands-on experience with Kubernetes and Terraform
Strong hands-on experience with at least one major cloud platform (GCP or AWS required)
Experience building and configuring monitoring systems (e.g., Prometheus, Grafana)
Familiarity with CI/CD and GitOps tools (GitLab CI, GitHub Actions, Jenkins, Flux)
Proficiency in Python for scripting and automation
Strong troubleshooting and problem-solving skills with a passion for incident handling
Ability to work in fast-paced environments with high context switching
Highly responsive, proactive, and ownership-driven
Strong collaboration and communication skills
Curious mindset and eagerness to learn.
This position is open to all candidates.
 
8704900
Similar jobs that may interest you
24/05/2026
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for an experienced SRE Team Lead to drive the reliability, observability, and automation practices across our private cloud infrastructure and operations. In this role, you will lead a team of site reliability engineers, own the engineering roadmap for monitoring and automation, and act as a key liaison between development, operations, and platform teams. You bring at least 3-4 years of hands-on people management experience and a deep technical background in SRE or DevOps disciplines.



What will you do?

Leadership & Team Management

Lead, mentor, and grow a team of SREs, providing technical direction, career development guidance, and day-to-day management.

Own the team roadmap for reliability, observability, and automation initiatives - prioritizing work, removing blockers, and driving delivery.

Conduct regular 1:1s, performance reviews, and hiring processes to build and sustain a high-performing team.

Foster a culture of operational excellence, blameless post-mortems, and continuous improvement.

Act as an escalation point for complex incidents and reliability issues, leading post-incident reviews and ensuring follow-through on action items.


Automation & Infrastructure

Design, develop, and maintain automation tools to support infrastructure and operations teams at scale.

Manage pipelines and infrastructure workflows using Jenkins, Ansible, Python, and Bash.

Drive the adoption of infrastructure-as-code practices across the organization.

Collaborate with system engineers to improve scalability, performance, and fault tolerance of critical systems.


Monitoring & Observability

Build and extend monitoring and alerting systems using Grafana, the ELK (Elastic) stack, Zabbix, and custom scripts.

Implement and enforce observability best practices to ensure full visibility into systems, applications, and infrastructure.

Define and track SLIs, SLOs, and error budgets across key services.

Partner with development teams to embed observability earlier in the software development lifecycle.
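
For readers less familiar with the terms in this list, the relationship between an SLO and its error budget can be sketched in a few lines of Python. This is only an illustration: the function name, the 99.9% target, and the request counts are assumptions and do not come from the posting.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    # The budget is the number of failures the SLO tolerates over the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: a 99.9% availability SLO, 1,000,000 requests, 250 failures.
# The SLO tolerates about 1,000 failures, so roughly three quarters of the budget is left.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.1%}")
```

Here the SLI is the measured success ratio, the SLO is the target for it, and the error budget is whatever failure allowance the SLO leaves over.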


Database & Platform Support

Support monitoring and infrastructure integration for databases including MongoDB and PostgreSQL.

Maintain documentation and champion knowledge sharing around automation, monitoring, and reliability practices.
Requirements:
What you need:

Experience & Leadership

3-4+ years of experience in a people management or team lead capacity within SRE, DevOps, or infrastructure engineering.

5-8+ years of overall experience in SRE, DevOps, or infrastructure automation roles.

Proven track record of building, coaching, and retaining high-performing engineering teams.

Experience owning an engineering roadmap and driving cross-functional reliability initiatives.



Technical Skills

Strong scripting skills in Python and Bash; comfortable building and maintaining production-grade automation.

Hands-on experience with infrastructure automation tools, particularly Ansible.

Solid experience with monitoring and observability platforms - ELK stack, Grafana, and Zabbix.

Good understanding of CI/CD pipelines and related tooling, including Jenkins.

Familiarity with managing and monitoring MongoDB and PostgreSQL in a production environment.

Comfortable working in Linux-based environments.

Excellent problem-solving skills and strong written and verbal communication.



Ability to support the following:

Experience with cloud providers - AWS, GCP, or Azure.

Exposure to containerization technologies such as Docker and Kubernetes.

Familiarity with infrastructure provisioning using Terraform.

Experience introducing SRE practices (SLOs, error budgets, chaos engineering) at an organizational level.

Exposure to and experience with migrating or building AI tools to improve processes.
This position is open to all candidates.
 
8662300
1 day ago
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
As a Senior IT SRE Engineer, you will be a key player in ensuring the reliability, scalability, and performance of our critical IT infrastructure. You will leverage SRE principles and an automation-first mindset to build and maintain resilient hybrid cloud environments. This role is ideal for a candidate who thrives in a fast-paced, innovative setting and is passionate about solving complex challenges with cutting-edge technology.
Key Responsibilities
Provision, configure, and support resilient hybrid cloud deployment architectures using an Infrastructure-as-Code framework.
Proactively collaborate with development teams to ensure new applications are production-ready, scalable, and reliable from inception.
Develop and maintain tools and frameworks to automate operational tasks, including deployment, monitoring, and recovery.
Conduct thorough root cause analysis of production issues and implement preventative measures to improve system resilience, demonstrating strong problem-solving skills.
Manage CI/CD platforms, Linux infrastructure, and contribute to capacity planning and operational runbooks.
Design and implement proactive service monitoring, alerting, and trend analysis to maintain service availability and performance SLAs.
Participate in an on-call rotation to support critical applications and services, responding to and resolving incidents efficiently.
Contribute to comprehensive documentation related to infrastructure design, deployment, and operational procedures.
Requirements:
Your Experience:
Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
6+ years of DevOps engineering experience on mission-critical, enterprise-level systems in a hybrid (both cloud and on-prem) environment.
3+ years of hands-on experience with cloud environments, preferably Google Cloud Platform (GCP).
Expertise in configuration management and Infrastructure-as-Code using frameworks such as Terraform and Ansible.
Strong programming/scripting knowledge in languages like Python, Bash, or Go for infrastructure automation.
Demonstrated experience with CI/CD pipelines (e.g., GitHub, Jenkins, Artifactory) and a strong foundation in Linux/Unix administration.
Preferred Qualifications
Experience with containerization and orchestration technologies, particularly Kubernetes.
Hands-on experience with monitoring and observability tools such as Datadog, Grafana, or Prometheus.
Understanding of networking principles including firewalls, load balancers, and complex network designs.
A curious and positive mindset with a passion for applied learning and challenging existing processes for continuous improvement.
This position is open to all candidates.
 
8703154
24/05/2026
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for an experienced SRE Engineer to drive the reliability, observability, and automation practices across our private cloud infrastructure and operations. In this role, you will be part of a team of site reliability engineers, own the engineering roadmap for monitoring and automation, and act as a key liaison between development, operations, and platform teams. You bring 4+ years of hands-on people management experience and a deep technical background in SRE or DevOps disciplines.

What will you do?

Automation & Infrastructure
- Design, develop, and maintain automation tools to support infrastructure and operations teams at scale.
- Manage pipelines and infrastructure workflows using Jenkins, Ansible, Python, and Bash.
- Drive the adoption of infrastructure-as-code practices across the organization.
- Collaborate with system engineers to improve scalability, performance, and fault tolerance of critical systems.

Monitoring & Observability
- Build and extend monitoring and alerting systems using Grafana, the ELK (Elastic) stack, Zabbix, and custom scripts.
- Implement and enforce observability best practices to ensure full visibility into systems, applications, and infrastructure.
- Define and track SLIs, SLOs, and error budgets across key services.
- Partner with development teams to embed observability earlier in the software development lifecycle.

Database & Platform Support
- Support monitoring and infrastructure integration for databases including MongoDB and PostgreSQL.
- Maintain documentation and champion knowledge sharing around automation, monitoring, and reliability practices.
Requirements:
What you need:

4+ years of overall experience in SRE, DevOps, or infrastructure automation roles.

Strong scripting skills in Python and Bash; comfortable building and maintaining production-grade automation.

Hands-on experience with infrastructure automation tools, particularly Ansible.

Solid experience with monitoring and observability platforms - ELK stack, Grafana, and Zabbix.

Good understanding of CI/CD pipelines and related tooling, including Jenkins.

Familiarity with managing and monitoring MongoDB and PostgreSQL in a production environment.

Comfortable working in Linux-based environments.

Excellent problem-solving skills and strong written and verbal communication.


Ability to support the following:
Experience with cloud providers - AWS, GCP, or Azure.
Exposure to containerization technologies such as Docker and Kubernetes.
Familiarity with infrastructure provisioning using Terraform.
Experience introducing SRE practices (SLOs, error budgets, chaos engineering) at an organizational level.
Exposure to and experience with migrating or building AI tools to improve processes.
This position is open to all candidates.
 
8662378
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
Required DevOps Engineer
About the role:
Our DevOps team operates the infrastructure that powers our AI and Computer Vision platform across construction sites in 15+ countries. From data pipelines and ML workloads to backend services - you'll work with a diverse, modern, Kubernetes-based stack and have real influence on how we build, deploy, and operate.
What you'll do:
Own Multi-Cloud Infrastructure: Work alongside the team to design, scale, and operate our high-scale, multi-region production infrastructure across AWS and GCP, powering construction sites globally.
Drive Kubernetes at Scale: Manage and evolve our Kubernetes platform on EKS, leveraging GitOps practices with ArgoCD and Helm to enable safe, fast, and reliable deployments.
Build Robust CI/CD: Design and maintain CI/CD pipelines that empower dozens of engineers to ship confidently - with automation, testing, and progressive delivery built in.
Tackle Diverse Infrastructure Challenges: Work hands-on with a wide variety of workloads - from heavy data processing and Computer Vision pipelines to backend services and ML inference - each with unique scaling, performance, and reliability requirements.
Ensure Reliability & Observability: Build and maintain world-class observability (metrics, logs, tracing, alerting) so that issues are caught early and resolved fast. Performance, reliability, and scalability are at the core of what you do.
Security & Cost: Partner with the team to strengthen our security posture, identity and access management, compliance, and cloud cost optimization across both clouds.
Ownership from 0 to 1: You will have real influence over our architecture and tooling. We want engineers who care about shaping what we build and how we build it, ensuring performance, security, and observability are baked in from day one.
Requirements:
A seasoned DevOps / Infrastructure engineer (5+ years) with strong hands-on experience in production cloud environments.
Proven expertise operating large-scale, distributed systems - with deep understanding of Kubernetes, networking, and cloud-native architecture.
Strong experience with multi-cloud environments (AWS and/or GCP), Infrastructure-as-Code (Terraform), and GitOps workflows (ArgoCD, Flux, or similar).
Hands-on experience with CI/CD systems (Jenkins, GitHub Actions, etc.).
Solid scripting and automation skills (Python, Bash, or Go).
Proven track record of being a collaborative team player who partners closely with developers, ML engineers, and cross-functional stakeholders across the organization.
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Logz.io, or similar).
Experience with databases (relational and/or NoSQL) - including operational aspects like backups, migrations, and performance tuning.
AI-Native Engineering: You are an AI-native engineer who leverages LLMs and agentic tools (like Cursor, Copilot, or Claude) not just for command completion, but as a core operational partner - automating diagnostics, runbooks, and infrastructure workflows so you can focus on the critical things.
This position is open to all candidates.
 
8670484
16 hours ago
Location: Tel Aviv-Yafo
Job Type: Full Time
Your Career:
Own and continuously improve AWS production infrastructure for scalability, reliability, security, performance, and cost.
Run and evolve Kubernetes environments that support fast, safe product delivery.
Drive developer velocity and production safety through better CI/CD pipelines, release workflows, deployment visibility, and GitOps practices.
Improve observability and incident response - reduce alert noise and raise signal quality.
Design and ship AI-assisted operational agents that change how engineers work - triaging monitoring alerts, summarizing incidents, proposing fixes, onboarding new services, answering questions and requests. This is a core part of the role, not a side project.
Build automation and self-service tooling that removes manual work from provisioning, monitoring, incident response, and developer workflows.
Analyze operational data across incidents, alerts, deployments, infra health, and cost to find reliability gaps, inefficiencies, and automation opportunities.
Partner with engineering, security, product, and leadership to remove bottlenecks and support safe production growth.
Evaluate and introduce new tools and AI-assisted approaches, balancing innovation with reliability, cost, and operational simplicity.
Your Impact:
You'll help scale production systems, improve deployment velocity and reliability, reduce operational overhead, and build automation and AI workflows that help engineering teams move faster and operate more efficiently.
This role is a strong fit for someone who enjoys ownership, collaboration, and operational innovation.
Requirements:
Your Experience:
4+ years operating production infrastructure in AWS.
Deep hands-on experience with Kubernetes, Helm, ArgoCD, Terraform, and CI/CD.
Strong experience with observability and alerting in Datadog or comparable platforms.
Solid grounding in Linux, networking, cloud security, and reliability best practices.
Strong scripting skills in Python and Bash.
Proven ability to own platform projects end-to-end, from design through production operation and ongoing improvement.
Strong troubleshooting across distributed systems, Kubernetes, CI/CD, and live incidents.
Collaborative mindset - comfortable working across engineering, security, product, and leadership.
Comfort in a fast-paced, high-ownership environment where priorities shift but production quality doesn't.
Genuine interest in applying AI, automation, and intelligent workflows to operational work.
Key qualities
Ownership-driven - You take responsibility for the systems you build and operate, from design through production support and continuous improvement.
Collaboration - You work effectively across engineering, security, product, and leadership to align priorities and drive shared outcomes.
Developer experience focus - You are committed to reducing friction for engineering teams through thoughtful automation, self-service workflows, and reliable internal tooling.
Innovation balanced with pragmatism - You actively explore new approaches, particularly in AI-assisted operations, while weighing them against reliability, maintainability, and operational simplicity.
Security mindset - You design and build with least privilege, auditability, and production safety as foundational principles rather than afterthoughts.
Clear communication - You articulate infrastructure, reliability, cost, and security tradeoffs precisely to both technical and non-technical stakeholders.
This position is open to all candidates.
 
8705246
Location: Tel Aviv-Yafo
Job Type: Full Time
We are seeking a skilled and motivated DevOps engineer with deep familiarity in the streaming ecosystem to join our elite infrastructure team. If you're excited by the challenge of operating mission-critical systems at scale and optimizing the developer experience through automation and tooling, we'd love to hear from you.

What you will do:

Automate Deployment and Operation
Oversee deployment of Kafka and RabbitMQ clusters (including Confluent Cloud & CFK). Build automation pipelines to ensure repeatability and resiliency across environments.

Monitor and Support Production Systems
Own production stability of global Kafka clusters. Handle on-call rotations, incident management, troubleshooting, and scaling challenges.

Improve Infrastructure Observability
Build and maintain observability systems: dashboards, alerting pipelines, metrics collection (Prometheus, Grafana, etc.).

Optimize System Performance
Collaborate with peers on benchmarking and optimization initiatives. Work on tuning Kafka brokers, cluster configurations, and runtime parameters.

Provide Developer Support and Training (Infra-focused)
Help developers configure topics, quotas, and consumers appropriately. Train service owners to interpret monitoring data and avoid pitfalls.

Develop and Maintain Infrastructure
Contribute to building infrastructure tools and scripts (IaC, Helm charts, etc.) that make provisioning and managing clusters reliable and efficient.

Secure Infrastructure Access
Configure and maintain secure access patterns across streaming infrastructure, ensuring proper authentication and role-based access controls are enforced for both developers and services.
Requirements:
What we expect:

8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.

Deep hands-on Kafka experience, including deploying, maintaining, scaling, and monitoring clusters.

Experience with RabbitMQ.

Extensive experience with Docker, Kubernetes, Helm, and GitOps-style deployments.

Infrastructure as Code experience (Terraform, Pulumi, etc.).

Strong skills in scripting and automation (Python, Bash, etc.).

Familiarity with Confluent Cloud, Confluent for Kubernetes, and similar tools.

Solid understanding of authentication and authorization mechanisms in distributed systems.

Production support mindset - with proven troubleshooting and incident resolution history.

Collaboration and communication skills - especially with dev teams depending on platform support.

Experience with Istio Service Mesh (bonus).

Experience with GovCloud (bonus).


Bonus Qualities:

Mentorship and leadership experience in infrastructure or SRE teams.

Contributions to automation or monitoring open-source tooling.

Active participant in SRE or DevOps communities.

Conference speaker or internal tech trainer.

Technical writing about infrastructure automation or reliability.
This position is open to all candidates.
 
8695015
21/05/2026
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
A global leader in performance marketing is looking for a talented DevOps Engineer to join us on our mission to simplify decision-making for millions!
Responsibilities:
Design, own, and evolve DevOps tooling, CI/CD architecture, and infrastructure automation strategies across multi-cloud environments (AWS & GCP), supporting high-performance, resilient production systems.
Manage Kubernetes clusters (EKS, GKE) and containerized microservices at scale, leveraging Helm, IaC, and other cloud-native technologies.
Collaborate with engineering and data teams to optimize cloud-native architectures for performance, cost-efficiency, scalability, and high availability.
Automate infrastructure and pipeline workflows using Python, Bash, and Groovy, with IaC tools like Terraform and CloudFormation, and CI/CD platforms such as Jenkins and GitHub Actions.
Support data workflows and ML deployments using orchestration tools like Airflow and CI/CD for data pipelines.
Work with AI-native tooling (e.g., MCP, agent frameworks, Cursor, OpenAI, Gemini and Vertex).
Bring out-of-the-box thinking, excellent problem-solving skills, and the ability to debug complex systems.
Requirements:
3+ years of hands-on experience with AWS in production environments, with strong working knowledge of Linux-based systems for deployment, debugging, and automation.
3+ years of DevOps experience supporting production-grade systems with high availability, scalability, and operational reliability.
Strong expertise in Kubernetes-based orchestration (EKS, KOps, GKE).
Extensive experience with CI/CD tools such as Git, GitHub, Jenkins, GitHub Actions, and Nexus.
Proficiency in scripting/programming languages, including Bash, Python, or Groovy, for automating infrastructure and pipelines.
Experience with Infrastructure as Code (IaC) tools like Terraform and CloudFormation.
Experience with logging, metrics, and observability stacks, such as Datadog, Telegraf, Elasticsearch, Kibana, Prometheus, and Grafana.
Ability to troubleshoot and debug complex, distributed systems across multiple cloud environments.
Only candidates meeting the above requirements will be considered.
This position is open to all candidates.
 
8660493
Location: Tel Aviv-Yafo
Job Type: Full Time and Hybrid work
Required AI Infrastructure & Reliability Engineer
What this role is really about
You'll join a 3-person platform team within our Business Technology group - owning the internal infrastructure that our AI platform and its users depend on. This isn't a product engineering role, and it isn't ticket work or babysitting pipelines someone else built. You're building and operating the internal foundation that the company runs on. The work covers the full stack of platform engineering: core cloud infrastructure (AWS, Kubernetes, IaC), CI/CD pipelines, AI-driven infrastructure components, and the SRE and observability practice that keeps it all honest - metrics, alerting, incident response, and reliability standards. As our AI capabilities grow, so does the complexity underneath them, and staying ahead of that is central to the role. If you treat infrastructure as a product - reusable, automated, observable, and built to last - this is your kind of role.
Job responsibilities
DevOps & AI-Driven Infrastructure - own CI/CD, deployment processes, and release reliability. Build and operate cloud infrastructure that is automated, intelligent, and continuously self-improving - not just managed.
Design and build our Terraform repository and IaC pipeline from scratch - AI-assisted generation, drift detection, and policy enforcement built in.
Build AI-driven GitHub Actions pipelines - automated code review, risk assessment, and intelligent deployment decisions.
Manage Kubernetes workloads across AWS accounts - zero downtime, fully automated, nothing left behind.
Embed AI into the operational layer - proactive drift detection, automated remediation, and intelligent scaling toward a self-healing runtime.
Reliability & SRE - improve uptime, resilience, and incident response.
Define and enforce SLOs/SLIs, error budgets, and on-call practices.
Lead incident response, postmortems, and systemic reliability improvements.
Own AI-specific reliability: model latency SLOs, token quota monitoring, rate limit handling, fallback and retry strategies, and cost-per-request alerting.
Observability & Telemetry - increase visibility, reduce noise, improve troubleshooting.
Establish and continuously evolve the observability stack: metrics, logs, distributed tracing, and alerting tuned for both application and AI workloads.
AI / LLM Operations - bringing AI systems to production and operating them at scale, with a focus on reliability, performance, and trust.
Own the AI infrastructure layer: rate limits, quota management, latency SLOs, and fallback strategies (retries, circuit breakers).
Operate LLM APIs in production with resilience and cost attribution per team/model.
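
To give a sense of the resilience patterns named above (retries, circuit breakers), here is a minimal Python sketch. It is an illustration under stated assumptions, not the team's implementation: `CircuitBreaker`, `CircuitOpenError`, `call_with_retries`, and all thresholds are hypothetical names and values, and the wrapped function stands in for any rate-limited upstream call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result


def call_with_retries(breaker, func, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Retry func through the breaker with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

The retry loop handles short, transient failures, while the breaker stops traffic to a dependency that keeps failing, which is where a fallback response would typically take over.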
Requirements:
2-4 years of hands-on DevOps, SRE, or infrastructure engineering experience in production SaaS environments.
Strong AWS experience: multi-account architecture, cross-account IAM, serverless and event-driven services (Lambda, SQS, SNS, EventBridge), and EKS cluster management.
Proven Kubernetes experience in production, including cross-account migrations and stateful workload management.
Proficiency with Terraform - repository structure design, module architecture, and CI/CD pipeline implementation.
Hands-on experience building and maintaining GitHub Actions pipelines for end-to-end CI/CD workflows.
Working Python proficiency for scripting, internal tooling, and workflow automation.
Practical experience implementing observability stacks from scratch: metrics, logging, distributed tracing, and alerting.
Experience owning reliability practices: SLOs, incident response, and postmortem culture.
Nice to have
Hands-on experience operating LLM APIs in production: rate-limit and quota management, cost attribution per team/model, latency monitoring, and resilience patterns (retries, fallbacks, circuit breakers).
FinOps experience across cloud, AI, and observability spend.
Experience introducing self-healing or auto-remediation patterns in production.
This position is open to all candidates.
 
8659781
01/06/2026
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a hands-on technical leader to drive the design, implementation, and evolution of our cloud infrastructure. You'll take ownership of building scalable, reliable, and secure systems that empower engineering teams to deliver with speed and confidence. You'll operate in a fast-paced environment, balancing innovation and pragmatism, and fostering a culture of continuous improvement and operational excellence.
Responsibilities:
Lead end-to-end technical initiatives to design and manage cloud infrastructure across AWS and other multi-cloud environments.
Build and evolve automation frameworks and internal tools using Python, Bash, and modern infrastructure-as-code technologies (Terraform or equivalents).
Architect, implement, and maintain CI/CD pipelines that streamline delivery and improve developer productivity.
Champion observability, reliability, and performance - establishing best practices around monitoring, alerting, and system health visibility.
Collaborate closely with cross-functional engineering teams to enable scalable deployments, efficient development workflows, and resilient production systems.
Drive technical tradeoff decisions that balance speed, cost, and reliability in a dynamic, growth-oriented environment.
Act as a mentor and advocate for DevOps culture, enabling teams to take ownership of infrastructure and operations.
Requirements:
6+ years of DevOps or Infrastructure Engineering experience in high-growth product environments.
Proven track record of leading or owning complex system deployments and cloud architectures end-to-end.
Expertise in cloud platforms (preferably AWS) with strong understanding of networking, security, and scalability best practices.
Proficiency in scripting and automation using Python and Bash.
Hands-on experience with Infrastructure as Code (Terraform, CloudFormation, or equivalent).
Deep familiarity with CI/CD systems (GitHub Actions, Jenkins, or similar) and container orchestration (Kubernetes, ECS, or EKS).
Experience building observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
Excellent collaboration and communication skills - a positive, pragmatic team player who thrives in high-velocity, tradeoff-driven environments.
This position is open to all candidates.
 
8674620
Confidential company
Location: Tel Aviv-Yafo
Job Type: Full Time
We are looking for a Senior DevOps Engineer to join our R&D team in developing the next rising product in the health tech landscape. If you are looking for a challenging, influential position and are passionate about making an impact, this might be the role for you.

As a Senior DevOps Engineer, you'll play a key role in the design, development, testing, deployment, and monitoring of our infrastructure and products. In this position, you'll make significant contributions to our observability stack, helping build and maintain robust systems for logs, metrics, traces, and alerting.

Our ideal candidate is passionate about DevOps and observability, has strong communication skills, and thrives on constant improvement for both technology and processes. If you enjoy working on multiple projects in parallel and are a proactive team player, you'll fit right in.

This is a unique opportunity to join the core team of a fast-growing startup, where your contributions will have a direct impact on our product and success.

Responsibilities

Support and collaborate with cross-functional engineering teams using cutting-edge technologies.
Contribute to the design, implementation, and maintenance of monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki)
Secure, scale, and manage our cloud environments (AWS and GCP)
Design and implement automation solutions for both development and production
Manage and improve our CI/CD pipelines for fast and safe delivery
Lead best practices in infrastructure, observability, configuration management, and system hardening
Continuously assess and improve existing infrastructure in line with industry standards
Requirements:
BSc in Computer Science, Engineering, or equivalent experience
5+ years of experience as a DevOps Engineer or similar software engineering role
Proven experience with Docker and Kubernetes (EKS preferred)
Hands-on experience with monitoring and observability tools, including Prometheus, Grafana, Datadog, or similar.
Expertise in Terraform for AWS infrastructure-as-code deployments
Strong collaboration and interpersonal communication skills
Excellent analytical thinking and problem-solving mindset
Proficiency with relational databases
Solid knowledge of Python and Bash scripting
Experience with test automation - an advantage
This position is open to all candidates.
 
8671069