Required: Site Reliability Engineer, Panda team
Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer on the IT Production team in our Tel Aviv office, you'll play a vital role in building robust services and solving infrastructure challenges through automation, working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.
As a Site Reliability Engineer, you'll bring value by:
Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.
Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using Infrastructure as Code (IaC) and programming languages (e.g., Python, Go, Rust).
Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.
Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.
Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.
Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.
Our Tech Stack:
Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
Requirements:
4+ years of experience in software development, with a proven track record of designing and developing internal tools, automation frameworks and platform components in large-scale distributed production environments, with a focus on Linux operating systems.
Deep, demonstrable expertise in one of the following programming languages: Go, C, Rust, Python or Java.
Experience in observability tooling development, specifically implementing custom metrics, tracing and logging within application code.
Practical understanding of the HTTP protocol (including HTTP methods, status codes and headers). Proven ability to design, implement and instrument robust internal APIs (e.g., using REST or gRPC).
Understanding of Linux operating system internals: kernel configuration, system calls, process management, memory and I/O.
Proven ability to troubleshoot and optimize performance bottlenecks under heavy load using advanced monitoring and profiling tools for high-throughput and low-latency applications.
Bonus points if you have:
Experience as an SRE, DevOps Engineer or System Administrator in a large distributed environment, with a focus on Linux operating systems.
This position is open to all candidates.