Join the Workload Availability team as a Software Engineer in Ecosystem Engineering. Your core mission is to safeguard the stability and availability of mission-critical workloads on OpenShift by developing and integrating robust proactive and reactive remediation mechanisms. You will focus on the complex challenges of node health checking, automated fencing, and self-healing within the cluster, ensuring these critical availability features integrate seamlessly with the diverse ecosystem of third-party hardware, cloud platforms, and infrastructure providers.
What You Will Do
Maintain critical components, controllers, and operators (primarily in Go) responsible for detecting, diagnosing, and automating the recovery of unhealthy cluster nodes on OpenShift.
Take responsibility for the quality of our offerings by participating in peer code reviews, continuous integration (CI), and the software release process using the Konflux CI/CD system.
Develop intelligent node health-checking mechanisms that accurately determine a node's operational status and trigger the appropriate self-healing or external remediation actions.
Contribute upstream to projects that focus on Kubernetes-native machine and node remediation to advance the platform's self-healing capabilities.
Troubleshoot and resolve complex cross-ecosystem availability and reliability failures, requiring deep debugging into kernel-level behavior, cloud infrastructure APIs, and Kubernetes control plane logic.
Proactively utilize AI-assisted development tools (e.g., GitHub Copilot, Cursor, Claude Code) for code generation, auto-completion, and intelligent suggestions to accelerate development cycles and enhance code quality.
What You Will Bring
2+ years of experience working in a Linux environment with at least one programming language such as Go, Python, Java, C, or C++.
Experience with container ecosystem technologies such as Docker, Kubernetes, or Red Hat OpenShift.
Excellent analytical and debugging skills, capable of diagnosing failures across operating system, container runtime, and Kubernetes control plane boundaries.
Strong communication and collaboration skills for successful engagement with both internal engineering teams and external ecosystem partners.
The following are considered a plus:
Familiarity with High Availability (HA) concepts, particularly focusing on node fencing (STONITH), cluster remediation, and machine lifecycle management.
Familiarity with the concepts and implementation of the Machine API in Kubernetes/OpenShift.
Familiarity with common fencing protocols or power management interfaces (e.g., IPMI, Redfish, or cloud-specific compute APIs).
Active contributions to upstream Kubernetes, OpenShift, or related cluster lifecycle projects.
Knowledge of storage-related HA concerns, such as the impact of node failure and fencing on data integrity.
This position is open to all candidates.