As a Solution Engineer, you will play a pivotal role in designing, deploying, and optimizing our Network Cloud AI Infrastructure solutions. This individual contributor role requires a blend of technical expertise, leadership, and hands-on experience to implement cutting-edge solutions for our customers. You will collaborate with sales engineering teams, customers, and cross-functional teams - including Product Management, Solution Architects, Engineering, and Marketing - to define technical requirements, articulate solution value, and ensure successful deployment on-site.
Key responsibilities include guiding customers through the design and deployment process, aligning technical solutions with business needs, and providing critical feedback to improve our product offerings. This position demands strong technical acumen, exceptional communication skills, and the ability to lead complex, high-impact projects in dynamic environments.
Responsibilities:
Building robust AI/HPC infrastructure for new and existing customers.
Serve in a hands-on technical role building and supporting NVIDIA/AMD-based platforms.
Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, training stability, real-time monitoring, logging, and alerting.
Administer Linux systems, ranging from powerful GPU-enabled servers to general-purpose compute systems.
Design and plan rack layouts and network topologies to support customer requirements.
Design and evaluate automation scripts for network operations, configuring server and switch fabrics.
Perform Data Center upgrades and ensure smooth deployment of our solutions.
Install and configure our products, ensuring optimal performance and customer satisfaction.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement.
Provide feedback to internal teams by filing bugs, documenting workarounds, and suggesting improvements.
Engage with sales teams and customers to ensure success with major opportunities and deployments.
Introduce new products to the sales and support teams and to the customers.
Deliver technical training and transfer-of-information (TOI) sessions for support/sales engineers, partners, and customers.
Collaborate on product definition through customer requirement gathering and roadmap planning.
Requirements:
5+ years of experience deploying and administering AI/HPC clusters or general-purpose compute systems.
5+ years of hands-on Linux experience (e.g., RHEL, CentOS, Ubuntu) and production infrastructure support (e.g., networking, storage, monitoring, compute, installation, configuration, maintenance, upgrade, retirement).
Proficiency in Cloud, Virtualization, and Container technologies.
Deep understanding of operating systems, computer networks, and high-performance applications.
Hands-on experience with Bash, Python, and configuration management tools (e.g., Ansible).
Established record of leading technical initiatives and delivering results.
Ability to write in-depth technical content (white papers, technical briefs, test reports, etc.) for external audiences, balancing technical accuracy, strategy, and clear messaging.
Ability to travel domestically and internationally up to 20% of the time.
Ways to stand out from the crowd:
Familiarity with AI-relevant data center infrastructure and networking technologies, such as InfiniBand, RoCEv2, lossless Ethernet technologies (PFC, ECN, etc.), accelerated computing, GPUs, and DPUs.
Familiarity with GPU resource scheduling managers (Slurm, Kubernetes, etc.).
Familiarity with monitoring tools (e.g., Prometheus, Grafana, ELK Stack) and telemetry (gRPC, gNMI, OTLP, etc.).
Understanding of data center operations fundamentals in networking, cooling, and power.
This position is open to all candidates.