Senior HPC Dev Ops Engineer at NVIDIA

Westford, Massachusetts, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: High Performance Computing, Quantum Computing

Requirements

  • Proven experience (12+ years) in HPC systems engineering or administration within large-scale Linux-based GPU environments
  • Extensive expertise in Slurm and Linux systems administration, plus proficiency in configuration management tools such as Ansible or Base Command Manager (formerly Bright Cluster Manager)
  • Practical familiarity with NVIDIA GPU technologies, InfiniBand, RDMA, and high-speed networking configurations
  • Proficiency in containerization and orchestration tools like Singularity, Docker, or Kubernetes
  • Knowledge of data center operations, including rack power, cooling methods (such as liquid-cooled systems), and network management
  • Ability to automate, script, install/compile applications, and optimize performance
  • Bachelor’s degree in Computer Science, Electrical/Computer Engineering, Physics, or equivalent experience
  • Outstanding problem-solving and diagnostic skills, and the ability to operate in a multidisciplinary, high-performance environment

Responsibilities

  • Build and operate a brand new hybrid compute environment spanning HPC and quantum systems
  • Lead Linux provisioning, configuration management, and system tuning across hundreds of GPU nodes and supporting infrastructure
  • Coordinate and optimize Slurm job scheduling: define policies, manage QoS, tune workloads, and help users translate research requirements into efficient batch workflows
  • Coordinate data center tasks, partner with data center operations teams, and liaise with the quantum lab
  • Integrate and sustain container orchestration (e.g., Singularity, Docker, or Kubernetes for HPC) to support simulation workloads and quantum job processing
  • Run the storage environment, consisting of Lustre, NFS, and cloud storage
  • Work closely with quantum engineering teams to integrate quantum control nodes and orchestration gateways, and to facilitate data exchange between HPC and quantum systems
  • Troubleshoot and improve performance of complex hybrid workloads, including CUDA, MPI, and CUDA-Q applications
  • Develop and automate operational workflows with Ansible, GitHub Actions, and CI/CD pipelines
  • Support researchers and developers with environment setup, debugging, and performance profiling on NVIDIA hardware and quantum simulators
  • Serve as the primary systems administrator and reliability owner for the GB200 GPU infrastructure

Skills

Linux
Slurm
Ansible
Kubernetes
Docker
Singularity
Lustre
NFS
CUDA
MPI
CUDA-Q
GitHub Actions
CI/CD

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and systems on a chip (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
