Base Command Manager Engineer - NVIS NPI at NVIDIA

Santa Clara, California, United States

Compensation: Not Specified

Experience Level: Senior (5 to 8 years)

Job Type: Full Time

Visa: Unknown

Industries: AI, High Performance Computing, Semiconductor

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience)
10+ years of experience in at least two of the following: HPC/large-scale cluster administration, Linux systems engineering, infrastructure automation (e.g., Ansible, Salt), or data center operations
5+ years of direct, hands-on experience provisioning, managing, and troubleshooting clusters using NVIDIA Base Command Manager (BCM)
Deep, practical knowledge of how Slurm and Kubernetes are coordinated, deployed, and managed by BCM, including workload submission and resource management
Proficiency in Python and Bash scripting for automation, cluster validation, and workflow optimization
In-depth experience with cluster management and monitoring tools (e.g., Prometheus, Grafana, DCGM, and similar observability stacks)
Outstanding written and verbal communication skills, with the ability to explain complex technical concepts to both technical and non-technical collaborators
A customer-first attitude, self-motivation, and a proactive approach to leadership in diverse environments

Ways to stand out

Proficiency with cluster networking including InfiniBand and Spectrum-X
Experience with NVIDIA Mission Control
Familiarity with CI/CD workflows in an infrastructure context, including tools such as Git, GitLab, and Jenkins
Background in Professional Services, customer-facing deployment, and solutions optimization
Industry certifications such as CKA/CKAD (Certified Kubernetes Administrator/Developer), RHCE, or other advanced Linux/HPC credentials

Responsibilities

Play a key role in NVIDIA’s NPI team, acting as the link between engineering and the NVIS field team for cluster deployment and management solutions
Collaborate closely with engineering and product teams to review and influence design decisions for products centered around large-scale, BCM-managed clusters
Evaluate changes in BCM and underlying OS/software stacks, communicating the impact to the field organization to maintain robust and scalable deployment workflows
Define and relay detailed cluster management requirements to engineering, enabling the successful New Product Introduction (NPI) of next-generation GPU platforms
Describe architectural and design changes and build clear, actionable tasks for the field, including standardized deployment guides, configuration best practices, and validation workflows
Validate complex cluster configurations, including Slurm and Kubernetes orchestrators, for performance, scalability, and resilience, ensuring they meet the demands of real-world customer scenarios
Support the NPI team by bridging knowledge gaps, tracking progress, and aligning collaborators throughout the product development lifecycle
Support NVIDIA's mission by ensuring our breakthrough technologies are successfully deployed for global customers and OEM partners

Skills

Base Command Manager

Slurm

Kubernetes

Cluster Management

GPU Platforms

NPI

Deployment Guides

Validation Workflows

OS Software Stacks

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, allowing it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California

Year Founded: 1993

Total Funding: $19.5M

Company Stage: IPO

Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming

Employees: 10,001+

Benefits

Company Equity

401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.

Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.

Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.

The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.

NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.

Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.

Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
