NVIDIA

Senior HPC Performance Engineer

Germany

Compensation: Not Specified
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, High Performance Computing, Visualization, Semiconductors

Performance Engineer - GPU Communication Libraries

Employment Type: Full-time

Position Overview

NVIDIA is a leader in Artificial Intelligence, High-Performance Computing (HPC), and Visualization. Our invention, the GPU, is central to modern computing, driving advancements from AI to autonomous vehicles. We are seeking a motivated Performance Engineer to influence the roadmap of our GPU communication libraries (NCCL, NVSHMEM, GPUDirect). These libraries are critical for scaling Deep Learning and HPC applications, which increasingly demand massive compute power across tens of thousands of GPUs connected via high-speed interconnects and networking. This role offers an exceptional opportunity to advance the state-of-the-art in GPU communication performance for large-scale deployments.

Responsibilities

  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Study the interaction of our libraries with all hardware (GPU, CPU, Networking) and software components in the stack.
  • Evaluate proof-of-concepts and conduct trade-off analysis for multiple solutions.
  • Triage and root-cause performance issues reported by customers.
  • Collect and analyze performance data; build tools and infrastructure for visualization and analysis.
  • Collaborate with a dynamic team distributed across multiple time zones.

Requirements

  • M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field.
  • Relevant performance engineering and HPC experience.
  • 3+ years of experience with parallel programming.
  • Experience with at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
  • Experience conducting performance benchmarking and triage on large-scale HPC clusters.
  • Good understanding of computer system architecture, hardware-software interactions, and operating system principles.
  • Ability to implement micro-benchmarks in C/C++ and modify codebases.
  • Proficiency in debugging performance issues across the entire hardware/software stack.
  • Proficiency in a scripting language, preferably Python.
  • Familiarity with containers, cloud provisioning, and scheduling tools (Kubernetes, SLURM, Ansible, Docker).
  • Adaptability and passion for learning new areas and tools.
  • Flexibility to work and communicate effectively across different teams and time zones.

Ways to Stand Out

  • Practical experience with InfiniBand/Ethernet networks (RDMA, topologies, congestion control).
  • Experience debugging network issues in large-scale deployments.
  • Familiarity with CUDA programming and/or GPUs.
  • Experience with Deep Learning frameworks such as PyTorch or TensorFlow.

Company Information

NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization. Our teams consist of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, an extensive benefits package, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.

Skills

HPC
Performance Engineering
GPU
Deep Learning
NCCL
NVSHMEM
GPUDirect
NVLink
PCIe
InfiniBand
Ethernet
Performance Analysis
Root Cause Analysis
Data Visualization
Tool Development
Infrastructure Development

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip (SoC) units for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
