NVIDIA

Senior System Software Engineer, NCCL - Partner Enablement

Germany

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Semiconductors, High Performance Computing, Artificial Intelligence, Deep Learning

Position Overview

  • Job Type: Full time

NVIDIA is a leader in Artificial Intelligence, High Performance Computing, and Visualization. Our invention, the GPU, is central to modern computing and powers groundbreaking innovations from AI to autonomous vehicles. We are seeking a motivated Partner Enablement Engineer to guide key partners and customers in using NCCL, a crucial GPU communication library for scaling Deep Learning and HPC applications. The role offers an end-to-end view of the AI networking stack and involves working with high-speed networking technologies such as InfiniBand, RoCE, and Ethernet.

Responsibilities

  • Engage with partners and customers to diagnose and resolve functional and performance issues with NCCL.
  • Conduct performance characterization and analysis of NCCL and Deep Learning applications on advanced GPU clusters.
  • Develop tools and automation to isolate issues across new systems and platforms, including cloud environments (Azure, AWS, GCP).
  • Provide guidance to customers and support teams on HPC best practices for multi-node cluster applications.
  • Document and deliver training sessions and webinars on NCCL.
  • Collaborate with internal teams across different time zones on networking, GPUs, storage, infrastructure, and support.

Requirements

  • B.S. or M.S. degree in Computer Science or Computer Engineering (or equivalent experience), plus 5+ years of relevant experience.
  • Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
  • Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design.
  • Experience working with the engineering or academic research community supporting HPC or AI.
  • Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, and congestion control.
  • Expertise in Linux fundamentals and a scripting language, preferably Python.
  • Familiarity with containers, cloud provisioning, and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible).
  • Adaptability and a passion for learning new areas and tools.
  • Flexibility to work and communicate effectively across different teams and time zones.

Ways to Stand Out

  • Experience conducting performance benchmarking and developing infrastructure on HPC clusters.
  • Prior system administration experience, especially for large clusters.
  • Experience debugging network configuration issues in large-scale deployments.
  • Familiarity with CUDA programming and/or GPUs.
  • Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow.
  • Deep understanding of technology and a passion for your work.

Company Information

NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization. Our teams are composed of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, an extensive benefits package, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.

Skills

Parallel Programming
MPI
NCCL
UCX
NVSHMEM
C/C++
Performance Analysis
Networking
HPC
Cloud Platforms (Azure, AWS, GCP)

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
