[Remote] Senior Solutions Architect, HPC and AI at NVIDIA

United Kingdom

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Technology, AI, High Performance Computing

Requirements

  • BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related engineering field, or equivalent practical experience
  • 8+ years of experience in accelerated computing technologies at cluster scale, ideally including work with NVIDIA platforms
  • Strong programming skills in at least one of the following languages: C, C++, or Python
  • Practical experience identifying and resolving bottlenecks in large-scale training workloads or parallel applications
  • Hands-on experience in profiling and debugging large parallel applications
  • Solid understanding of CPU and GPU architectures, CUDA, parallel filesystems, and high-speed interconnects
  • Experience working with large compute clusters and an understanding of their internal scheduling and resource management mechanisms (e.g. SLURM or cloud-based clusters)
  • Strong knowledge of training pipelines and frameworks, including their internal operations and performance characteristics

Ways To Stand Out

  • Experience debugging training pipelines running on thousands of GPUs in a production environment
  • Hands-on experience with performance profiling and optimization using tools like Nsight Systems and Nsight Compute, and a good understanding of NCCL, MPI, and low-level communication libraries
  • Ability to debug stability issues across the entire stack: parallel application, training frameworks, runtime libraries, schedulers, and hardware
  • Solid understanding of the internal workings of LLM frameworks such as PyTorch, Megatron-LM, or NeMo and how they affect compute layers like CPUs, GPUs, network, and storage, or an understanding of inference tools such as vLLM, Dynamo, TensorRT-LLM, Red Hat Inference Server, or SGLang

Responsibilities

  • Collaborating with NVIDIA’s training framework developers and product teams to stay ahead of the latest features and help partners to adopt them effectively
  • Assisting with deploying, debugging, and improving the efficiency of AI workloads on large-scale NVIDIA platforms
  • Benchmarking new framework features, analyzing performance, and sharing actionable insights with both customers and internal teams
  • Working directly with external customers to solve cluster performance and stability issues, identify bottlenecks, and implement effective solutions
  • Building expertise and guiding customers in scaling workloads efficiently and reliably on the latest generation of NVIDIA GPUs
  • Contributing to Europe’s Sovereign AI initiative by helping customers implement advanced resiliency features within AI training pipelines

Skills

GPU
HPC
AI
Training Workloads
Inference
NVIDIA
Cluster Scale
Benchmarking
Performance Optimization
Debugging
Resiliency

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
