Senior DGX Cloud Performance Engineer at NVIDIA

Santa Clara, California, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Cloud Computing, Technology

Requirements

  • 12+ years of proven experience
  • Ability to work with large scale parallel and distributed accelerator-based systems
  • Expertise optimizing performance and AI workloads on large scale systems
  • Experience with performance modeling and benchmarking at scale
  • Strong background in Computer Architecture, Networking, Storage systems, Accelerators
  • Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, and vLLM, among others)
  • Experience with AI/ML models and workloads, in particular LLMs
  • Understanding of DNNs and their use in emerging AI/ML applications and services
  • Bachelor's or Master's in Engineering (preferably Electrical Engineering, Computer Engineering, or Computer Science) or equivalent experience
  • Proficiency in Python, C/C++
  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, …)

Responsibilities

  • Develop benchmarks and end-to-end customer applications that run at scale and are instrumented for performance measurement, tracking, and sampling, in order to measure and optimize the performance of meaningful applications and services
  • Construct carefully designed experiments to analyze, study, and develop critical insights into performance bottlenecks and dependencies from an end-to-end perspective
  • Develop ideas on how to improve end-to-end system performance and usability by leading changes in the HW, the SW, or both
  • Collaborate with external CSPs during the full life cycle of cluster deployment and workload optimization to understand and drive standard methodologies
  • Collaborate with AI researchers, developers, and application service providers to understand their difficulties and requirements, project future needs, and share best practices
  • Work with a diverse set of LLM workloads and their application areas, such as health care, climate modeling, pharmaceuticals, financial futures, and genomics/drug discovery, among others
  • Develop the vital modeling framework and the TCO analysis to enable efficient exploration and sweep of the architecture and design space
  • Develop the methodology needed to drive the engineering analysis to advise the architecture, design and roadmap of DGX Cloud

Skills

Distributed Systems
Parallel Systems
Performance Analysis
Benchmark Development
Performance Optimization
AI Workloads
LLM Workloads
HW-SW Co-design
Cluster Architecture
NVIDIA DGX
Performance Profiling

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
