Director, Global Network Reliability Engineering at NVIDIA

Santa Clara, California, United States

Compensation: Not Specified
Experience Level: Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: Technology, AI, Networking

Requirements

  • Bachelor’s degree in Computer Science, related technical field, or equivalent experience
  • Experience building and growing geographically distributed teams that appreciate local operations, bring a global perspective, and follow standards
  • Ability to do technical deep-dives into code, networking, operating systems, and storage, along with the verbal and cognitive agility to hold your own in strategy discussions with NVIDIA’s executive team and peer SMEs
  • Ability to identify trends and promote solutions that solve challenges efficiently across multiple product areas
  • Excellent innovative thinking, collaboration, and problem-solving skills
  • 12+ years of overall experience with system design, network architecture, network engineering, and network operations
  • 7+ years of leadership experience

Responsibilities

  • Mature the current support model and processes into a more data-driven, automated SRE model
  • Build an in-house team of reliability experts for networking support and operations from the existing outsourced SMEs, providing leadership, direction, and strategy for a growing team
  • Set the technical vision, strategy, and roadmap for network operations in partnership with key infrastructure and partner teams
  • Work across Network Architecture and Network Engineering, partnering closely to establish runbooks and regular training sessions and to ensure the network is built to be self-healing
  • Understand RCAs from events and incidents, and work with AI operations to enrich observability tooling for a better full-stack view from the network to applications
  • Influence the architecture of NVIDIA networks both on-prem and in the clouds
  • Lead NVIDIA’s global network operations, ensuring reliability, scalability and efficiency goals are defined and met
  • Lead a team of network reliability engineers to bring a data-driven approach to operations, with a focus on observability, well-defined success metrics, and continuous improvement
  • Lead design and automation of all operations, provide architectural input based on outage patterns and observability trends, and be the keeper of excellence in Networking
  • Leverage an execution mindset to facilitate effective translation of strategic plans to incremental delivery of impact to the business

Skills

Network Reliability Engineering
SRE
Network Operations
Observability
Automation
Data Centers
Global Operations
Leadership
Strategic Planning
Problem-Solving
Communication

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, allowing it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
