Senior Server RAS Engineer at NVIDIA

Bengaluru, Karnataka, India

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Technology, AI, Semiconductors

Requirements

  • BS, MS, or PhD or equivalent experience in EE/CS or related field with 10+ years demonstrated experience
  • Strong Python programming in Linux operating environment, strong understanding of Linux kernel internals, strong code review skills
  • Extensive knowledge of system-level architecture invention, reliability engineering, and fault tolerance mechanisms, including optimizing RAS architectures for complex computing systems, data centers, or critical applications
  • Proficient in scale-out architectures (hands-on experience a plus)
  • Proficiency in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis)
  • Excellent problem-solving skills, attention to detail, and the ability to analyze complex system-level issues
  • Excellent written and oral communication skills, a strong work ethic, a deep sense of collaboration, pride in producing quality work, and a commitment to finishing tasks every single day
  • Self-starter who loves to find creative solutions to complicated problems

Ways to stand out

  • Consistent track record of doing RAS at platform level
  • In-depth understanding of the interaction of machine check architecture and error flows with system firmware/software
  • Hands-on with x86 or ARM system architecture

Responsibilities

  • Design, architect, and deliver server-level RAS for NVIDIA’s data center products
  • Define RAS requirements that ensure compliance with industry standards and customer expectations for scale-out environments
  • Develop fault detection, isolation, and recovery mechanisms to ensure system resilience and minimize downtime
  • Evaluate and select appropriate technologies and components to optimize reliability, availability, and serviceability, considering factors such as MTBF, MTTR, and TCO
  • Collaborate with customers, vendors, and suppliers to assess and integrate their RAS-related solutions into the overall system architecture
  • Conduct system and cluster level simulations, analysis, and testing to validate and verify the effectiveness of the RAS architecture and its components
  • Stay up to date with the latest advancements in RAS techniques, fault tolerance mechanisms, and industry trends to guide future system designs
  • Work with NVIDIA partners on RAS-related architecture and discussions to improve their use of NVIDIA products
  • Work on all phases of product development, from product definition, architecture, and design, through implementation, debugging, testing, and early customer support

Skills

RAS
Reliability Engineering
Availability Engineering
Serviceability Engineering
GPU Systems
Grace Systems
Fault Detection
Fault Isolation
Fault Recovery
MTBF
MTTR
TCO
Server Architecture
Data Center Systems

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
