NVIDIA

Senior Site Reliability Engineer

India

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, Cloud Computing, Computer Graphics, Semiconductors

Senior Site Reliability Engineer - DGX Cloud Engineering

Position Overview

NVIDIA is seeking a passionate Senior Site Reliability Engineer to join our DGX Cloud Engineering Team. In this role, you will be instrumental in shaping the future of AI and GPUs in the Cloud. NVIDIA DGX Cloud is a specialized cloud platform designed for AI tasks, empowering organizations to transition AI projects from development to deployment in the era of intelligent AI. If you are passionate about cloud software development, committed to quality, and excel at building cloud-scale software systems, we invite you to contribute to our mission of delivering GPU-powered services globally.

What You'll Be Doing

  • Play a crucial role in ensuring the success of the Omniverse on DGX Cloud platform.
  • Build our deployment infrastructure processes.
  • Create world-class SRE measurement and automation tools to enhance operational efficiency.
  • Maintain a high standard of service operability and reliability.
  • Design, build, and implement scalable cloud-based systems for PaaS/IaaS.
  • Collaborate closely with other teams on new products, features, and improvements to existing products.
  • Develop, maintain, and improve cloud deployments of our software.
  • Participate in the triage and resolution of complex infrastructure-related issues.
  • Collaborate with development, QA, and Product teams to establish, refine, and streamline our software release process and software observability to ensure service operability, reliability, and availability.
  • Maintain services post-launch by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
  • Develop, maintain, and improve automation tools to enhance the efficiency of SRE operations.
  • Practice balanced incident response and conduct blameless postmortems.
  • Participate in an on-call rotation to support production systems.

What We Need to See

  • BS or MS in Computer Science or an equivalent program from an accredited University/College.
  • 8+ years of hands-on software engineering or equivalent experience.
  • Demonstrated understanding of cloud design principles in areas such as virtualization, global infrastructure, distributed systems, and security.
  • Expertise in Kubernetes (K8s) and KubeVirt, and experience building RESTful web services.
  • Understanding of building agentic AI solutions, preferably using NVIDIA's open-source AI solutions.
  • Demonstrated working experience with SRE principles, including metrics emission for observability, monitoring, and alerting using logs, traces, and metrics.
  • Hands-on experience with Docker, containers, and Infrastructure as Code (IaC) tools like Terraform for deployment and CI/CD pipelines.
  • Knowledge of concepts related to working with Cloud Service Providers (CSPs), such as AWS (Fargate, EC2, IAM, ECR, EKS, Route53, etc.) and Azure.

Ways to Stand Out

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.
  • A proven track record of solving complex problems with elegant solutions.
  • Prior experience with Go, Python, and React.
  • Demonstrated delivery of complex projects in previous roles.
  • Demonstrated ability to develop front-end applications using concepts such as SSA (Server-Side Analytics) and RBAC (Role-Based Access Control).

Company Information

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Employment Type

Full time

Location Type

Information not provided

Salary

Information not provided

Skills

Site Reliability Engineering
Cloud-based systems
PaaS
IaaS
Automation
Deployment infrastructure
SRE measurement
AI
GPUs
Cloud software development
Scalability
Operability
Reliability

NVIDIA

Designs GPUs and AI computing solutions

About NVIDIA

NVIDIA designs and manufactures graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that cater to developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which enhance experiences in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, allowing it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.

Headquarters: Santa Clara, California
Year Founded: 1993
Total Funding: $19.5M
Company Stage: IPO
Industries: Automotive & Transportation, Enterprise Software, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Company Equity
401(k) Company Match

Risks

Increased competition from AI startups like xAI could challenge NVIDIA's market position.
Serve Robotics' expansion may divert resources from NVIDIA's core GPU and AI businesses.
Integration of VinBrain may pose challenges and distract from NVIDIA's primary operations.

Differentiation

NVIDIA leads in AI and HPC solutions with cutting-edge GPU technology.
The company excels in diverse markets, including gaming, data centers, and autonomous vehicles.
NVIDIA's cloud services, like CloudXR, offer scalable solutions for AI and machine learning.

Upsides

Acquisition of VinBrain enhances NVIDIA's AI capabilities in the healthcare sector.
Investment in Nebius Group boosts NVIDIA's AI infrastructure and cloud platform offerings.
Serve Robotics' expansion, backed by NVIDIA, highlights growth in autonomous delivery services.
