Senior Site Reliability Engineer, DGX Cloud
NVIDIA
Full Time
Senior (5 to 8 years)
Candidates should have over 4 years of experience in Site Reliability, DevOps, or Cloud Engineering roles, with proven experience in large-scale production environments. Deep proficiency in AWS services such as EKS, ECS, EC2, S3, RDS, and Lambda is required, along with extensive experience managing production workloads in the cloud. Proficiency in application observability, monitoring, and logging, including hands-on experience with tools like Splunk, OpenTelemetry, Prometheus, Grafana, or Datadog, is essential. Experience with Infrastructure as Code (IaC) is also required.
The Site Reliability Engineer will design, build, and maintain a comprehensive observability platform using tools like Splunk and OpenTelemetry, and leverage AIOps principles for enhanced anomaly detection and predictive alerting. Responsibilities include designing, implementing, and conducting Chaos Engineering experiments to proactively identify and remediate system weaknesses, partnering with software engineering teams to architect for high availability and fault tolerance, and defining, measuring, and evangelizing Service Level Indicators (SLIs) and Service Level Objectives (SLOs). The role also involves analyzing performance metrics and distributed traces to resolve latency bottlenecks, implementing cost optimization strategies, identifying and automating manual operational tasks, enhancing and maintaining Infrastructure as Code (IaC) modules with Terraform, and improving CI/CD pipelines for safe and automated deployments.
Provides financial information and analytics services
S&P Global provides financial information and analytics to a wide range of clients, including investors, corporations, and governments. The company offers services such as credit ratings, market intelligence, and indices, which help clients understand and navigate the global financial market. S&P Global's products work by utilizing advanced data analytics and research to deliver insights that assist clients in making informed decisions and managing risks. Unlike many competitors, S&P Global has a diverse range of divisions, including S&P Global Ratings and S&P Dow Jones Indices, which allows it to cater to various financial needs. The company's goal is to support clients in driving growth while also committing to corporate responsibility and positive societal impact.