Senior Site Reliability Engineer, DGX Cloud
NVIDIA, Full Time
Senior (5 to 8 years)
Candidates should possess a Bachelor's degree in Computer Science or Engineering, or equivalent certifications, with 3 to 6 years of software development experience. A strong understanding of SRE and ITIL processes, experience with incident and change handling in production environments, and proficiency in a scripting language such as PowerShell, Bash, Perl, or Python are required. Experience with Terraform, cloud platforms (preferably GCP), Kubernetes (GKE), and Microsoft and Red Hat Linux server technologies, including SSL certificate management and file transfer protocols, is essential. Familiarity with CI/CD pipelines, monitoring tools, and security best practices is also necessary.
The Site Reliability Engineer will contribute to proactive monitoring and automation, drive compliance and security efforts, and develop and maintain installation and configuration procedures. Responsibilities include advancing the DevOps discipline, evaluating manual support tasks, recommending DevOps tooling, and troubleshooting system problems. The role involves maintaining and improving DevOps methodology and CI/CD pipelines, resolving job failures within SLAs, and addressing security vulnerabilities. Additionally, the engineer will manage application migrations from on-premises environments to the cloud and perform post-deployment validation checks.
Payment technologies and software solutions