Senior SRE Engineer
Phantom - Full Time
- Senior (5 to 8 years)
Candidates should possess a Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field, along with 5+ years of proven experience. They must have a solid programming background in Python, Go, or a similar language, excellent debugging and problem-solving skills, and a strong understanding of architectural requirements and development processes. Proficiency in configuration management and Infrastructure-as-Code (IaC) tools such as Ansible, Puppet, Chef, and Terraform is required, as is expertise in Kubernetes architecture, networking, RBAC, and persistent storage. Familiarity with secret management tools such as HashiCorp Vault or AWS Secrets Manager is also necessary, alongside proficiency in data analytics, visualization, and monitoring tools such as Kibana, Grafana, Splunk, Zabbix, and Prometheus.
The Senior Site Reliability Engineer will be responsible for the end-to-end implementation of Kubernetes architecture, including design, deployment, hardening, networking, and scaling. They will implement high-availability clusters and disaster recovery solutions, use Configuration-as-Code and Infrastructure-as-Code tools, design and implement logging and monitoring solutions, develop tools for automating workflows, and apply AI techniques to extract insights from machine and job data. The role involves participating in prototyping cloud infrastructure, providing on-call support and critical issue coverage, and collaborating with business units within NVIDIA Software to meet their infrastructure and system needs.
Designs GPUs and AI computing solutions
NVIDIA designs graphics processing units (GPUs) and system-on-a-chip units (SoCs) for various markets, including gaming, professional visualization, data centers, and automotive. Its products include GPUs tailored for gaming and professional use, as well as platforms for artificial intelligence (AI) and high-performance computing (HPC) that serve developers, data scientists, and IT administrators. NVIDIA generates revenue through the sale of hardware, software solutions, and cloud-based services, such as NVIDIA CloudXR and NGC, which support work in AI, machine learning, and computer vision. What sets NVIDIA apart from competitors is its strong focus on research and development, which allows it to maintain a leadership position in a competitive market. The company's goal is to drive innovation and provide advanced solutions that meet the needs of a diverse clientele, including gamers, researchers, and enterprises.