Senior Site Reliability Engineer, DGX Cloud
NVIDIA
Full Time
Senior (5 to 8 years)
Candidates should have 4 years of professional experience, demonstrable Linux familiarity, and experience with Kubernetes, Docker, and containers. Experience with at least one major cloud provider (AWS, GCP, or Azure) is required, along with motivation to learn, commitment to excellence, problem-solving and troubleshooting abilities, and excellent written and verbal communication skills. Previous experience working directly with customers, experience with DevOps, contributions to open-source projects, and experience with Splunk or Prometheus are a plus.
The Customer Reliability Engineer will operate, monitor, and maintain the platform to ensure availability, predictability, and reliable operations. They will learn and build expertise in Kubernetes, cloud engineering, and cloud networking, and work on a modern, cloud-native product. Responsibilities include creating strong relationships with customers, helping them achieve their reliability goals, providing feedback to shape product direction, owning the customer experience, prioritizing and solving issues, meeting SLAs, and participating in a pager rotation for 24x7 coverage. Up to 20% of time can be spent on side projects such as contributing to the open-source Airflow repository or developing internal monitoring and alerting systems.
Data orchestration platform for pipeline management
Astronomer.io provides a data orchestration platform that uses Apache Airflow to simplify the deployment of data pipelines. Its main product, Astro, helps businesses manage and monitor their data flows so that they can focus on delivering essential data pipelines. The platform supports data unification across various clouds and offers more than 1,500 integrations, making it suitable for data and machine learning teams in industries like finance and e-commerce. Astronomer.io distinguishes itself from competitors by offering enterprise-grade security, zero-downtime upgrades for Airflow, and tools for monitoring pipeline health, which enhance compute efficiency and reduce delays in task scheduling. The company's goal is to empower organizations to optimize their data strategies and achieve a significant return on investment by ensuring their applications operate with maximum reliability.