Senior System Software Engineer - Dynamo and Triton Inference Server
NVIDIA
Full Time
Senior (5 to 8 years)
Requirements:
- Proven ability to ship high-performance, production-grade distributed systems and maintain large-scale GPU production deployments.
- Deep knowledge of GPU architecture, OS internals, parallel algorithms, and HW/SW co-design principles.
- Proficiency in systems languages such as C++ (CUDA), Python, or Rust, and fluency in writing hardware-aware code.
- An obsession with performance profiling, GPU kernel tuning, memory coalescing, and resource-aware scheduling (a minimal coalescing sketch follows this list).
- A passion for automation, testability, and continuous integration in large-scale systems.
- Comfort navigating across stack layers, from GPU drivers and kernels up to orchestration and inference serving.
- Strong communication, pragmatic problem-solving, and the ability to write clean, sustainable code.
- An ownership-driven mindset.

Nice to have:
- Experience operating large-scale GPU inference systems.
- Experience deploying and optimizing ML/HPC workloads on GPU clusters.
- Hands-on experience with multi-GPU training/inference frameworks.
- Familiarity with compiler tooling.
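To make the memory-coalescing requirement concrete, here is a minimal CUDA sketch (not from the posting): two copy kernels moving the same volume of data, one with warp-contiguous accesses and one with a deliberately scattered pattern. The kernel names, array size, and stride of 32 are illustrative assumptions; on most GPUs the strided version runs several times slower.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so each 32-thread warp covers one
// contiguous span that the memory system serves in a few wide transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i touches element (i * 32) % n, so consecutive threads sit
// 128 bytes apart and one warp scatters across 32 distinct cache lines.
__global__ void copy_strided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { int j = (i * 32) % n; out[j] = in[j]; }
}

int main() {
    const int n = 1 << 24;  // 16M floats, illustrative size
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);

    // Warm-up launches so one-time initialization cost does not skew timings.
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    copy_coalesced<<<grid, block>>>(in, out, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms_coalesced;
    cudaEventElapsedTime(&ms_coalesced, t0, t1);

    cudaEventRecord(t0);
    copy_strided<<<grid, block>>>(in, out, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms_strided;
    cudaEventElapsedTime(&ms_strided, t0, t1);

    printf("coalesced %.3f ms, strided %.3f ms\n", ms_coalesced, ms_strided);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Tools like Nsight Compute report the same effect directly as a drop in transactions-per-request efficiency, which is the kind of profiling signal this role calls for.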
The Senior System Software Engineer will push the limits of heterogeneous GPU environments, dynamic global scheduling, and end-to-end system performance by running code as close to the metal as possible. Responsibilities include:
- Designing and implementing scalable, low-latency runtime systems that coordinate thousands of GPUs.
- Building deterministic, hardware-aware abstractions optimized for CUDA, ROCm, or vendor-specific toolchains.
- Developing profiling, observability, and diagnostics tooling for real-time insight into GPU utilization, memory bottlenecks, and latency deviations (a minimal polling sketch follows this list).
- Future-proofing the stack to support evolving GPU architectures and multi-accelerator systems.
- Collaborating closely with ML compiler, orchestration, cloud infrastructure, and hardware ops teams to ensure architectural alignment and unlock joint performance wins.
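As a hedged sketch of the observability tooling described above, the following C++ program polls one GPU through NVML, the query API that ships with the NVIDIA driver. The one-second cadence, device index 0, and plain printf output are assumptions for illustration; real tooling would iterate over all devices and export samples to a metrics pipeline rather than stdout.

```cuda
// Build (typical): g++ gpu_poll.cc -I/usr/local/cuda/include -lnvidia-ml
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU only, for the sketch

    for (int tick = 0; tick < 10; ++tick) {
        nvmlUtilization_t util;  // % of time the SMs / memory were busy
        nvmlMemory_t mem;        // device memory occupancy in bytes
        nvmlDeviceGetUtilizationRates(dev, &util);
        nvmlDeviceGetMemoryInfo(dev, &mem);
        printf("sm=%u%% mem_bw=%u%% used=%llu MiB\n",
               util.gpu, util.memory,
               (unsigned long long)(mem.used >> 20));
        sleep(1);  // polling cadence is an assumption of this example
    }

    nvmlShutdown();
    return 0;
}
```

Sampling counters like these at a steady cadence is what makes latency deviations and memory bottlenecks visible fleet-wide before they become outages.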