Sr. Staff Software Engineer - High Performance GPU Inference Systems
About Groq
Groq delivers fast, efficient AI inference. Our LPU-based system powers GroqCloud™, giving businesses and developers the speed and scale they need. Headquartered in Silicon Valley, we are on a mission to make high-performance AI compute more accessible and affordable. When real-time AI is within reach, anything is possible. Build fast.
Position Overview
In this role, you will push the limits of heterogeneous GPU environments, dynamic global scheduling, and end-to-end system performance, all while running code as close to the metal as possible.
Responsibilities & Opportunities
- Distributed Systems Engineering: Design and implement scalable, low-latency runtime systems that coordinate thousands of GPUs across tightly integrated, software-defined infrastructure.
- Low-Level GPU Optimization: Build deterministic, hardware-aware abstractions optimized for CUDA, ROCm, or vendor-specific toolchains, ensuring ultra-efficient execution, fault isolation, and reliability.
- Performance & Diagnostics: Develop profiling, observability, and diagnostics tooling for real-time insights into GPU utilization, memory bottlenecks, and latency deviations—continuously improving system SLOs.
- Next-Gen Enablement: Future-proof the stack to support evolving GPU architectures (e.g., H100, MI300), NVLink/Fabric topologies, and multi-accelerator systems (including FPGAs or custom silicon).
- Cross-Functional Collaboration: Work closely with teams across ML compilers, orchestration, cloud infrastructure, and hardware ops to ensure architectural alignment and unlock joint performance wins.
Ideal Candidate Profile
Must-Haves:
- Proven ability to ship high-performance, production-grade distributed systems and maintain large-scale GPU production deployments.
- Deep knowledge of GPU architecture (memory hierarchies, streams, kernels), OS internals, parallel algorithms, and HW/SW co-design principles.
- Proficient in systems languages such as C++ (with CUDA) or Rust, as well as Python, with fluency in writing hardware-aware code.
- Obsessed with performance profiling, GPU kernel tuning, memory coalescing, and resource-aware scheduling.
- Passionate about automation, testability, and continuous integration in large-scale systems.
- Comfortable navigating across stack layers—from GPU drivers and kernels to orchestration layers and inference serving.
- Strong communicator, pragmatic problem-solver, and builder of clean, sustainable code.
- Ownership-driven mindset—your code runs fast, scales gracefully, and meets real-world requirements.
Nice to Have:
- Experience operating large-scale GPU inference systems in production (e.g., Triton, TensorRT, or custom GPU services).
- Experience deploying and optimizing ML/HPC workloads on GPU clusters (Kubernetes, Slurm, Ray, etc.).
- Hands-on experience with multi-GPU training/inference frameworks (e.g., PyTorch DDP, DeepSpeed, or JAX).
- Familiarity with compiler tooling (e.g., TVM, MLIR, XLA) or deep learning graph optimization.
- Successful track record of delivering technically ambitious projects in fast-paced environments.
Attributes of a Groqster
- Humility: Egos are checked at the door.
- Collaborative & Team Savvy: We make up the smartest person in the room, together.
- Growth & Giver Mindset: Learn it all versus know it all; we share knowledge generously.
- Curious & Innovative: Take a creative approach to projects, problems, and design.
- Passion, Grit, & Boldness: No-limit thinking, fueling informed risk-taking.
Compensation & Benefits
- Salary Range: $248,710 - $292,600 (based on skills, qualifications, experience, and internal benchmarks).
- Package: Competitive base salary, equity, and benefits.
Location
- Some roles may require being located at or near one of our primary sites, as indicated in the job description.
If this sounds like you, we’d love to hear from you!