Head of Inference Kernels at Etched.ai

San Jose, California, United States

Compensation: Not Specified
Experience Level: Expert & Leadership (9+ years), Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Semiconductors, Hardware

Requirements

  • Experience designing and optimizing GPU kernels for deep learning using CUDA and assembly (ASM). Experience with low-level programming to maximize performance for AI operations, leveraging tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
  • Deep fluency with transformer inference architecture, optimization levers, and full-stack systems (e.g., vLLM, custom runtimes). History of delivering tangible perf wins on GPU hardware or custom AI accelerators
  • Solid understanding of roofline models relating compute throughput, memory bandwidth, and interconnect performance (see the sketch after this list)
  • Experience running large-scale AI workloads on heterogeneous compute clusters, optimizing for efficiency and scalability
  • Scopes projects crisply, sets aggressive but realistic milestones, and drives technical decision-making across the team. Anticipates blockers and shifts resources proactively
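The roofline requirement above reduces to a one-line formula: attainable throughput is the lesser of peak compute and what memory bandwidth can feed at a given arithmetic intensity. Below is a minimal Python sketch; the peak compute and bandwidth figures are invented for illustration and are not Sohu or B200 specifications.

```python
# Minimal roofline sketch. Peak numbers are illustrative assumptions only.
PEAK_TFLOPS = 1000.0   # assumed peak compute, TFLOP/s
PEAK_BW_TBS = 3.0      # assumed memory bandwidth, TB/s

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Roofline: attainable = min(peak compute, bandwidth * arithmetic intensity).
    TB/s * FLOP/byte = TFLOP/s, so no unit conversion is needed."""
    return min(PEAK_TFLOPS, PEAK_BW_TBS * arithmetic_intensity)

# Decode-phase GEMV reads each fp16 weight (2 bytes) for one multiply-add
# (~1 FLOP/byte), so it is memory-bound; large-batch prefill GEMMs reuse
# weights across the batch (hundreds of FLOPs/byte) and become compute-bound.
for name, ai in [("decode GEMV", 1.0), ("prefill GEMM", 400.0)]:
    print(f"{name:12s} AI={ai:6.1f} FLOP/B -> {attainable_tflops(ai):7.1f} TFLOP/s")
```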

Responsibilities

  • Architect Best-in-Class Inference Performance on Sohu: Deliver continuous batching throughput exceeding the NVIDIA B200 by ≥10x on priority workloads
  • Develop Best-in-Class Inference Mega Kernels: Develop complex, fused kernels (covering basics like reordering and fusing, but also more advanced work such as overlapping the computation and transmission of intermediate values between sequential matmuls) that increase chip utilization and reduce inference latency, and validate these optimizations through benchmarking and regression testing in production pipelines (see the first sketch after this list)
  • Architect Model Mapping Strategies: Develop system-level optimizations using a mix of techniques such as tensor parallelism and expert parallelism for optimal performance
  • Hardware-Software Co-design of Inference-time Algorithmic Innovation: Develop and deploy production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading; see the second sketch after this list)
  • Build Scalable Team and Roadmap: Grow and retain a team of high-performing inference optimization engineers
  • Cross-Functional Performance Alignment: Ensure the inference stack and performance goals are aligned with the software infrastructure teams (e.g., runtime and scheduling support), GTM (e.g., latency SLAs, workload targets), and hardware teams (e.g., instruction design, memory bandwidth) for future generations of our hardware
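To make the mega-kernel bullet concrete, here is a minimal Python/NumPy sketch of the idea behind fusing sequential matmuls: the intermediate tile is consumed immediately instead of being materialized as a full tensor in off-chip memory. The shapes and tile size are arbitrary illustrative choices, and NumPy only models the dataflow, not an actual on-chip kernel.

```python
import numpy as np

def fused_sequential_matmul(x, w1, w2, tile=128):
    """Compute (x @ w1) @ w2 in row tiles so the intermediate x @ w1 lives
    only in a small temporary (standing in for on-chip SRAM) and is never
    materialized as a full tensor in slow off-chip memory."""
    out = np.empty((x.shape[0], w2.shape[1]), dtype=x.dtype)
    for i in range(0, x.shape[0], tile):
        t = x[i:i + tile] @ w1        # intermediate tile stays "on chip"
        out[i:i + tile] = t @ w2      # consumed immediately
    return out

x  = np.random.randn(1024, 512).astype(np.float32)
w1 = np.random.randn(512, 2048).astype(np.float32)
w2 = np.random.randn(2048, 512).astype(np.float32)
ref = (x @ w1) @ w2                   # unfused: full intermediate hits memory
assert np.allclose(fused_sequential_matmul(x, w1, w2), ref, rtol=1e-4, atol=1e-2)
```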
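And a toy sketch of greedy speculative decoding, one of the inference-time techniques named above. The draft and target models here are deterministic stand-ins, not real LMs, and a real engine would verify all proposed tokens in a single batched target forward pass.

```python
import numpy as np

VOCAB = 32

def toy_model(seed):
    """Deterministic toy LM: context -> logits over VOCAB tokens.
    A stand-in for a real draft/target model; sizes are illustrative."""
    def logits(ctx):
        h = hash((seed, tuple(ctx))) % (2**32)
        return np.random.default_rng(h).standard_normal(VOCAB)
    return logits

draft, target = toy_model(1), toy_model(2)

def greedy(model, ctx, n):
    """Greedy-decode n tokens from a toy model."""
    out = list(ctx)
    for _ in range(n):
        out.append(int(np.argmax(model(out))))
    return out[len(ctx):]

def speculative_step(ctx, k=4):
    """One greedy speculative-decoding step: the cheap draft proposes k
    tokens; the target checks them and we keep the longest agreeing prefix
    plus one corrected token. In a real engine the k target checks are one
    batched pass; the toy model scores one position at a time for clarity."""
    proposal = greedy(draft, ctx, k)
    accepted = []
    for tok in proposal:
        t = int(np.argmax(target(ctx + accepted)))
        accepted.append(t)
        if t != tok:          # first disagreement: keep target's token, stop
            break
    return accepted           # 1..k tokens per target "pass"

print(speculative_step([3, 1, 4], k=4))
```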

Skills

Inference Kernels
Transformer Models
Continuous Batching
Fused Kernels
Tensor Parallelism
Expert Parallelism
Hardware-Software Co-design
Speculative Decoding
Parallel Decoding
Prefill-Decode Disaggregation
Matmuls
Benchmarking
ASIC
Llama-3
Deepseek-R1
Qwen-3
Stable Diffusion

Etched.ai

Develops servers for transformer inference

About Etched.ai

Etched builds servers for transformer inference, with the transformer architecture integrated directly into its chips for highly efficient, transformer-specialized compute.

Headquarters: Cupertino, CA, USA
Year Founded: 2022
Total Funding: $5.4M
Company Stage: Seed
Industries: Hardware
Employees: 11-50
