Head of Inference Kernels at Etched.ai

San Jose, California, United States

Compensation: Not Specified
Experience Level: Expert & Leadership (9+ years), Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Semiconductors, Hardware

Requirements

  • Experience designing and optimizing GPU kernels for deep learning using CUDA and assembly (ASM). Experience with low-level programming to maximize performance for AI operations, leveraging tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
  • Deep fluency with transformer inference architecture, optimization levers, and full-stack systems (e.g., vLLM, custom runtimes). History of delivering tangible perf wins on GPU hardware or custom AI accelerators
  • Solid understanding of roofline models relating compute throughput, memory bandwidth, and interconnect performance (see the sketch after this list)
  • Experience running large-scale AI workloads on heterogeneous compute clusters, optimizing for efficiency and scalability
  • Scopes projects crisply, sets aggressive but realistic milestones, and drives technical decision-making across the team. Anticipates blockers and shifts resources proactively
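The roofline requirement above reduces to a one-line formula: attainable throughput is the lesser of peak compute and what memory bandwidth can feed at a given arithmetic intensity. Below is a minimal Python sketch; the peak compute and bandwidth figures are invented for illustration and are not Sohu or B200 specifications.

```python
# Minimal roofline sketch. Peak numbers are illustrative assumptions only.
PEAK_TFLOPS = 1000.0   # assumed peak compute, TFLOP/s
PEAK_BW_TBS = 3.0      # assumed memory bandwidth, TB/s

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Roofline: attainable = min(peak compute, bandwidth * arithmetic intensity).
    TB/s * FLOP/byte = TFLOP/s, so no unit conversion is needed."""
    return min(PEAK_TFLOPS, PEAK_BW_TBS * arithmetic_intensity)

# Decode-phase GEMV reads each fp16 weight (2 bytes) for one multiply-add
# (~1 FLOP/byte), so it is memory-bound; large-batch prefill GEMMs reuse
# weights across the batch (hundreds of FLOPs/byte) and become compute-bound.
for name, ai in [("decode GEMV", 1.0), ("prefill GEMM", 400.0)]:
    print(f"{name:12s} AI={ai:6.1f} FLOP/B -> {attainable_tflops(ai):7.1f} TFLOP/s")
```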

Responsibilities

  • Architect Best-in-Class Inference Performance on Sohu: Deliver continuous batching throughput exceeding the NVIDIA B200 by ≥10x on priority workloads
  • Develop Best-in-Class Inference Mega Kernels: Develop complex, fused kernels (covering basics like reordering and fusing, but also more advanced work such as overlapping the computation and transmission of intermediate values between sequential matmuls) that increase chip utilization and reduce inference latency, and validate these optimizations through benchmarking and regression testing in production pipelines (see the first sketch after this list)
  • Architect Model Mapping Strategies: Develop system-level optimizations using a mix of techniques such as tensor parallelism and expert parallelism for optimal performance
  • Hardware-Software Co-design of Inference-time Algorithmic Innovation: Develop and deploy production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading; see the second sketch after this list)
  • Build Scalable Team and Roadmap: Grow and retain a team of high-performing inference optimization engineers
  • Cross-Functional Performance Alignment: Ensure the inference stack and performance goals are aligned with the software infrastructure teams (e.g., runtime and scheduling support), GTM (e.g., latency SLAs, workload targets), and hardware teams (e.g., instruction design, memory bandwidth) for future generations of our hardware
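To make the mega-kernel bullet concrete, here is a minimal Python/NumPy sketch of the idea behind fusing sequential matmuls: the intermediate tile is consumed immediately instead of being materialized as a full tensor in off-chip memory. The shapes and tile size are arbitrary illustrative choices, and NumPy only models the dataflow, not an actual on-chip kernel.

```python
import numpy as np

def fused_sequential_matmul(x, w1, w2, tile=128):
    """Compute (x @ w1) @ w2 in row tiles so the intermediate x @ w1 lives
    only in a small temporary (standing in for on-chip SRAM) and is never
    materialized as a full tensor in slow off-chip memory."""
    out = np.empty((x.shape[0], w2.shape[1]), dtype=x.dtype)
    for i in range(0, x.shape[0], tile):
        t = x[i:i + tile] @ w1        # intermediate tile stays "on chip"
        out[i:i + tile] = t @ w2      # consumed immediately
    return out

x  = np.random.randn(1024, 512).astype(np.float32)
w1 = np.random.randn(512, 2048).astype(np.float32)
w2 = np.random.randn(2048, 512).astype(np.float32)
ref = (x @ w1) @ w2                   # unfused: full intermediate hits memory
assert np.allclose(fused_sequential_matmul(x, w1, w2), ref, rtol=1e-4, atol=1e-2)
```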
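And a toy sketch of greedy speculative decoding, one of the inference-time techniques named above. The draft and target models here are deterministic stand-ins, not real LMs, and a real engine would verify all proposed tokens in a single batched target forward pass.

```python
import numpy as np

VOCAB = 32

def toy_model(seed):
    """Deterministic toy LM: context -> logits over VOCAB tokens.
    A stand-in for a real draft/target model; sizes are illustrative."""
    def logits(ctx):
        h = hash((seed, tuple(ctx))) % (2**32)
        return np.random.default_rng(h).standard_normal(VOCAB)
    return logits

draft, target = toy_model(1), toy_model(2)

def greedy(model, ctx, n):
    """Greedy-decode n tokens from a toy model."""
    out = list(ctx)
    for _ in range(n):
        out.append(int(np.argmax(model(out))))
    return out[len(ctx):]

def speculative_step(ctx, k=4):
    """One greedy speculative-decoding step: the cheap draft proposes k
    tokens; the target checks them and we keep the longest agreeing prefix
    plus one corrected token. In a real engine the k target checks are one
    batched pass; the toy model scores one position at a time for clarity."""
    proposal = greedy(draft, ctx, k)
    accepted = []
    for tok in proposal:
        t = int(np.argmax(target(ctx + accepted)))
        accepted.append(t)
        if t != tok:          # first disagreement: keep target's token, stop
            break
    return accepted           # 1..k tokens per target "pass"

print(speculative_step([3, 1, 4], k=4))
```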

Skills

Inference Kernels
Transformer Models
Continuous Batching
Fused Kernels
Tensor Parallelism
Expert Parallelism
Hardware-Software Co-design
Speculative Decoding
Parallel Decoding
Prefill-Decode Disaggregation
Matmuls
Benchmarking
ASIC
Llama-3
Deepseek-R1
Qwen-3
Stable Diffusion

Etched.ai

Develops servers for transformer inference

About Etched.ai

Etched builds servers for transformer inference, with the transformer architecture integrated directly into its chips for highly efficient, transformer-specialized compute.

Headquarters: Cupertino, CA, USA
Year Founded: 2022
Total Funding: $5.4M
Company Stage: Seed
Industries: Hardware
Employees: 11-50
