LLM Inference Engineer at Hippocratic AI

Palo Alto, California, United States

Compensation: Not Specified
Experience Level: Mid-level (3 to 4 years)
Job Type: Full Time
Visa: Unknown
Industries: Healthcare

Requirements

  • Extensive hands-on experience with state-of-the-art inference optimization techniques
  • A track record of deploying efficient, scalable LLM systems in production environments
  • 2+ years of experience optimizing LLM inference systems at scale
  • Proven expertise with distributed serving architectures for large language models
  • Hands-on experience implementing quantization techniques for transformer models
  • Strong understanding of modern inference optimization methods, including draft-model speculative decoding and EAGLE-style approaches (a minimal sketch follows this list)
  • Proficiency in Python and C++
  • Experience with CUDA programming and GPU optimization (familiarity required, expert-level not necessary)
  • Contributions to open-source inference frameworks such as vLLM, SGLang, or TensorRT-LLM (preferred)
  • Experience with custom CUDA kernels (preferred)
  • Track record of deploying inference systems in production environments (preferred)
  • Deep understanding of systems-level performance optimization (preferred)
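
For context on the speculative-decoding requirement above, here is a minimal greedy speculative-decoding loop in plain Python. The draft_next and target_next functions are hypothetical toy stand-ins for real models; in a production engine the verification step is a single batched forward pass through the target model, and EAGLE-style methods draft from the target's own hidden states rather than a separate small model.

    # Hypothetical toy "models": each maps a token sequence to one greedy
    # next token over a tiny integer vocabulary.
    def draft_next(tokens):
        return (sum(tokens) * 3 + 1) % 8

    def target_next(tokens):
        # Agrees with the draft most of the time, diverges occasionally.
        guess = (sum(tokens) * 3 + 1) % 8
        return guess if len(tokens) % 3 else (guess + 2) % 8

    def speculative_decode(prompt, n_new, k=4):
        out = list(prompt)
        while len(out) - len(prompt) < n_new:
            # 1) The cheap draft model proposes k tokens autoregressively.
            ctx, proposal = list(out), []
            for _ in range(k):
                t = draft_next(ctx)
                proposal.append(t)
                ctx.append(t)
            # 2) The target verifies the proposal; in a real engine this
            #    is one batched pass, not k sequential calls.
            ctx, accepted = list(out), []
            for t in proposal:
                want = target_next(ctx)
                accepted.append(want)
                ctx.append(want)
                if want != t:      # first disagreement: keep the target's
                    break          # token and drop the rest of the draft
            else:
                accepted.append(target_next(ctx))  # all accepted: bonus token
            out += accepted        # always >= 1 token per target pass
        return out[:len(prompt) + n_new]

    print(speculative_decode([1, 2, 3], n_new=10))

Each outer iteration costs one target pass but can emit up to k+1 tokens, which is where the latency win comes from when the draft's acceptance rate is high.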

Responsibilities

  • Design and implement multi-node serving architectures for distributed LLM inference
  • Optimize multi-LoRA serving systems
  • Apply advanced quantization techniques (FP4/FP6) to reduce model footprint while preserving quality (a round-trip sketch follows this list)
  • Implement speculative decoding and other latency optimization strategies
  • Develop disaggregated serving solutions with optimized caching strategies for the prefill and decode phases (see the second sketch after this list)
  • Continuously benchmark and improve system performance across various deployment scenarios and GPU types
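
To make the quantization bullet concrete, below is a minimal symmetric per-channel weight-quantization round trip in Python/NumPy. This is an illustrative integer-grid sketch only: the FP4/FP6 formats named above are block-scaled floating-point types normally handled by the serving stack (e.g. TensorRT-LLM), and the function names here are hypothetical.

    import numpy as np

    def quantize_per_channel(w, bits=4):
        # One scale per output channel (row), symmetric around zero.
        qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 16).astype(np.float32)
    q, s = quantize_per_channel(w, bits=4)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"mean abs round-trip error: {err:.4f}")

The "preserving quality" part of the bullet is exactly the trade-off this round-trip error measures: finer scale granularity (per channel or per block) buys accuracy at the cost of extra scale metadata.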
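The disaggregated-serving bullet splits the compute-bound prefill phase from the memory-bound decode phase. Below is a hedged toy sketch of that handoff; PrefillWorker, DecodeWorker, and the string-valued "KV blocks" are hypothetical stand-ins, whereas a real deployment transfers paged KV-cache blocks between separate GPU pools over NVLink or RDMA.

    from dataclasses import dataclass, field

    @dataclass
    class KVCache:
        tokens: list
        blocks: list = field(default_factory=list)  # stand-in for GPU KV blocks

    class PrefillWorker:
        def run(self, request_id, prompt):
            # Compute-bound: one large forward pass builds the full KV cache.
            return KVCache(tokens=list(prompt),
                           blocks=[f"kv({t})" for t in prompt])

    class DecodeWorker:
        def __init__(self):
            self.cache = {}

        def admit(self, request_id, kv):
            # Decode resumes from the transferred cache, so this pool
            # never re-runs prefill.
            self.cache[request_id] = kv

        def step(self, request_id):
            kv = self.cache[request_id]
            tok = f"tok{len(kv.blocks)}"   # stand-in for one decode step
            kv.blocks.append(f"kv({tok})")
            return tok

    prefill, decode = PrefillWorker(), DecodeWorker()
    decode.admit("req-1", prefill.run("req-1", [1, 2, 3]))
    print([decode.step("req-1") for _ in range(3)])

Keeping the two phases on separate worker pools lets each be batched and scaled to its own bottleneck, which is the point of the caching strategies the bullet refers to.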

Skills

Key technologies and capabilities for this role

Python, C++, CUDA, LLM Inference, Quantization, Speculative Decoding, Distributed Serving, Multi-LoRA, GPU Optimization, Transformer Models

Questions & Answers

Common questions about this position

Is this position remote?

Yes, this is a remote position.

What is the salary for this LLM Inference Engineer role?

This information is not specified in the job description.

What skills are required for this role?

Required skills include 2+ years optimizing LLM inference systems at scale, expertise with distributed serving architectures, hands-on experience with quantization techniques for transformer models, proficiency in Python and C++, and experience with CUDA programming and GPU optimization.

What is the company culture like at Hippocratic AI?

The team comprises ex-researchers from Microsoft, Meta, Nvidia, Apple, Stanford, Johns Hopkins, and Hugging Face, focused on building safety-focused LLMs for healthcare, with visionary leadership and strong backing from top investors.

What makes a strong candidate for this position?

A strong candidate has extensive hands-on experience with state-of-the-art inference optimization techniques and a track record of deploying efficient, scalable LLM systems in production, plus preferred experience with open-source frameworks like vLLM or custom CUDA kernels.


About Hippocratic AI

Headquarters: N/A
Year Founded: N/A
Company Stage: N/A
