Extensive hands-on experience with state-of-the-art inference optimization techniques
A track record of deploying efficient, scalable LLM systems in production environments
2+ years of experience optimizing LLM inference systems at scale
Proven expertise with distributed serving architectures for large language models
Hands-on experience implementing quantization techniques for transformer models
Strong understanding of modern inference optimization methods, including draft-model speculative decoding and EAGLE-style approaches (see the sketch after this list)
Proficiency in Python and C++
Familiarity with CUDA programming and GPU optimization (expert-level proficiency not required)
Contributions to open-source inference frameworks such as vLLM, SGLang, or TensorRT-LLM (preferred)
Experience with custom CUDA kernels (preferred)
Track record of deploying inference systems in production environments (preferred)
Deep understanding of systems-level performance optimization (preferred)
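For reference, here is a minimal sketch of the draft-model speculative decoding pattern named above, greedy variant only. The NextTokenFn interface and both model functions are hypothetical stand-ins; production systems (including EAGLE-style decoders) verify all proposed tokens in a single batched target forward pass against full token distributions rather than one call at a time.

```python
# Illustrative sketch only: greedy draft-model speculative decoding over a
# hypothetical next-token interface. Real engines batch the verification
# step into one target forward pass and accept/reject probabilistically.
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # context -> next token (hypothetical)

def speculative_decode(
    target: NextTokenFn,       # large target model, one token per call
    draft: NextTokenFn,        # small draft model used to propose tokens
    prompt: List[Token],
    max_new_tokens: int = 32,
    k: int = 4,                # draft tokens proposed per verification round
) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1) Draft k tokens cheaply and autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Verify with the target model: keep the longest agreeing prefix,
        #    then emit the target's own token at the first disagreement.
        for t in proposal:
            expected = target(out)
            if expected == t:
                out.append(t)          # draft token accepted "for free"
            else:
                out.append(expected)   # correction from the target model
                break
    return out[: len(prompt) + max_new_tokens]
```

The latency win comes from the draft model being much cheaper per token than the target, so accepted proposals amortize the target's verification cost across several output tokens.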
Responsibilities
Design and implement multi-node serving architectures for distributed LLM inference
Optimize multi-LoRA serving systems
Apply advanced quantization techniques (FP4/FP6) to reduce model footprint while preserving quality (see the footprint example below)
Implement speculative decoding and other latency optimization strategies
Develop disaggregated serving solutions with optimized caching strategies for prefill and decoding phases
Continuously benchmark and improve system performance across various deployment scenarios and GPU types (see the benchmarking sketch below)
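As a rough illustration of the footprint reduction that low-bit quantization targets, the arithmetic below compares weight memory at different precisions. The 70B parameter count is a hypothetical example, and real deployments also budget for KV cache, activations, and per-tensor scale/zero-point metadata.

```python
# Back-of-the-envelope weight-memory footprint for a dense decoder model at
# different weight precisions. Illustrative numbers only.
def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1024**3

params = 70e9  # hypothetical 70B-parameter model
for name, bits in [("FP16", 16), ("FP8", 8), ("FP6", 6), ("FP4", 4)]:
    print(f"{name}: {weight_footprint_gib(params, bits):,.0f} GiB of weights")
# FP16 -> ~130 GiB, FP8 -> ~65 GiB, FP6 -> ~49 GiB, FP4 -> ~33 GiB
```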
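And a minimal sketch of the kind of latency/throughput measurement the benchmarking responsibility implies. generate_stream is a hypothetical streaming client, not a specific framework's API; real harnesses also sweep concurrency, sequence lengths, and hardware configurations.

```python
# Illustrative benchmark loop: measures time-to-first-token (TTFT) and decode
# throughput for a hypothetical streaming generation client.
import time
from statistics import median
from typing import Callable, Iterable, List

StreamFn = Callable[[str], Iterable[str]]  # prompt -> stream of tokens (hypothetical)

def benchmark(generate_stream: StreamFn, prompts: List[str]) -> None:
    ttfts, rates = [], []
    for prompt in prompts:
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _ in generate_stream(prompt):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_tokens += 1
        end = time.perf_counter()
        if first_token_at is None:
            continue  # skip empty generations
        ttfts.append(first_token_at - start)
        decode_time = end - first_token_at
        if n_tokens > 1 and decode_time > 0:
            rates.append((n_tokens - 1) / decode_time)  # exclude the first token
    if ttfts:
        print(f"median TTFT: {median(ttfts) * 1000:.1f} ms")
    if rates:
        print(f"median decode throughput: {median(rates):.1f} tokens/s")
```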