Inference Software Engineer - Collectives at Etched.ai

San Jose, California, United States

Compensation: $175,000 – $275,000

Experience Level: Junior (1 to 2 years)

Job Type: Full Time

Visa: Unknown

Industries: Artificial Intelligence, Semiconductor, Software

Requirements

Candidates should have strong proficiency in Rust and/or C++, familiarity with PyTorch and/or JAX, and experience designing and optimizing collectives such as NCCL, MPI collectives, and XLA collectives. Solid systems knowledge is required, including Linux internals, accelerator architectures (e.g., GPUs, TPUs), high-speed interconnects (e.g., NVLink, InfiniBand), and RDMA, along with a strong understanding of distributed systems concepts, algorithms, and challenges. Experience analyzing performance traces and logs from distributed systems and ML workloads is also necessary.

Responsibilities

The Inference Software Engineer - Collectives will formalize and optimize collectives (e.g., Send/Receive, AllReduce, Broadcast) and collaborate across systems and research teams to bring MoE architectures to Sohu’s runtime, optimizing expert routing and communication layers using Sohu’s collectives. The role also contributes to scaling and enhancing Sohu’s runtime, including multi-node inference, intra-node execution, state management, and robust error handling. Finally, the engineer will develop tools for performance profiling and debugging that identify bottlenecks and correctness issues.

Skills

Rust

C++

PyTorch

JAX

Systems Knowledge

Linux Internals

Accelerator Architectures

High-Speed Interconnects

NCCL

MPI Collectives

XLA Collectives

Performance Profiling

Debugging

Etched.ai

Develops servers for transformer inference

About Etched.ai

The company builds servers for transformer inference. Its chips integrate the transformer architecture directly into the hardware, with the aim of running these models highly efficiently.

Headquarters: Cupertino, CA, USA

Year Founded: 2022

Total Funding: $5.4M

Company Stage: Seed

Industries: Hardware

Employees: 11-50