Inference Software Engineer - Collectives at Etched.ai

San Jose, California, United States

Compensation: $175,000 – $275,000
Experience Level: Junior (1 to 2 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, Semiconductor, Software

Requirements

Candidates should have strong proficiency in Rust and/or C++, familiarity with PyTorch and/or JAX, and experience designing or optimizing collective communication (e.g., NCCL, MPI collectives, XLA collectives). Solid systems knowledge is required, including Linux internals, accelerator architectures (e.g., GPUs, TPUs), high-speed interconnects (e.g., NVLink, InfiniBand), and RDMA, along with a strong understanding of distributed systems concepts, algorithms, and challenges. Experience analyzing performance traces and logs from distributed systems and ML workloads is also necessary.

Responsibilities

The Inference Software Engineer - Collectives will formalize and optimize collectives (e.g., Send/Receive, AllReduce, Broadcast) and collaborate across systems and research teams to bring MoE architectures to Sohu’s runtime, optimizing expert routing and communication layers using Sohu’s collectives. The role also contributes to scaling and enhancing Sohu’s runtime, including multi-node inference, intra-node execution, state management, and robust error handling, and develops tools for performance profiling and debugging that identify bottlenecks and correctness issues.

Skills

Rust
C++
PyTorch
JAX
Systems Knowledge
Linux Internals
Accelerator Architectures
High-Speed Interconnects
NCCL
MPI Collectives
XLA Collectives
Performance Profiling
Debugging

Etched.ai

Develops servers for transformer inference

About Etched.ai

The company develops servers for transformer inference. Its chips integrate the transformer architecture directly in hardware, with the aim of highly efficient inference.

Headquarters: Cupertino, CA, USA
Year Founded: 2022
Total Funding: $5.4M
Company Stage: Seed
Industries: Hardware
Employees: 11-50
