Pod Software Engineer at Etched.ai

Cupertino, California, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, High Performance Computing

Requirements

  • Strong programming skills in C/C++
  • Experience with at least one scripting language (e.g., Python, Bash, Go)
  • Strong experience with device-to-device networking technologies, including RDMA, RoCE, GPUDirect, queue pairs, completion queues, and transport types
  • Solid understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)

Responsibilities

  • Design, develop, and implement RDMA-based network peering for high-bandwidth, low-latency communication
  • Work across operating systems, kernel drivers, embedded software, and system software
  • Develop tests to qualify host processors, NICs, and device network interfaces
  • Furnish burn-in teams with tests representing real-world use cases and extreme-load stress testing
  • Define the key metrics that system software should collect to support high availability and performance
  • Analyze performance deviations and optimize network stack configurations
  • Propose kernel tuning parameters for low-latency, high-bandwidth inference workloads
  • Design and execute automated qualification tests for RDMA NICs and interconnects
  • Identify and root-cause firmware, driver, and hardware issues impacting RDMA performance and reliability
  • Collaborate with ODMs and silicon vendors to validate new RDMA features
  • Implement and validate peer RDMA support for GPU-to-GPU and accelerator-to-accelerator communication
  • Modify kernel drivers and user-space libraries to optimize direct memory access
  • Profile and benchmark inter-node RDMA latency and bandwidth
  • Optimize NIC and switch configurations for throughput, congestion control, and reliability

Skills

C++
C
Python
Bash
Go
RDMA
RoCE
GPUDirect
Linux
Git
Kernel Drivers

Etched.ai

Develops servers for transformer inference

About Etched.ai

Etched.ai develops servers for transformer inference. Its chips integrate the transformer architecture directly in hardware, with the aim of making inference highly efficient.

Headquarters: Cupertino, CA, USA
Year Founded: 2022
Total Funding: $5.4M
Company Stage: Seed
Industries: Hardware
Employees: 11-50
