AI/HPC Network Development Engineer - Networking at xAI

Memphis, Tennessee, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: AI, HPC

Requirements

  • Minimum of 10 years designing and operating large-scale networks, including 5 years in the Ethernet AI/HPC space
  • Deep understanding of congestion control on Ethernet; InfiniBand experience is an added bonus
  • Deep understanding of AI training and inference workloads and how they operate on the network
  • Ability to use and debug NCCL and potentially commit to the library
  • Expertise in creating a portfolio of metrics for performance and operations to optimize the fleet for training and inference traffic
  • Experience with Python to automate away repetitive tasks and facilitate working with and analyzing large sets of data
  • Strong communication skills to concisely and accurately share knowledge with teammates
  • Willingness to travel significantly to Memphis for data center buildouts and to Palo Alto for team collaboration
  • Availability for team on-call rotation and helping with scaling and maintenance efforts

Responsibilities

  • Develop at hyperscale, applying deep RoCEv2 experience to optimize network performance and availability
  • Analyze and understand current network performance and availability to optimize for training models and customer inference queries
  • Spend time deep inside NCCL, building metric dashboards, and tweaking configurations to maximize performance
  • Help design the next iteration of back-end and front-end networks so that new GPU infrastructure can be built out seamlessly with minimal engineering assistance
  • Participate in significant travel to Memphis for building capacity
  • Engage in team on-call rotation and assist with scaling and maintenance efforts
  • Contribute to deployment and operations frameworks to eliminate repetitive tasks

Skills

RoCEv2
NCCL
HPC
Networking
GPU Clusters
Ethernet
Performance Optimization
Metric Dashboards
Scaling
Deployment

xAI

AI tools for research and information retrieval

About xAI

xAI develops AI tools aimed at enhancing research and information retrieval. Its main product, Grok, is designed to answer a variety of questions, including unconventional ones that other AI systems might not handle. Grok provides real-time knowledge, making it a useful resource for researchers, academics, and professionals who need quick access to relevant information. Unlike competitors, Grok stands out for its ability to suggest questions and provide nuanced answers, catering to a diverse range of inquiries. The goal of xAI is to empower users by streamlining their research processes and fostering innovation through reliable information access.

Headquarters: Burlingame, California
Year Founded: 2023
Total Funding: $11,803.1M
Company Stage: Series C
Industries: Data & Analytics, AI & Machine Learning
Employees: 1,001-5,000

Benefits

Health Insurance
Remote Work Options

Risks

Increased competition from Anthropic could challenge xAI's market position.
Legal battles involving Elon Musk may divert resources from xAI's operations.
Reliance on Nvidia GPUs poses risks if supply chain issues arise.

Differentiation

Grok answers unconventional questions, unlike many AI systems.
xAI's Grok provides real-time knowledge, enhancing research efficiency.
Grok's ability to generate striking images sets it apart in visual data processing.

Upsides

xAI secured $6 billion funding, boosting AI infrastructure and R&D.
Grok's iOS app launch expands user accessibility and engagement.
AI-driven research tools are increasingly integrated with cloud platforms, aiding xAI's growth.
