AI Infra Engineer (San Francisco) at Perplexity AI

San Francisco, California, United States

Compensation: $190,000 – $250,000
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Technology

Requirements

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments
  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Skills

Key technologies and capabilities for this role

Kubernetes, Slurm, Python, C++, PyTorch, AWS, distributed training, GPU clusters, YAML, APIs, monitoring, autoscaling, HPC

Questions & Answers

Common questions about this position

What is the salary range for the AI Infra Engineer position?

The salary range is $190K - $250K.

Is this role remote or does it require working in San Francisco?

This information is not specified in the job description.

What are the required skills for this AI Infra Engineer role?

Required skills include expert-level Kubernetes administration and YAML configuration management, proficiency with Slurm job scheduling and resource management, Python and C++ programming for systems automation, hands-on experience with PyTorch in distributed training, and strong understanding of networking, storage, and compute for ML workloads.

What is the team structure or company culture like at Perplexity AI?

This information is not specified in the job description.

What makes a strong candidate for this AI Infra Engineer position?

A strong candidate will have deep expertise in Kubernetes administration, hands-on experience with Slurm workload management, experience deploying distributed training systems at scale, and proficiency in Python and C++ for infrastructure automation.

Perplexity AI

Advanced answer engine providing reliable information

About Perplexity AI

Perplexity AI provides an advanced answer engine that delivers accurate and reliable responses to user queries. The platform uses current sources to ensure the information is both precise and relevant. It caters to a wide audience, including individuals looking for quick answers and businesses needing detailed information. Unlike many competitors, Perplexity AI emphasizes high-quality, source-backed answers, making it a valuable resource for users seeking trustworthy data. The company's goal is to meet the increasing demand for immediate access to reliable information, generating revenue through subscription fees, advertising, and partnerships.

Headquarters: San Francisco, California
Year Founded: 2022
Total Funding: $890M
Company Stage: Late-stage VC
Industries: Data & Analytics, Consumer Software
Employees: 201-500

Benefits

Health Insurance
Dental Insurance
Vision Insurance
401(k) Retirement Plan
Company Equity

Risks

Pending copyright infringement class action poses legal and financial challenges.
Competition from Google's AI Mode could impact user retention and market share.
Otterly.AI's brand visibility tool may pressure Perplexity to maintain high performance.

Differentiation

Perplexity AI integrates large language models with search engines for precise responses.
The platform offers an open-source environment, enhancing public access to AI tools.
Perplexity's strategic acquisition of Carbon boosts its data connectivity capabilities.

Upsides

Partnership with Tripadvisor enhances travel planning with personalized recommendations.
$500M funding round increases valuation to $9 billion, supporting growth and innovation.
Integration with FactSet attracts financial clients with enhanced data accessibility.
