AI Infra Engineer (San Francisco) at Perplexity AI

San Francisco, California, United States


Compensation: $190,000 – $250,000

Experience Level: Senior (5 to 8 years)

Job Type: Full Time

Visa: Unknown

Industries: AI, Technology

Requirements

Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
Experience with deploying and managing distributed training systems at scale
Deep understanding of container orchestration and distributed systems architecture
High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)
Experience managing GPU clusters and optimizing compute resource utilization
Expert-level Kubernetes administration and YAML configuration management
Proficiency with Slurm job scheduling, resource management, and cluster configuration
Python and C++ programming with focus on systems and infrastructure automation
Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
Strong understanding of networking, storage, and compute resource management for ML workloads
Experience developing APIs and managing distributed systems for both batch and real-time workloads
Solid debugging and monitoring skills with expertise in observability tools for containerized environments
Demonstrated experience managing large-scale Kubernetes deployments in production environments
Proven track record with Slurm cluster administration and HPC workload management
Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
Experience supporting both long-running training jobs and high-availability inference services
Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

Responsibilities

Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
Manage and optimize Slurm-based HPC environments for distributed training of large language models
Develop robust APIs and orchestration systems for both training pipelines and inference services
Implement resource scheduling and job management systems across heterogeneous compute environments
Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Skills

Key technologies and capabilities for this role

Kubernetes, Slurm, Python, C++, PyTorch, AWS, distributed training, GPU clusters, YAML, APIs, monitoring, autoscaling, HPC

Questions & Answers

Common questions about this position

What is the salary range for the AI Infra Engineer position?

The salary range is $190K - $250K.

Is this role remote or does it require working in San Francisco?

This information is not specified in the job description.

What are the required skills for this AI Infra Engineer role?

Required skills include expert-level Kubernetes administration and YAML configuration management, proficiency with Slurm job scheduling and resource management, Python and C++ programming for systems automation, hands-on experience with PyTorch in distributed training, and strong understanding of networking, storage, and compute for ML workloads.

What is the team structure or company culture like at Perplexity AI?

This information is not specified in the job description.

What makes a strong candidate for this AI Infra Engineer position?

A strong candidate will have deep expertise in Kubernetes administration, hands-on experience with Slurm workload management, experience deploying distributed training systems at scale, and proficiency in Python and C++ for infrastructure automation.

Perplexity AI

Advanced answer engine providing reliable information

About Perplexity AI

Perplexity AI provides an advanced answer engine that delivers accurate and reliable responses to user queries. The platform uses current sources to ensure the information is both precise and relevant. It caters to a wide audience, including individuals looking for quick answers and businesses needing detailed information. Unlike many competitors, Perplexity AI emphasizes high-quality, source-backed answers, making it a valuable resource for users seeking trustworthy data. The company's goal is to meet the increasing demand for immediate access to reliable information, generating revenue through subscription fees, advertising, and partnerships.

Headquarters: San Francisco, California

Year Founded: 2022

Total Funding: $890M

Company Stage: Late-stage VC

Industries: Data & Analytics, Consumer Software

Employees: 201-500

Benefits

Health Insurance

Dental Insurance

Vision Insurance

401(k) Retirement Plan

Company Equity

Risks

Pending copyright infringement class action poses legal and financial challenges.

Competition from Google's AI Mode could impact user retention and market share.

Otterly.AI's brand visibility tool may pressure Perplexity to maintain high performance.

Differentiation

Perplexity AI integrates large language models with search engines for precise responses.

The platform offers an open-source environment, enhancing public access to AI tools.

Perplexity's strategic acquisition of Carbon boosts its data connectivity capabilities.

Upsides

Partnership with Tripadvisor enhances travel planning with personalized recommendations.

$500M funding round increases valuation to $9 billion, supporting growth and innovation.

Integration with FactSet attracts financial clients with enhanced data accessibility.
