Senior Site Reliability Engineer - Fleet Reliability at Lambda

San Francisco, California, United States

Compensation: $155,000 – $224,000
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Technology

Requirements

  • 7+ years in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure (GPU architectures, hardware performance optimization)
  • Solid understanding of Linux-based systems in a distributed environment
  • Proficiency in Python and Go, with experience working with Software Engineering teams
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation

Responsibilities

  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect, and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic (a minimal metric-export sketch follows this list)
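
For illustration only, here is a minimal Python sketch of the kind of fleet-health metric described above: it exposes a single gauge for Prometheus to scrape using the prometheus_client library. The node inventory, probe logic, metric name, and port are assumptions made up for this sketch, not Lambda's actual tooling.

    # Hypothetical fleet-health exporter sketch; node names, the probe, and
    # the port are placeholders, not Lambda's real tooling.
    import random
    import time

    from prometheus_client import Gauge, start_http_server

    # Gauge reporting the fraction of fleet nodes currently passing health checks.
    FLEET_HEALTH = Gauge(
        "fleet_healthy_node_ratio",
        "Fraction of fleet nodes currently passing health checks",
    )

    NODES = ["gpu-node-01", "gpu-node-02", "gpu-node-03"]  # placeholder inventory

    def probe_node(node: str) -> bool:
        """Stand-in health probe; a real check might query a node agent or nvidia-smi."""
        return random.random() > 0.05  # ~95% of probes pass in this sketch

    def collect() -> None:
        healthy = sum(probe_node(n) for n in NODES)
        FLEET_HEALTH.set(healthy / len(NODES))

    if __name__ == "__main__":
        start_http_server(9100)  # expose /metrics for Prometheus to scrape
        while True:
            collect()
            time.sleep(30)

An alerting rule in Prometheus or Grafana could then fire when fleet_healthy_node_ratio drops below an agreed availability target, feeding the runbooks and automated remediations mentioned above.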

Skills

Key technologies and capabilities for this role

Python, Go, Linux, Prometheus, Grafana, SumoLogic, Ansible, Terraform, AWS, GCP, Azure, OCI, Datadog, OpenTelemetry, GPU

Questions & Answers

Common questions about this position

What is the salary range for this Senior Site Reliability Engineer position?

The salary range is $155K - $224K.

Is this role remote or hybrid, and what are the office requirements?

The role is hybrid, requiring four days per week in the San Francisco office. Lambda’s designated work-from-home day is currently Tuesday.

What are the key technical skills required for this role?

Required skills include a strong understanding of modern AI infrastructure (GPU architectures, hardware performance optimization), a solid understanding of Linux-based systems in distributed environments, proficiency in Python and Go, experience with monitoring tools such as Prometheus and Grafana, proficiency in automation tools such as Ansible and Terraform, and familiarity with cloud platforms such as OCI, AWS, GCP, and Azure.

What soft skills are needed for this position?

The role requires excellent problem-solving and troubleshooting skills, strong communication and collaboration skills, and a passion for continuous improvement and innovation.

What experience level and background make a strong candidate for this role?

Candidates need 7+ years in Site Reliability Engineering, DevOps, or a similar role. Bonus qualifications include experience with machine learning or computer hardware, containerization tools such as Docker and Kubernetes, HPC resources, chaos engineering, and compliance frameworks such as SOC 2.

Lambda

Cloud-based GPU services for AI training

About Lambda

Lambda Labs provides cloud-based services for artificial intelligence (AI) training and inference, focusing on large language models and generative AI. Its main product, the AI Developer Cloud, uses NVIDIA's GH200 Grace Hopper™ Superchip to deliver efficient and cost-effective GPU resources. Customers can access on-demand and reserved cloud GPUs, which are essential for processing large datasets quickly, with pricing starting at $1.99 per hour for NVIDIA H100 instances. Lambda Labs serves AI developers and companies that need extensive GPU deployments, offering competitive pricing and infrastructure ownership options through its Lambda Echelon service. It also provides Lambda Stack, a software package that simplifies installing and managing AI-related tools and is used by over 50,000 machine learning teams. Lambda Labs' goal is to support AI development by providing accessible and efficient cloud GPU services.

Headquarters: San Jose, California
Year Founded: 2012
Total Funding: $372.6M
Company Stage: DEBT
Industries: AI & Machine Learning
Employees: 201-500

Risks

  • Nebius' holistic cloud platform challenges Lambda's market share in AI infrastructure.
  • AWS's 896-core instance may draw customers seeking high-performance cloud solutions.
  • Existential crisis in Hermes 3 model raises concerns about Lambda's AI model reliability.

Differentiation

  • Lambda offers a cost-effective Inference API for AI model deployment without infrastructure maintenance.
  • Nvidia HGX H100 and Quantum-2 InfiniBand clusters enhance Lambda's AI model training capabilities.
  • Lambda's Hermes 3 collaboration showcases advanced AI model development expertise.

Upsides

  • Inference API launch attracts enterprises seeking low-cost AI deployment solutions.
  • Nvidia HGX H100 clusters provide a competitive edge in high-performance AI computing.
  • Strong AI cloud service growth indicates rising demand for Lambda's GPU offerings.
