Senior Site Reliability Engineer — GPU Infrastructure at Genmo

San Francisco, California, United States

Compensation: Not specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, Machine Learning

Requirements

  • BS/MS/PhD in CS, EE, or related field
  • 3+ yrs SRE/DevOps in production
  • 2+ yrs managing large Kubernetes fleets
  • Expert-level Kubernetes experience
  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible)
  • Track record of shipping and operating large-scale infrastructure with high reliability and clear communication

Responsibilities

  • Own the design and day-to-day operation of GPU clusters that train and serve frontier generative models
  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi-cluster federation
  • Define and implement Infrastructure-as-Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux
  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes
  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM
  • Optimize high-performance networking (InfiniBand/RDMA) and debug perf bottlenecks
  • Run and continuously improve the 24×7 on-call rotation; lead post-incident reviews
  • Partner with researchers and engineers, communicate crisply, and ship with a high-ownership mindset

Skills

Key technologies and capabilities for this role

Kubernetes, Terraform, Helm, Ansible, Python, Bash, Prometheus, Grafana, OpenTelemetry, eBPF, NVIDIA DCGM, InfiniBand, RDMA, Argo CD, Flux, Slurm, Kueue, AWS, GCP, Azure

Questions & Answers

Common questions about this position

What is the salary for this Senior Site Reliability Engineer position?

This information is not specified in the job description.

Is this a remote position or is there a required location?

The listing gives San Francisco, California as the location; whether remote work is possible is not specified in the job description.

What are the minimum qualifications and key skills required for this role?

Minimum qualifications include a BS/MS/PhD in CS, EE, or a related field; 3+ years of SRE/DevOps work in production; 2+ years managing large Kubernetes fleets; expert-level Kubernetes experience; proficiency in Python, Bash, and IaC tools (Terraform, Helm, Ansible); and a track record of shipping and operating large-scale infrastructure with high reliability and clear communication.

What will I be doing in this Senior SRE role at Genmo?

You will own the design and operation of GPU clusters; lead Kubernetes operations, including GPU scheduling and cluster upgrades; implement Infrastructure-as-Code with Terraform, Helm, and Ansible alongside GitOps workflows; build CI/CD pipelines; develop the observability stack; optimize networking; run the on-call rotation; and partner with researchers and engineers.

What nice-to-have experiences make a candidate stand out for this position?

Nice-to-haves include multi-cluster/multi-cloud production experience (AWS, GCP, Azure, bare-metal), hands-on experience with containerized GPU stacks (nvidia-container-toolkit, GPU Operator), experience with GPU schedulers like Slurm or Kueue, familiarity with CI/CD tooling (GitHub Actions, BuildKit), and prior work with distributed training, model-serving, or ML/GPU workloads.

Genmo

AI tools for multimedia content creation

About Genmo

Genmo.ai specializes in providing AI tools for generating and editing multimedia content, including images, videos, and presentations. Users can upload images and animate specific parts, like transforming a static sky into a timelapse, or create entire movies by refining ideas, generating scenes, and selecting transitions. The platform caters to both individual content creators and businesses, operating on a subscription model with various service tiers. Genmo.ai differentiates itself by continuously enhancing its technology and focusing on user intent, ensuring that clients have powerful tools to realize their creative projects.

Headquarters: San Francisco, California
Year Founded: N/A
Total Funding: $29.2M
Company Stage: Early VC
Industries: Consumer Software, AI & Machine Learning
Employees: 1-10

Risks

Server crashes during Mochi-1 launch could harm customer trust and satisfaction.
Open-source nature of Mochi-1 may lead to increased competition from developers.
Major tech players entering generative AI market could overshadow Genmo's offerings.

Differentiation

Genmo.ai offers unique AI tools for animating images and generating entire movies.
The platform supports both B2B and B2C models, catering to diverse client needs.
Genmo.ai's subscription model provides flexible access to advanced multimedia editing features.

Upsides

Launch of Mochi-1 model positions Genmo as a competitor to industry leaders.
Rising demand for AI-driven video editing boosts Genmo's market potential.
Subscription-based revenue model ensures steady income and opportunities for upselling.
