Sr ML Ops Engineer at The Walt Disney Company

Nicasio, California, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Entertainment, Media

Requirements

  • Bachelor’s in Computer Science, Engineering, or a related field; a Master’s degree is preferred
  • 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focusing on ML Ops
  • Expertise in building and maintaining CI/CD pipelines for machine learning applications
  • Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes)
  • Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs
  • Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization
  • Experience managing large-scale distributed training workflows and optimizing resource allocation
  • Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning
  • Solid understanding of security best practices for machine learning systems and sensitive data handling
  • Strong scripting and programming skills in Python, Bash, or Go

Preferred Qualifications

  • Experience with data orchestration tools like DataChain, Weights & Biases, etc., for managing ML workflows
  • Hands-on experience with automated hyperparameter tuning and optimization frameworks
  • Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions for model drift and data quality checks
  • Experience integrating pre-trained foundational models and managing their deployment at scale
  • Contributions to open-source ML Ops projects or relevant research publications

Responsibilities

  • Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference
  • Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production
  • Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments
  • Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation
  • Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks
  • Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems
  • Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI
  • Implement model versioning, rollback strategies, and governance for maintaining production stability
  • Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure
  • Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability

Skills

Key technologies and capabilities for this role

Docker, Kubernetes, TorchServe, TensorFlow Serving, FastAPI, CI/CD, MLOps, DevOps, Model Deployment, Distributed Training, Monitoring, Logging, Model Versioning

Questions & Answers

Common questions about this position

Is this role remote or hybrid, and where is the office located?

This is a hybrid role requiring 2-3 days onsite at the Nicasio, CA office and occasional work from home.

What are the key skills required for the Sr ML Ops Engineer position?

Key skills include expertise in building CI/CD pipelines for ML applications, strong proficiency with Docker and Kubernetes, proficiency in deploying models using TensorFlow Serving or TorchServe, and deep understanding of cloud infrastructure like AWS, GCP, or Azure for ML workloads.

What is the salary or compensation for this role?

This information is not specified in the job description.

What education and experience are required for this position?

A Bachelor’s in Computer Science, Engineering, or a related field is required (Master’s preferred), along with 5+ years in DevOps, SRE, or related roles, including at least 2 years focusing on ML Ops.

What kind of collaboration is involved in this role?

You will collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation.

The Walt Disney Company

Leading producers & providers of entertainment and information

About The Walt Disney Company

Headquarters: N/A
Year Founded: 1923
Company Stage: N/A
Employees: 10,001+
