Sr ML Ops Engineer at The Walt Disney Company

Nicasio, California, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Entertainment, Media

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field; Master’s degree preferred
  • 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focused on ML Ops
  • Expertise in building and maintaining CI/CD pipelines for machine learning applications
  • Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes)
  • Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs
  • Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization
  • Experience managing large-scale distributed training workflows and optimizing resource allocation
  • Familiarity with tools such as MLflow, DVC, or Weights & Biases for data and model tracking and versioning
  • Solid understanding of security best practices for machine learning systems and sensitive data handling
  • Strong scripting and programming skills in Python, Bash, or Go

Preferred Qualifications

  • Experience with data orchestration tools such as DataChain or Weights & Biases for managing ML workflows
  • Hands-on experience with automated hyperparameter tuning and optimization frameworks
  • Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions for model drift and data quality checks
  • Experience integrating pre-trained foundation models and managing their deployment at scale
  • Contributions to open-source ML Ops projects or relevant research publications

Responsibilities

  • Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference
  • Design and optimize CI/CD pipelines specifically tailored for machine learning workflows, ensuring efficient delivery from research to production
  • Implement robust monitoring and logging systems to track model performance and identify potential issues in production environments
  • Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation
  • Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks
  • Containerize machine learning models and applications using Docker and deploy them via Kubernetes or equivalent orchestration systems
  • Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI
  • Implement model versioning, rollback strategies, and governance for maintaining production stability
  • Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure
  • Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability

Skills

Docker
Kubernetes
TorchServe
TensorFlow Serving
FastAPI
CI/CD
MLOps
DevOps
Model Deployment
Distributed Training
Monitoring
Logging
Model Versioning

The Walt Disney Company

Leading producers & providers of entertainment and information

About The Walt Disney Company

Headquarters: N/A
Year Founded: 1923
Company Stage: N/A
Employees: 10,001+
