Machine Learning Systems Research Engineer, Agent Post-training - Enterprise GenAI at Scale AI

New York, New York, United States

Compensation: Not Specified
Experience Level: Junior (1 to 2 years), Mid-level (3 to 4 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, Generative AI, Enterprise AI

Requirements

  • At least 1-3 years of experience training LLMs in a production environment
  • Passionate about system optimization
  • Experience with post-training methods such as RLHF/RLVR and related algorithms such as PPO and GRPO
  • Demonstrated knowledge of modern GPU cluster architecture and how to operate it
  • Experience with multi-node LLM training and inference
  • Strong software engineering skills and proficiency in frameworks and tools such as CUDA, PyTorch, transformers, and flash attention
  • Strong written and verbal communication skills to operate in a cross-functional team environment
  • PhD or Master's in Computer Science or a related field

Responsibilities

  • Build, profile and optimize our training and inference framework
  • Post-train state-of-the-art models, developed both internally and by the community, to define stable post-training recipes for our enterprise engagements
  • Collaborate with ML teams to accelerate their research and development, and enable them to develop the next generation of models and data curation
  • Create a next-gen agent training algorithm for multi-agent/multi-tool rollouts

Skills

Key technologies and capabilities for this role

Machine Learning, LLM Training, RLHF, RLVR, PPO, GRPO, System Optimization, Training Framework, Inference Optimization, Post-training, Agent Training, Multi-agent Systems

Questions & Answers

Common questions about this position

What is the compensation structure for this role?

Compensation packages include base salary, equity, and benefits. The salary range displayed on each job posting reflects the minimum and maximum target for new hires, determined by work location, skills, experience, interview performance, and education. Your recruiter can share more about the specific salary range for your preferred location.

Is this position remote or does it require office work?

This information is not specified in the job description.

What skills and experience are required for this Machine Learning Systems Research Engineer role?

Ideal candidates have 1-3 years of experience training LLMs in production, experience with post-training methods like RLHF/RLVR and algorithms like PPO/GRPO, experience with multi-node LLM training and inference, proficiency in CUDA, PyTorch, transformers, and flash attention, and strong software engineering skills. A PhD or Master's in Computer Science or a related field is preferred, along with a passion for system optimization and knowledge of GPU cluster architecture.

What is the team environment like at Scale AI for this role?

You'll collaborate with ML teams in the Enterprise ML Research Lab to accelerate their research and development, working cross-functionally with other MLREs and AAIs on the Enterprise AI team serving enterprise clients.

What makes a strong candidate for this position?

Strong candidates demonstrate 1-3+ years of production LLM training experience, expertise in post-training algorithms like RLHF and PPO, and proficiency with GPU clusters and tools like PyTorch and CUDA, and hold a relevant advanced degree.

Scale AI

AI platform for data and models

About Scale AI

Scale AI provides a platform that helps businesses develop AI applications by using their enterprise data to customize generative models. The platform includes tools for collecting, curating, and annotating data, as well as features for evaluating and optimizing models. Scale works with a variety of clients, including major tech companies like Microsoft and Meta, government agencies such as the U.S. Army and Air Force, and startups like Brex and OpenSea. What sets Scale apart from its competitors is its comprehensive suite of tools and services that focus on safely unlocking the value of AI. The company's goal is to enhance the performance of advanced language models and generative models, making AI more accessible and effective for its clients.

Headquarters: San Francisco, California
Year Founded: 2016
Total Funding: $1,558.9M
Company Stage: Series F
Industries: Data & Analytics, AI & Machine Learning
Employees: 1,001-5,000

Benefits

Health, Dental & Vision Coverage - Our health plans give you the flexibility to select the right coverage for you and eligible family members through a variety of plan options.
Easy-to-use 401(k) - Plan and invest for the future with a 401(k) via Guideline. Scale’s 401(k) plan provides you an opportunity to defer compensation for your long-term savings.
Wellness Fund - We care about the physical, mental, and emotional wellbeing of all Scaliens. Our $100/month wellness stipend can be used for gym memberships, acupuncture, meditation apps, and so much more.
Virtual Social Activities - Being remote has not stopped us from hosting fun virtual events. From trivia night to candle making, we ensure employees are fostering connections & building strong relationships.
Learning & Development - We know how important career growth is for Scaliens, so we offer a $500/year L&D stipend to help support continued development throughout your journey.
Flexible Hours - Flexible hours allow you to work when you are most productive. You can work with your manager to best plan your daily work schedule.
Generous Paid Time Off - Enjoy time to travel or plan a staycation. We encourage employees to take time off to recharge and prevent burnout. We have a flexible PTO policy where each employee is afforded the flexibility to take planned time-off as needed.
Commuter Benefits - Set aside pre-tax dollars to use on qualified transportation expenses to help ease your commute.
Parental Leave - Balancing work and family is essential, and Scale understands the importance of having adequate leave policies in place to promote a healthy home and work life.

Risks

Legal issues over labor practices could harm Scale AI's reputation and financial standing.
Controversial stance on DEI may deter potential talent and partnerships, affecting growth.
Expansion in San Francisco could increase operational costs amid uncertain economic conditions.

Differentiation

Scale AI offers comprehensive data annotation for AI, unlike competitors focusing on single modalities.
The platform supports diverse clients, from tech giants to government agencies, enhancing its versatility.
Scale AI's commitment to AI safety initiatives sets it apart in responsible AI development.

Upsides

Growing demand for AI data annotation boosts Scale AI's market potential and client base.
Expansion into defense applications opens new revenue streams and strategic partnerships.
Partnerships with AI safety organizations enhance Scale AI's reputation as a responsible AI leader.
