AI Research Engineer, Enterprise Evaluations at Scale AI

New York, New York, United States

Compensation: Not Specified
Experience Level: Junior (1 to 2 years), Mid-level (3 to 4 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, Technology

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, a related field, or equivalent practical experience
  • 2+ years of experience in Machine Learning or Applied Research, focused on applied ML systems or evaluation infrastructure
  • Hands-on experience with Large Language Models (LLMs) and Generative AI in professional or research environments
  • Strong understanding of frontier model evaluation methodologies and the current research landscape
  • Proficiency in Python and major ML frameworks (e.g., PyTorch, TensorFlow)
  • Solid engineering and statistical analysis foundation, with experience developing data-driven methods for assessing model quality
  • Advanced degree (Master’s or Ph.D.) in Computer Science, Machine Learning, or a related quantitative field (preferred)
  • Published research in leading ML or AI conferences such as NeurIPS, ICML, ICLR, or KDD (preferred)
  • Experience designing, building, or deploying LLM-as-a-Judge frameworks or other automated evaluation systems for complex models (preferred)
  • Experience collaborating with operations or external teams to define high-quality human annotator guidelines (preferred)
  • Expertise in ML research engineering, stochastic systems, observability, or LLM-powered applications for model evaluation and analysis (preferred)
  • Experience contributing to scalable pipelines that automate the evaluation and monitoring of large-scale models and agents (preferred)
  • Familiarity with distributed computing frameworks and modern cloud infrastructure (preferred)

Responsibilities

  • Partner with Scale’s Operations team and enterprise customers to translate ambiguity into structured evaluation data, guiding the creation and maintenance of gold-standard human-rated datasets and expert rubrics that anchor AI evaluation systems
  • Analyze feedback and collected data to identify patterns, refine evaluation frameworks, and establish iterative improvement loops that enhance the quality and relevance of human-curated assessments
  • Design, research, and develop LLM-as-a-Judge autorater frameworks and AI-assisted evaluation systems. This includes creating models that critique, grade, and explain agent outputs (e.g., RLAIF, model-judging-model setups), along with scalable evaluation pipelines and diagnostic tools
  • Pursue research initiatives that explore new methodologies for automatically analyzing, evaluating, and improving the behavior of enterprise agents, pushing the boundaries of how AI systems are assessed and optimized in real-world contexts

Skills

Large Language Models
LLM-as-a-Judge
RLAIF
AI Evaluation
Machine Learning
Evaluation Frameworks
Python
Scalable Pipelines

Scale AI

AI platform for data and models

About Scale AI

Scale AI provides a platform that helps businesses develop AI applications by using their enterprise data to customize generative models. The platform includes tools for collecting, curating, and annotating data, as well as features for evaluating and optimizing models. Scale works with a variety of clients, including major tech companies like Microsoft and Meta, government agencies such as the U.S. Army and Air Force, and startups like Brex and OpenSea. What sets Scale apart from its competitors is its comprehensive suite of tools and services that focus on safely unlocking the value of AI. The company's goal is to enhance the performance of advanced language models and generative models, making AI more accessible and effective for its clients.

Headquarters: San Francisco, California
Year Founded: 2016
Total Funding: $1,558.9M
Company Stage: Series F
Industries: Data & Analytics, AI & Machine Learning
Employees: 1,001-5,000

Benefits

Health, Dental & Vision Coverage - Our health plans give you the flexibility to select the right coverage for you and eligible family members through a variety of plan options.
Easy-to-use 401(k) - Plan and invest for the future with a 401(k) via Guideline. Scale’s 401(k) plan gives you an opportunity to defer compensation for your long-term savings.
Wellness Fund - We care about the physical, mental, and emotional wellbeing of all Scaliens. Our $100/month wellness stipend can be used for gym memberships, acupuncture, meditation apps, and so much more.
Virtual Social Activities - Being remote has not stopped us from hosting fun virtual events. From trivia night to candle making, we ensure employees are fostering connections & building strong relationships.
Learning & Development - We know how important career growth is for Scaliens, so we offer a $500/year L&D stipend to help support continued development throughout your journey.
Flexible Hours - Flexible hours allow you to work when you are most productive. You can work with your manager to plan your daily work schedule.
Generous Paid Time Off - Enjoy time to travel or plan a staycation. We encourage employees to take time off to recharge and prevent burnout. Our flexible PTO policy lets each employee take planned time off as needed.
Commuter Benefits - Set aside pre-tax dollars to use on qualified transportation expenses to help ease your commute.
Parental Leave - Balancing work and family is essential, and Scale understands the importance of having adequate leave policies in place to promote a healthy home and work life.

Risks

Legal issues over labor practices could harm Scale AI's reputation and financial standing.
Controversial stance on DEI may deter potential talent and partnerships, affecting growth.
Expansion in San Francisco could increase operational costs amid uncertain economic conditions.

Differentiation

Scale AI offers comprehensive data annotation for AI, unlike competitors focusing on single modalities.
The platform supports diverse clients, from tech giants to government agencies, enhancing its versatility.
Scale AI's commitment to AI safety initiatives sets it apart in responsible AI development.

Upsides

Growing demand for AI data annotation boosts Scale AI's market potential and client base.
Expansion into defense applications opens new revenue streams and strategic partnerships.
Partnerships with AI safety organizations enhance Scale AI's reputation as a responsible AI leader.
