ML Systems Engineer, Infrastructure & Cloud at Basis

New York, New York, United States

Compensation: Not Specified
Experience Level: Mid-level (3 to 4 years), Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: AI Research, Nonprofit, Machine Learning

Requirements

  • Demonstrated expertise in ML systems engineering, such as managing distributed training jobs across hundreds of GPUs, debugging and fixing numerical instabilities in large-scale training, building infrastructure for reproducible ML experiments, and optimizing training throughput and resource utilization
  • Deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems
  • Strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements
  • Understanding of the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines
  • Skilled at debugging complex failures across the stack, including GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems
  • Value documentation and knowledge sharing, maintaining comprehensive logs of issues, solutions, and lessons learned
  • Progress with autonomy while coordinating closely with researchers, anticipating infrastructure needs, preventing problems, and responding quickly to issues

Responsibilities

  • Own distributed training infrastructure including job launchers, checkpointing systems

Skills

Key technologies and capabilities for this role

ML Systems, Distributed Training, GPU Clusters, Cloud Infrastructure, DevOps, Numerical Debugging, Compute Optimization, Training Frameworks, Cloud Administration, Security Compliance

Questions & Answers

Common questions about this position

What skills are required for the ML Systems Engineer role?

Required skills include demonstrated expertise in ML systems engineering such as managing distributed training jobs across hundreds of GPUs, debugging numerical instabilities, and building reproducible ML infrastructure; deep knowledge of PyTorch/JAX distributed strategies like DDP, FSDP, ZeRO; strong cloud skills in AWS/GCP/Azure, Terraform, Kubernetes; and understanding of the full ML stack from hardware to evaluation pipelines.

What is the compensation for this position?

This information is not specified in the job description.

Is this role remote or does it require office work?

This information is not specified in the job description.

What is the company culture like at Basis?

Basis emphasizes a 'logbook culture' for documenting issues and solutions, treats operational excellence as a first-class concern, and is building a collaborative organization that puts human values first.

What makes a strong candidate for this ML Systems Engineer position?

Strong candidates combine a deep understanding of ML systems with operational excellence. They have experience running distributed training at scale, debugging numerical instabilities, and managing cloud infrastructure, and they are passionate about enabling reproducible research and optimizing compute costs.

Basis

Platform for developing financial applications

About Basis

Basis provides a platform that assists businesses in creating financial applications. The platform includes a variety of tools and integrations that simplify the process of building, testing, and deploying these applications. Clients, which range from financial institutions to fintech startups, can access the platform through a subscription model, paying for its features and support. Basis distinguishes itself from competitors by focusing on streamlining the development process and offering custom solutions tailored to the specific needs of its clients. The company's goal is to empower businesses to innovate and enhance their financial operations through improved application offerings.

Headquarters: 1688 Pine St UNIT E211, San Francisco, CA 94109, USA
Year Founded: 2022
Total Funding: $6M
Company Stage: Seed
Industries: Enterprise Software, Fintech
Employees: 1-10

Risks

Emerging fintech startups offer similar tools at lower costs.
Regulatory scrutiny on data privacy may increase compliance costs.
Economic downturns could reduce demand for Basis's services.

Differentiation

Basis provides real-time cash flow profiles for B2B lenders.
The platform integrates data from revenue, accounting, and banking sources.
Basis offers a subscription-based model with premium services and custom solutions.

Upsides

Increased demand for real-time financial data analytics benefits Basis's offerings.
The rise of embedded finance aligns with Basis's focus on lending stacks.
Open banking trends enhance Basis's cash flow profile accuracy.
