[Remote] Senior ML Systems Engineer, Frameworks & Tooling at Cohere

London, England, United Kingdom

Compensation: Not Specified

Experience Level: Senior (5 to 8 years)

Job Type: Full Time

Visa: Unknown

Industries: AI, Machine Learning

Requirements

Strong engineering experience in large-scale distributed training or HPC systems
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
Experience working with containerized environments (Docker, Singularity/Apptainer)
A track record of building tools that increase developer velocity for ML teams
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
Strong collaboration skills — you’ll work closely with infra, research, and deployment teams

Nice to Have

Experience with training LLMs or other large transformer architectures
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.)
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches)
Experience with data pipeline optimization, sharded datasets, or caching strategies
Background in performance engineering, profiling, or low-level systems
Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

Responsibilities

Build and own the training framework responsible for large-scale LLM training
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100)
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training
Investigate and resolve performance bottlenecks across the ML systems stack
Build robust systems that ensure reproducible, debuggable, large-scale runs

Skills

Key technologies and capabilities for this role

ML Systems, Distributed Training, Data Parallelism, Tensor Parallelism, Pipeline Parallelism, FSDP, ZeRO, Memory Management, Checkpointing, HPC, Slurm, GPU, LLM Training, Monitoring, Logging, Debugging

Questions & Answers

Common questions about this position

Is this position remote?

Yes, this is a remote position.

What skills are required for this Senior ML Systems Engineer role?

Required skills include strong engineering experience in large-scale distributed training or HPC systems; deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops; experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar); comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines; and experience with containerized environments such as Docker or Singularity/Apptainer.

What is the salary or compensation for this role?

This information is not specified in the job description.

What is the company culture like at Cohere?

Cohere obsesses over what they build, works hard and moves fast to serve customers, values a team of the best researchers, engineers, and designers passionate about their craft, and believes diverse perspectives are essential for great products.

What makes a strong candidate for this position?

A strong candidate has a track record of building tools that increase developer velocity for ML teams, excellent judgment on trade-offs like performance vs complexity, and strong collaboration skills to work with infra, research, and deployment teams.

Cohere

Provides NLP tools and LLMs via API

About Cohere

Cohere provides advanced Natural Language Processing (NLP) tools and Large Language Models (LLMs) through a user-friendly API. Their services cater to a wide range of clients, including businesses that want to improve their content generation, summarization, and search functions. Cohere's business model focuses on offering scalable and affordable generative AI tools, generating revenue by granting API access to pre-trained models that can handle tasks like text classification, sentiment analysis, and semantic search in multiple languages. The platform is customizable, enabling businesses to create smarter and faster solutions. With multilingual support, Cohere effectively addresses language barriers, making it suitable for international use.

Headquarters: Toronto, Canada

Year Founded: 2019

Total Funding: $914.4M

Company Stage: Series D

Industries: AI & Machine Learning

Employees: 501-1,000

Risks

Competitors like Google and Microsoft may overshadow Cohere with seamless enterprise system integration.

Reliance on Nvidia chips poses risks if supply chain issues arise or strategic focus shifts.

The high cost of its AI data center project could strain financial resources if government funding is delayed.

Differentiation

Cohere's North platform outperforms Microsoft Copilot and Google Vertex AI in enterprise functions.

Rerank 3.5 model processes queries in over 100 languages, enhancing multilingual search capabilities.

Command R7B model excels in RAG, math, and coding, outperforming competitors like Google's Gemma.

Upsides

Cohere's AI data center project positions it as a key player in Canadian AI.

North platform offers secure AI deployment for regulated industries, enhancing privacy-focused enterprise solutions.

Cohere's multilingual support breaks language barriers, expanding its global market reach.
