Staff Software Engineer, GPU Infrastructure (HPC) at Cohere

Canada

Compensation: Not Specified
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Machine Learning, Technology

Requirements

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments (see the sketch after this list)
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed
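To illustrate the distributed-training requirement above: a minimal, hedged sketch of a multi-GPU PyTorch job using the NCCL backend, the usual collective-communication layer over NVLink and RDMA-capable fabrics. The model, tensor shapes, and launch setup are placeholders, not a description of Cohere's actual stack.

```python
# Minimal sketch: one training step with PyTorch DistributedDataParallel (DDP)
# over the NCCL backend, as launched by `torchrun --nproc_per_node=<gpus> train.py`.
# The model and data are placeholders; real jobs add datasets, checkpointing,
# and cluster-specific NCCL/RDMA tuning.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set per process by torchrun
    torch.cuda.set_device(local_rank)

    # NCCL handles GPU-to-GPU collectives over NVLink or RDMA-capable fabrics.
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    x = torch.randn(32, 1024, device=local_rank)
    loss = ddp_model(x).sum()
    loss.backward()  # gradients are all-reduced across ranks via NCCL

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script is typically launched per node with `torchrun --nproc_per_node=<gpu_count> train.py`; multi-node runs additionally pass `--nnodes`, `--node_rank`, and a rendezvous endpoint.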

Responsibilities

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads (a minimal launch sketch follows this list)
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence
  • Participate in a 24x7 on-call rotation
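To make the Kubernetes deployment responsibility above concrete, here is a minimal sketch that submits a GPU training pod with the official Kubernetes Python client. The image, namespace, GPU count, and NCCL environment values are illustrative assumptions; production clusters rely on their own registries, the NVIDIA device plugin, schedulers, and fabric-specific tuning.

```python
# Minimal sketch: programmatically launching a GPU training pod with the official
# Kubernetes Python client. Image name, namespace, GPU count, and NCCL settings
# are illustrative assumptions, not a real cluster configuration.
from kubernetes import client, config

def launch_training_pod():
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    container = client.V1Container(
        name="trainer",
        image="example.registry/train:latest",  # hypothetical image
        command=["torchrun", "--nproc_per_node=8", "train.py"],
        env=[
            # Illustrative NCCL tuning; real values depend on the fabric (IB/RoCE).
            client.V1EnvVar(name="NCCL_DEBUG", value="WARN"),
            client.V1EnvVar(name="NCCL_SOCKET_IFNAME", value="eth0"),
        ],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "8"},  # GPUs exposed by the NVIDIA device plugin
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-train-demo", labels={"team": "research"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

if __name__ == "__main__":
    launch_training_pod()
```

In practice, workloads like this are usually expressed as declarative manifests or operator resources managed through infrastructure-as-code rather than imperative API calls; the sketch only highlights the knobs that matter here (GPU limits and NCCL environment).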

Skills

Kubernetes
GPU
TPU
HPC
ML Infrastructure
Superclusters
Multi-cloud
Scalability
Observability
On-call

Cohere

Provides NLP tools and LLMs via API

About Cohere

Cohere provides advanced Natural Language Processing (NLP) tools and Large Language Models (LLMs) through a user-friendly API. Its services cater to a wide range of clients, including businesses that want to improve content generation, summarization, and search. Cohere's business model focuses on offering scalable and affordable generative AI tools, generating revenue by granting API access to pre-trained models that handle tasks such as text classification, sentiment analysis, and semantic search in multiple languages. The platform is customizable, enabling businesses to build smarter and faster solutions, and its multilingual support helps overcome language barriers for international use.
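As a rough illustration of that API access, here is a sketch using the Cohere Python SDK to rerank candidate documents for a search query, one of the semantic-search tasks mentioned above. Method names, model identifiers, and response fields are assumptions and vary between SDK versions; consult the current SDK documentation before relying on them.

```python
# Illustrative sketch only: semantic reranking of candidate documents with the
# Cohere Python SDK (v1-style Client). The model name and response fields are
# assumptions; they differ across SDK and API versions.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

docs = [
    "Carson City is the capital city of the American state of Nevada.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands.",
    "Washington, D.C. is the capital of the United States.",
]

response = co.rerank(
    model="rerank-multilingual-v3.0",  # assumed model identifier
    query="What is the capital of the United States?",
    documents=docs,
    top_n=2,
)

for result in response.results:
    print(result.index, result.relevance_score)
```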

Headquarters: Toronto, Canada
Year Founded: 2019
Total Funding: $914.4M
Company Stage: Series D
Industries: AI & Machine Learning
Employees: 501-1,000

Risks

Competitors such as Google and Microsoft may overshadow Cohere through more seamless integration with existing enterprise systems.
Reliance on Nvidia chips poses risks if supply-chain issues arise or Nvidia's strategic focus shifts.
The high cost of its AI data center project could strain financial resources if government funding is delayed.

Differentiation

Cohere's North platform outperforms Microsoft Copilot and Google Vertex AI on enterprise tasks.
The Rerank 3.5 model processes queries in over 100 languages, strengthening multilingual search.
The Command R7B model excels at retrieval-augmented generation (RAG), math, and coding, outperforming competitors such as Google's Gemma.

Upsides

Cohere's AI data center project positions it as a key player in Canadian AI.
North platform offers secure AI deployment for regulated industries, enhancing privacy-focused enterprise solutions.
Cohere's multilingual support breaks language barriers, expanding its global market reach.
