Cohere

Member of Technical Staff, Pre-Training Data Engineer

Toronto, Ontario, Canada

Compensation: Not Specified
Experience Level: Junior (1 to 2 years)
Job Type: Full Time
Visa: Unknown
Industries: Artificial Intelligence, AI & Machine Learning, Data Engineering, Natural Language Processing

Job Description

Employment Type: Full-Time
Location Type: Remote
Salary: (Not specified)

Who are we?

Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future!

Why this role?

As a Pre-Training Data Engineer, you will play a pivotal role in developing the data infrastructure that underpins Cohere’s advanced language models. Your responsibilities will encompass the end-to-end management of training data, including ingestion, cleaning, filtering, and optimization, as well as data modeling to ensure datasets are structured and formatted for optimal model performance. You will work with diverse data sources—such as web data, code data, multilingual corpora, and synthetic data—to ensure their quality, diversity, and reliability.

In this role, you will design and implement scalable, robust pipelines for data processing, conduct data ablations to evaluate quality, and experiment with data mixtures to enhance model performance. By combining research and engineering, you will bridge the gap between raw data and cutting-edge AI models, directly contributing to improvements in critical training metrics like throughput and accelerator utilization.

Your work will be essential to Cohere’s mission of delivering efficient and reliable language understanding and generation capabilities, driving innovation in natural language processing. If you are passionate about transforming data into the foundation of AI systems, this role offers a unique opportunity to make a meaningful impact.

Please Note: We have offices in London, Paris, Toronto, Ottawa, San Francisco and New York but also embrace being remote-friendly! There are no restrictions on where you can be located for this role.

As a Data Engineer in the Pre-Training team, you will:

  • Design and build scalable data pipelines to ingest, clean, filter, and optimize diverse datasets, including web data, code data, multilingual corpora, and synthetic data.
  • Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance.
  • Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency.
  • Research and implement innovative data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing.
  • Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.

You may be a good fit if you have:

  • Strong software engineering skills, with proficiency in Python and experience building data pipelines.
  • Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
  • Experience working with large-scale datasets, including web data, code data, and multilingual corpora.
  • Knowledge of data quality assessment techniques and experimentation with data mixtures.
  • A passion for bridging research and engineering to solve complex data-related challenges in AI model training.

Bonus: Papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply! If you want to work really hard on a...

Skills

Data Infrastructure
Data Ingestion
Data Cleaning
Data Filtering
Data Optimization
Data Modeling
Pipeline Development
Data Ablation
Data Mixture Experimentation
Data Quality Assurance
Multilingual Data
Synthetic Data
Scalable Data Processing
Research and Engineering
Model Performance Optimization
Throughput
Accelerator Utilization

Cohere

Provides NLP tools and LLMs via API

About Cohere

Cohere provides advanced Natural Language Processing (NLP) tools and Large Language Models (LLMs) through a user-friendly API. Their services cater to a wide range of clients, including businesses that want to improve their content generation, summarization, and search functions. Cohere's business model focuses on offering scalable and affordable generative AI tools, generating revenue by granting API access to pre-trained models that can handle tasks like text classification, sentiment analysis, and semantic search in multiple languages. The platform is customizable, enabling businesses to create smarter and faster solutions. With multilingual support, Cohere effectively addresses language barriers, making it suitable for international use.

Headquarters: Toronto, Canada
Year Founded: 2019
Total Funding: $914.4M
Company Stage: Series D
Industries: AI & Machine Learning
Employees: 501-1,000

Risks

Competitors like Google and Microsoft may overshadow Cohere with seamless enterprise system integration.
Reliance on Nvidia chips poses risks if supply chain issues arise or strategic focus shifts.
The high cost of Cohere's AI data center could strain financial resources if government funding is delayed.

Differentiation

Cohere's North platform outperforms Microsoft Copilot and Google Vertex AI in enterprise functions.
Rerank 3.5 model processes queries in over 100 languages, enhancing multilingual search capabilities.
Command R7B model excels in RAG, math, and coding, outperforming competitors like Google's Gemma.

Upsides

Cohere's AI data center project positions it as a key player in Canadian AI.
North platform offers secure AI deployment for regulated industries, enhancing privacy-focused enterprise solutions.
Cohere's multilingual support breaks language barriers, expanding its global market reach.
