Research Scientist – Speech and Audio Understanding (Large Models & Multimodal Systems) at Tencent

Bellevue, Washington, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: Technology, Artificial Intelligence

Requirements

  • Ph.D. in Computer Science, Electrical Engineering, Artificial Intelligence, Linguistics, or a related field; or Master’s degree with several years of relevant experience
  • Solid understanding of speech and audio signal processing, acoustic modeling, language modeling, and large model architectures
  • Proficient in one or more core speech system development pipelines such as ASR, TTS, or speech translation; experience with multilingual, multitask, or end-to-end systems is a plus
  • Proficient in deep learning frameworks such as PyTorch or TensorFlow; experience with large-scale training and distributed systems is a plus
  • Familiar with Transformer-based architectures and their applications in speech and multimodal training/inference
  • In-depth research or practical experience in speech representation pretraining (e.g., HuBERT, Wav2Vec, Whisper) strongly preferred
  • Experience in multimodal alignment and cross-modal modeling (e.g., audio-visual-text) strongly preferred
  • Experience driving state-of-the-art (SOTA) performance on audio understanding tasks with large models strongly preferred

Responsibilities

  • Develop general-purpose, end-to-end large speech models covering multilingual automatic speech recognition (ASR), speech translation, speech synthesis, paralinguistic understanding, and general audio understanding
  • Advance research on speech representation learning and encoder/decoder architectures to build unified acoustic representations for multi-task and multimodal applications
  • Explore representation alignment and fusion mechanisms between audio/speech and other modalities in large multimodal models, enabling joint modeling with image and text
  • Build and maintain high-quality multimodal speech datasets, including automatic annotation and data synthesis technologies

Skills

Key technologies and capabilities for this role

PyTorch, TensorFlow, ASR, TTS, Speech Translation, HuBERT, Wav2Vec, Whisper, Speech Representation Learning, Multimodal Alignment, Acoustic Modeling, Language Modeling, Deep Learning, End-to-End Systems

Questions & Answers

Common questions about this position

What is the salary range for this Research Scientist position?

The expected base pay range is $141,480.00 to $265,200.00 per year. Actual pay may vary depending on job-related knowledge, skills, and experience.

What is the location for this role?

The position is located in Bellevue, Washington, United States.

What education and skills are required for this position?

The role requires a Ph.D. in Computer Science, Electrical Engineering, Artificial Intelligence, Linguistics, or a related field, or a Master’s degree with several years of relevant experience. Candidates need a solid understanding of speech and audio signal processing, acoustic modeling, language modeling, and large model architectures, plus proficiency in one or more core speech system development pipelines such as ASR, TTS, or speech translation. Proficiency in a deep learning framework such as PyTorch or TensorFlow is also required.

What benefits are offered for this role?

Employees may be eligible for a sign-on payment, relocation package, and restricted stock units on a case-by-case basis, plus medical, dental, vision, life and disability benefits, 401(k) plan participation, 15-25 days of vacation, up to 13 holidays, and up to 10 days of paid sick leave.

What experience makes a candidate stand out for this role?

Strongly preferred experience includes speech representation pretraining (e.g., HuBERT, Wav2Vec, Whisper), multimodal alignment and cross-modal modeling (e.g., audio-visual-text), and driving SOTA performance on audio understanding tasks with large models. Experience with multilingual, multitask, or end-to-end systems, and with large-scale training and distributed systems, is a plus.

Tencent

Internet platform for social, gaming, fintech

About Tencent

Tencent is a technology company that focuses on enhancing the daily lives of internet users and assisting businesses in their digital transformation. It operates in various sectors, including social networking, entertainment, fintech, and cloud computing. Tencent's main products include WeChat, a messaging and mobile payment app with over a billion users, and Tencent Games, which produces popular video games like Honor of Kings and PUBG Mobile. The company generates revenue through online advertising, subscription services, in-app purchases, mobile payments, and cloud services. Unlike many competitors, Tencent has a diverse business model that allows it to serve both individual users and enterprises effectively. The goal of Tencent is to enrich user experiences and support businesses in their digital journeys.

Headquarters: Shenzhen, China
Year Founded: 1998
Total Funding: $31.5M
Company Stage: IPO
Industries: Consumer Software, Enterprise Software, Fintech, AI & Machine Learning, Gaming
Employees: 10,001+

Benefits

Professional Development Budget

Risks

Tencent's addition to the US blacklist may affect its operations and partnerships.
Developing a mobile version of Call of Duty may lead to competitive tensions with Microsoft.
Investment in blockchain exposes Tencent to volatile regulatory environments.

Differentiation

Tencent's WeChat app integrates messaging, social media, and mobile payments seamlessly.
Tencent Games is a global leader with popular titles like Honor of Kings and PUBG Mobile.
Tencent Cloud offers scalable solutions for businesses, enhancing digital transformation efforts.

Upsides

Tencent's investment in blockchain technology could enhance its fintech and cloud services.
The Hunyuan-Large language model advances Tencent's AI capabilities in social networking and gaming.
Collaboration with DYXnet on AI solutions opens new avenues in digital transformation services.
