The CATT benchmark dataset comprises 742 sentences, which were scraped from an internet news source in 2023. It covers multiple …
AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds …
The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences …
LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of …
The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated …
The AlpacaEval set contains 805 instructions from self-instruct, open-assistant, vicuna, koala, and hh-rlhf. These were selected so that the AlpacaEval ranking …
The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train …
The quality of AI-generated images has rapidly increased, leading to concerns about authenticity and trustworthiness. CIFAKE is a dataset that …
The DFDC (Deepfake Detection Challenge) is a dataset for deepfake detection consisting of more than 100,000 videos. The DFDC dataset …
FaceForensics is a video dataset consisting of more than 500,000 frames containing faces from 1004 videos that can be used …
FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation …
FakeAVCeleb is a novel Audio-Video Deepfake dataset that contains not only deepfake videos but also the corresponding synthesized cloned audio. …
Localized Audio Visual DeepFake Dataset (LAV-DF). Paper: Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method …
The EMOTIC dataset, named after EMOTions In Context, is a database of images with people in real environments, annotated with …
1,000 songs have been selected from the Free Music Archive (FMA). The excerpts that were annotated are available in the same …
Fer2013 contains approximately 30,000 grayscale facial images of different expressions, with size restricted to 48×48, and the main labels of …
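Since Fer2013 ships as a single CSV rather than image files, a loader has to rebuild each face from its pixel string. A minimal sketch, assuming the commonly distributed fer2013.csv layout (columns `emotion`, `pixels`, `Usage`; the column names and label order are assumptions taken from the standard release, not from this catalog):

```python
import csv
import numpy as np

# Label order follows the usual Fer2013 coding (0=angry ... 6=neutral);
# verify against your copy of the CSV.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def load_fer2013(path="fer2013.csv"):
    images, labels = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # "pixels" holds 2304 space-separated grayscale values (48 * 48)
            face = np.array(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
            images.append(face)
            labels.append(int(row["emotion"]))
    return np.stack(images), np.array(labels)
```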
The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus …
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains …
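RAVDESS encodes its labels directly in each filename as seven two-digit fields (e.g. 03-01-06-01-02-01-12.wav). A hedged parsing sketch, assuming the field order and code tables documented in the dataset's Zenodo release:

```python
# Emotion codes per the RAVDESS documentation (assumed here; verify against
# the official release notes): 01=neutral ... 08=surprised.
EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}

def parse_ravdess_name(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = (
        int(part) for part in stem.split("-"))
    return {
        "vocal_channel": "speech" if channel == 1 else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "strong" if intensity == 2 else "normal",
        "actor": actor,
        "actor_sex": "male" if actor % 2 else "female",  # odd actor IDs are male
    }

# parse_ravdess_name("03-01-06-01-02-01-12.wav")["emotion"] -> "fearful"
```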
The SEED dataset contains subjects' EEG signals recorded while they were watching film clips. The film clips are carefully selected so …
The football keyword dataset (FKD) is a new Persian keyword spotting dataset collected through crowdsourcing. This dataset contains …
The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, …
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open …
The AliMeeting corpus consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as …
The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, …
VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
CN-Celeb is a large-scale speaker recognition dataset collected 'in the wild'. This dataset contains more than 130,000 utterances from 1,000 …
This is the forehead accelerometer variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for advancing …
This is the reference headset microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …
This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …
This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource …
This is the temple vibration pickup variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …
This is the throat microphone (laryngophone) variant of the VibraVox dataset. VibraVox aims at serving as a valuable resource for …
VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances …
We release the BERSt Dataset for various speech recognition tasks, including Automatic Speech Recognition (ASR) and Speech Emotion …
CREMA-D is an emotional multimodal actor data set of 7,442 original clips from 91 actors. These clips were from 48 …
The EMODB database is a freely available German emotional speech database created by the Institute of Communication Science, …
The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for …
LSSED is a challenging large-scale English dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in …
We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and …
A Russian dataset of emotional speech dialogues, assembled from ~3.5 hours of live speech by actors who voiced …
The database includes 3,000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio …
The DNS Challenge at INTERSPEECH 2020 was intended to promote collaborative research in single-channel speech enhancement aimed at maximizing the perceptual …
The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise …
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality …
Uses the same clean speech as VoiceBank+DEMAND but more noise types. Features much lower SNRs ([−10, −5, 0, 5, 10, 15, …
VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train …
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background …
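The pairing WHAM! describes boils down to scaling a noise clip so the resulting mixture lands at a chosen signal-to-noise ratio. A minimal sketch of that operation, assuming single-channel numpy arrays at the same sample rate (the SNR handling is illustrative, not WHAM!'s exact recipe):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the result has the target SNR in dB."""
    noise = noise[: len(speech)]              # trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10     # guard against silent noise
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```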
AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin. Source: [AISHELL-1: An Open-Source Mandarin …
AISHELL-2 contains 1,000 hours of clean read-speech data recorded on iOS devices and is free for academic usage. Source: [AISHELL-2: Transforming Mandarin ASR …
Common Voice is an audio dataset that consists of unique MP3 files, each with a corresponding text file. There are 9,283 recorded …
GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, …
This noisy speech test set is created from Google Speech Commands v2 [1] and the Musan dataset [2]. It could …
Continuous speech separation (CSS) is an approach to handling overlapped speech in conversational audio signals. A real recorded dataset, called …
MediaSpeech is a media speech dataset (you might have guessed this) built with the purpose of testing Automated Speech Recognition …
Spoken Language Understanding Evaluation (SLUE) is a suite of benchmark tasks for spoken language understanding evaluation. It consists of limited-size …
SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours …
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
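For experimentation, torchaudio bundles a ready-made wrapper for this dataset. A short hedged sketch; the tuple layout follows torchaudio's documented return signature, so verify it against your installed version:

```python
import torchaudio

# Downloads and indexes the Speech Commands corpus under ./data.
dataset = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)

# Each item is (waveform, sample_rate, label, speaker_id, utterance_number).
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate, waveform.shape)  # e.g. a one-second 16 kHz clip
```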
The TED-LIUM corpus consists of English-language TED talks. It includes transcriptions of these talks. The audio is sampled at 16kHz. …
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists …
Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test). Count of …
We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled …
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labelled speech, and about …
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech …
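At its core, the reverberation step is a convolution of dry speech with a room impulse response before the noise is mixed in. A hedged sketch under that reading; `rir` here is any pre-computed impulse response array, whereas WHAMR!'s own RIRs are simulated with specific room parameters not shown:

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry_speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry speech with a room impulse response, keeping the length."""
    wet = fftconvolve(dry_speech, rir)[: len(dry_speech)]
    return wet / (np.max(np.abs(wet)) + 1e-10)  # renormalize to avoid clipping
```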
WSJ0-2mix is a speech separation corpus of speech mixtures using utterances from the Wall Street Journal (WSJ0) corpus. Source: [Deep …
The English data for voice building was obtained, prepared, and provided to the challenge by Lessac Technologies Inc., having originally …
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from …
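LJ Speech releases pair those clips with a pipe-delimited metadata.csv (clip id, raw transcription, normalized transcription). A minimal loading sketch, assuming that three-field layout and the standard LJSpeech-1.1 folder structure:

```python
def load_ljspeech_metadata(path="LJSpeech-1.1/metadata.csv"):
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on at most two pipes, in case a transcript contains the character.
            clip_id, raw, normalized = line.rstrip("\n").split("|", 2)
            entries.append({"wav": f"wavs/{clip_id}.wav",
                            "text": normalized or raw})  # prefer the normalized form
    return entries
```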
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24kHz sampling rate, prepared by …
CVSS is a massively multilingual-to-English speech to speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into …
The Taiwanese Across Taiwan (TAT) corpus is a large-scale database of native Taiwanese article/reading speech collected across Taiwan. This corpus contains …
Fluent Speech Commands is an open source audio dataset for spoken language understanding (SLU) experiments. Each utterance is labeled with …
The SmartLights benchmark from Snips tests the capability of controlling lights in different rooms. It consists of 1,660 requests which are …
The SmartSpeaker benchmark tests the performance of reacting to music player commands in English as well as in French. It …
In SpokenSQuAD, the document is in spoken form, the input question is in text form, and the answer …
Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. …
CNN/Daily Mail is a dataset for text summarization. Human-generated abstractive summary bullets were created from news stories in CNN …
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, …
CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs …
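A CSL (Circular Skip Link) graph is a cycle in which every node also links to the node a fixed skip R away, which is exactly a circulant graph. A hedged construction sketch with networkx; n=41 follows Murphy et al. (2019), but treat the exact parameters as assumptions:

```python
import networkx as nx

def csl_graph(n: int = 41, skip: int = 2) -> nx.Graph:
    """Cycle of n nodes where node i also connects to i +/- skip (mod n)."""
    return nx.circulant_graph(n, [1, skip])

g = csl_graph(skip=3)
print(g.number_of_nodes(), g.number_of_edges())  # 41 nodes, 82 edges (4-regular)
```

Because two CSL graphs with different skips are regular and locally identical, telling them apart is what stresses a GNN's expressivity, per the dataset's stated purpose.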
CommonGen is constructed through a combination of crowdsourced and existing caption corpora, and consists of 79k commonsense descriptions over 35k unique …
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It …
DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different …
DailyDialog is a high-quality multi-turn open-domain English dialog dataset. It contains 13,118 dialogues split into a training set with 11,118 …
As part of our research efforts toward making LLMs safer for public …
LCSTS is a large-scale Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which …
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit …
ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday …
ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. The dataset consists of …
The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in …
The Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker …
The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about …