
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei (2023)

Paper Information
arXiv ID
2301.02111
Venue
IEEE Transactions on Audio, Speech, and Language Processing
Domain
natural language processing, speech synthesis, machine learning
SOTA Claim
Yes
Reproducibility
7/10

Abstract

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than that used by existing systems. VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

Figure 1: The overview of VALL-E. Unlike the previous pipeline (e.g., phoneme → mel-spectrogram → waveform), the pipeline of VALL-E is phoneme → discrete code → waveform. VALL-E generates the discrete audio codec codes based on phoneme and acoustic code prompts, corresponding to the target content and the speaker's voice. VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation combined with other generative AI models like GPT-3 [Brown et al., 2020].
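The figure caption describes an inference flow of phoneme prompt plus acoustic prompt → discrete codec codes → waveform. The sketch below illustrates that flow structurally; the `phonemizer`, `codec`, `ar_model`, and `nar_model` objects and their methods are hypothetical placeholders standing in for the paper's components, not the released implementation.

```python
# Structural sketch of the VALL-E inference flow (phoneme -> discrete codec
# codes -> waveform). All model objects passed in are hypothetical placeholders.

import torch

def synthesize(text, prompt_text, prompt_wav, phonemizer, codec, ar_model, nar_model):
    # 1) Build the two prompts: a phoneme prompt (prompt transcript + target text)
    #    and an acoustic prompt (codec codes of the 3-second enrolled recording).
    phonemes = phonemizer(prompt_text + " " + text)
    prompt_codes = codec.encode(prompt_wav)           # shape: (n_quantizers, T_prompt)

    # 2) Autoregressive stage: generate first-quantizer codes for the target,
    #    conditioned on the phonemes and the first-quantizer prompt codes.
    first_layer = ar_model.generate(phonemes, prompt_codes[0])

    # 3) Non-autoregressive stage: fill in the remaining quantizer layers, each
    #    conditioned on the phonemes, the full acoustic prompt, and the layers
    #    predicted so far.
    layers = [first_layer]
    for q in range(1, prompt_codes.shape[0]):
        layers.append(nar_model.predict(phonemes, prompt_codes, layers, layer=q))

    # 4) Decode the stacked discrete codes back into a waveform with the codec decoder.
    codes = torch.stack(layers)                       # shape: (n_quantizers, T_target)
    return codec.decode(codes)
```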

Summary

This paper presents VALL-E, a novel language model for text-to-speech (TTS) synthesis that approaches the task as a conditional language modeling problem. VALL-E is trained on 60K hours of English speech, significantly more data than existing TTS systems typically use. The model generates high-quality, personalized speech from only a 3-second audio prompt of an unseen speaker, showcasing its capability for zero-shot TTS. Experimental results demonstrate that VALL-E surpasses previous state-of-the-art TTS systems in terms of speech naturalness and speaker similarity, while preserving the acoustic environment and emotional tone of the input prompt. The study introduces the novel use of discrete audio codec codes as an intermediate representation in the creation of synthesized speech, leveraging a hierarchical autoregressive and non-autoregressive modeling strategy.
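The hierarchical strategy factorizes the distribution over codec codes into an autoregressive model for the first quantizer layer and a non-autoregressive model for the remaining layers. The factorization below paraphrases the paper's formulation (notation is ours: $x$ is the phoneme sequence, $\tilde{C}$ the acoustic-prompt codes, and $c_{t,j}$ the code at frame $t$ and quantizer layer $j$, with 8 layers):

```latex
% Paraphrased two-stage factorization over 8 quantizer layers:
% the AR model generates layer 1 frame by frame; the NAR model
% predicts each remaining layer in a single pass.
p(C \mid x, \tilde{C}) \;=\;
  \underbrace{\prod_{t} p\!\left(c_{t,1} \mid c_{<t,1},\, x,\, \tilde{C}\right)}_{\text{autoregressive, layer 1}}
  \;\cdot\;
  \underbrace{\prod_{j=2}^{8} p\!\left(c_{:,j} \mid c_{:,<j},\, x,\, \tilde{C}\right)}_{\text{non-autoregressive, layers 2--8}}
```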

Methods

This paper employs the following methods:

  • Conditional Language Modeling
  • Hierarchical Autoregressive & Non-Autoregressive Modeling

Models Used

  • VALL-E
  • WavLM-TDNN

Datasets

The following datasets were used in this research:

  • LibriLight
  • LibriSpeech
  • VCTK

Evaluation Metrics

  • CMOS (comparative mean opinion score)
  • SMOS (similarity mean opinion score)
  • Word Error Rate (WER)
  • Equal Error Rate (EER) (a WER/EER sketch follows this list)
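WER is measured by transcribing the synthesized speech with an ASR model, and speaker similarity is summarized as the EER of a speaker-verification model (the paper uses WavLM-TDNN). Below is a minimal, self-contained sketch of both metric computations, assuming transcriptions and similarity scores have already been produced; all inputs in the usage lines are illustrative, not results from the paper.

```python
import numpy as np

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / max(len(r), 1)

def eer(target_scores: np.ndarray, nontarget_scores: np.ndarray) -> float:
    """Equal error rate: the point where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Illustrative usage with made-up transcripts and similarity scores:
print(wer("the cat sat on the mat", "the cat sat on mat"))                # one deletion -> ~0.167
print(eer(np.array([0.82, 0.90, 0.75]), np.array([0.40, 0.55, 0.30])))    # well-separated scores -> 0.0
```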

Results

  • VALL-E significantly improves speech naturalness and speaker similarity over state-of-the-art models.
  • Maintains acoustic environment and emotional tone of input prompts.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: 16
  • GPU Type: NVIDIA Tesla V100 32GB

Keywords

neural codec language model, zero-shot text-to-speech, speech synthesis, discrete codes

Papers Using Similar Methods

External Resources