RealMAN

A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

Dataset Information
Modalities: Audio, Speech
Languages: English, Chinese
Introduced: 2024
License: Unknown

Overview

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset, which provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization:

  • Microphone array: Recordings are made with a 32-channel array of high-fidelity microphones.
  • Speech source: A loudspeaker plays the source speech signals (about 35 hours of Mandarin speech).
  • Recording duration and scenes: A total of 83.7 hours of speech (about 48.3 hours with a static speaker and 35.4 hours with a moving speaker) is recorded in 32 different scenes, and 144.5 hours of background noise are recorded in 31 different scenes. Both the speech and the noise scenes cover common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks.
  • Annotation: For the localization task, the speaker location is annotated by automatically detecting the loudspeaker in images from an omni-directional fisheye camera. For the speech enhancement task, the target clean speech is the direct-path signal, obtained by filtering the source speech signal with an estimated direct-path propagation filter.
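
The direct-path target described in the annotation item can be illustrated with a minimal sketch. This is not the dataset authors' estimation procedure, which this page does not describe. It assumes a deliberately simplified, hypothetical direct-path filter made of a single integer-sample delay and a 1/distance gain, and the function names, the 16 kHz sampling rate and the 3.43 m distance are illustrative assumptions only.

```python
# Sketch only: the filter model below (pure delay + 1/distance gain) is an
# assumption for illustration, not the estimated filter used for RealMAN.

def delay_gain_filter(distance_m, fs_hz, speed_of_sound=343.0):
    """Build hypothetical direct-path FIR taps: one delayed, attenuated impulse."""
    delay = round(distance_m / speed_of_sound * fs_hz)  # propagation delay in samples
    taps = [0.0] * (delay + 1)
    taps[delay] = 1.0 / distance_m  # spherical-spreading attenuation
    return taps

def fir_filter(signal, taps):
    """Convolve `signal` with `taps`, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k < 0:
                break  # no earlier input samples to use
            acc += h * signal[n - k]
        out[n] = acc
    return out

# Toy source signal (three non-zero samples followed by silence).
source = [1.0, 0.5, -0.25] + [0.0] * 200
taps = delay_gain_filter(distance_m=3.43, fs_hz=16000)  # assumed geometry and rate
target = fir_filter(source, taps)  # "clean" direct-path target for this toy case
```

In this toy case the target is a delayed and attenuated copy of the source. A real direct-path filter would also have to account for the loudspeaker and microphone responses, which is presumably why the dataset estimates it rather than computing it from geometry alone.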

Variants: RealMAN

Associated Benchmarks

This dataset is used in 2 benchmarks.

Recent Benchmark Submissions

Task | Model | Paper | Date
Speech Enhancement | CleanMel-L-map | CleanMel: Mel-Spectrogram Enhancement for Improving … | 2025-02-27
Automatic Speech Recognition (ASR) | CleanMel-L-mask | CleanMel: Mel-Spectrogram Enhancement for Improving … | 2025-02-27

Research Papers

Recent papers with results on this dataset: