Modalities
Texts, Audio, Medical, Speech
Languages
English, Bengali
MediBeng Dataset
The MediBeng dataset contains synthetic code-switched Bengali-English dialogues for training automatic speech recognition (ASR), text-to-speech (TTS), and machine translation models in clinical settings. It is available under the CC-BY-4.0 license.
Dataset Details:
- Number of Audio Files: 4800
- Total Duration: 7.11 hours
- Number of Speakers: 2 (1 Male, 1 Female)
- Utterance Pitch Mean: 335 - 673 Hz
- Utterance Pitch Standard Deviation: 210 - 493 Hz
- Sampling Rate: 16000 Hz
- Data Split: Train and Test
- Duration Range: 3.71 s - 6.98 s per utterance
- Languages: Bengali and English (code-mixed)
- Total File Size: 324 MB
- Speech Type: Medical-related
- Data Type: Synthetic
- Tasks: ASR, TTS, Machine Translation
- Context: Clinical (Healthcare)
- License: CC-BY-4.0
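The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed in this card; the helper names are hypothetical and are not part of any MediBeng tooling.

```python
# Figures copied from the dataset details above.
NUM_FILES = 4800
TOTAL_HOURS = 7.11
MIN_DURATION_S = 3.71
MAX_DURATION_S = 6.98
TOTAL_SIZE_MB = 324


def mean_duration_seconds(total_hours, num_files):
    """Average utterance length implied by the total duration and file count."""
    return total_hours * 3600 / num_files


def duration_is_consistent(total_hours, num_files, min_s, max_s):
    """Check that the total duration lies between the extremes that the
    per-utterance duration range allows for this many files."""
    total_s = total_hours * 3600
    return num_files * min_s <= total_s <= num_files * max_s


mean_s = mean_duration_seconds(TOTAL_HOURS, NUM_FILES)
consistent = duration_is_consistent(
    TOTAL_HOURS, NUM_FILES, MIN_DURATION_S, MAX_DURATION_S
)
mean_size_kb = TOTAL_SIZE_MB * 1000 / NUM_FILES  # decimal units assumed

print(f"mean utterance duration: {mean_s:.2f} s")
print(f"total duration consistent with range: {consistent}")
print(f"mean size per file: {mean_size_kb:.1f} KB")
```

The implied average of roughly 5.3 seconds per utterance sits inside the stated duration range, so the file count, total duration, and range agree with one another.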
MediBeng Dataset Columns
- audio: Synthetic audio of a Bengali-English clinical conversation.
- text: Code-switched Bengali-English transcription of the audio.
- translation: English translation of the transcription.
- speaker_name: Speaker's gender (e.g., Male, Female).
- utterance_pitch_mean: Mean pitch of the audio in Hertz (Hz).
- utterance_pitch_std: Pitch variation (standard deviation in Hertz).
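As an illustration of how these columns might be accessed, here is a minimal sketch using the Hugging Face `datasets` library. The repository ID is taken from the citation URL in this card; the split names `"train"` and `"test"` are an assumption based on the "Train and Test" entry, and the helper functions are hypothetical.

```python
# Column names as listed in this card.
EXPECTED_COLUMNS = (
    "audio",
    "text",
    "translation",
    "speaker_name",
    "utterance_pitch_mean",
    "utterance_pitch_std",
)

# Repository ID taken from the citation URL.
REPO_ID = "pr0mila-gh0sh/MediBeng"


def missing_columns(column_names):
    """Return the expected columns that are absent from column_names."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_medibeng(split="train"):
    """Load one split and verify that it has the columns described above.

    Assumptions: the third-party `datasets` package is installed, network
    access is available, and the split is named "train" or "test".
    """
    from datasets import load_dataset  # imported lazily; third-party

    dataset = load_dataset(REPO_ID, split=split)
    absent = missing_columns(dataset.column_names)
    if absent:
        raise ValueError(f"Missing expected columns: {absent}")
    return dataset
```

Calling `load_medibeng("train")` would download the split on first use; a row's `text` and `translation` fields can then be read as ordinary strings.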
Dataset Creation:
- Audio Collection: Conversations in Bengali-English for healthcare.
- Transcription: Code-switched sentences.
- Translation: English translations of the code-switched sentences.
- Feature Engineering: Computing pitch features (per-utterance mean and standard deviation).
- Storage: Available in Parquet format on Hugging Face.
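The card does not say how the pitch features were computed. As a rough sketch of what such a step could look like, the function below summarizes a sequence of per-frame fundamental-frequency estimates into a mean and a standard deviation. The input format (one F0 value in Hz per frame, with `None` or `0` marking unvoiced frames) and the use of the population standard deviation are assumptions made for illustration, not the author's documented method.

```python
import statistics


def summarize_pitch(f0_hz):
    """Summarize per-frame F0 estimates (Hz) for one utterance.

    Unvoiced frames, marked here as None or non-positive values (an
    assumption), are ignored. Returns None when no voiced frame remains.
    """
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if not voiced:
        return None
    return {
        "utterance_pitch_mean": statistics.fmean(voiced),
        # Population standard deviation; the card does not say which
        # variant was used.
        "utterance_pitch_std": statistics.pstdev(voiced),
    }


# Hypothetical frame-level estimates for a short utterance.
example = summarize_pitch([200.0, 0.0, 220.0, None, 240.0])
print(example)
```

A real pipeline would first need a pitch tracker to produce the per-frame estimates; that step is left out here because the card does not name one.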
Citation:
@misc{promila_ghosh_2025,
  author    = {Promila Ghosh},
  title     = {MediBeng (Revision b05b594)},
  year      = 2025,
  url       = {https://huggingface.co/datasets/pr0mila-gh0sh/MediBeng},
  doi       = {10.57967/hf/5187},
  publisher = {Hugging Face}
}