MediBeng

Synthetic Code-Switched Bengali-English Speech Conversations for Healthcare Applications

Dataset Information
Modalities
Texts, Audio, Medical, Speech
Languages
English, Bengali
Introduced
2025
License
CC-BY-4.0

Overview

MediBeng Dataset

The MediBeng dataset contains synthetic, code-switched Bengali-English dialogues set in clinical contexts. It is intended for training models for automatic speech recognition (ASR), text-to-speech (TTS), and machine translation in healthcare settings. The dataset is available under the CC-BY-4.0 license.

Dataset Details:

  • Number of Audio Files: 4800
  • Total Duration: 7.11 hours
  • Number of Speakers: 2 (1 Male, 1 Female)
  • Utterance Pitch Mean: 335 - 673 Hz
  • Utterance Pitch Standard Deviation: 210 - 493 Hz
  • Sampling Rate: 16000 Hz
  • Data Split: Train and Test
  • Duration Range: 3.71s - 6.98s
  • Languages: Code-switched Bengali-English
  • Total File Size: 324 MB
  • Speech Type: Medical-related
  • Data Type: Synthetic
  • Tasks: ASR, TTS, Machine Translation
  • Context: Clinical (Healthcare)
  • License: CC-BY-4.0
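
As a quick consistency check on the figures above, the average clip length can be derived from the file count and total duration. The snippet below is only an arithmetic sketch; the constant and function names are illustrative and are not part of the dataset.

```python
# Figures copied from the Dataset Details list above.
NUM_FILES = 4800
TOTAL_HOURS = 7.11
MIN_DURATION_S = 3.71
MAX_DURATION_S = 6.98


def mean_utterance_duration(total_hours: float, num_files: int) -> float:
    """Return the average clip length in seconds."""
    return total_hours * 3600.0 / num_files


mean_s = mean_utterance_duration(TOTAL_HOURS, NUM_FILES)
# The mean should fall inside the reported per-clip duration range.
within_range = MIN_DURATION_S <= mean_s <= MAX_DURATION_S
print(f"Mean utterance duration: {mean_s:.2f} s (within range: {within_range})")
```

The result is roughly 5.33 seconds per clip, which sits inside the reported 3.71 s to 6.98 s range.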

MediBeng Dataset Columns

  • audio: Synthetic speech audio of a code-switched Bengali-English clinical utterance.
  • text: Transcript of the code-switched Bengali-English utterance.
  • translation: English translation of the transcript.
  • speaker_name: Speaker's gender (e.g., Male, Female).
  • utterance_pitch_mean: Mean pitch of the utterance in Hertz (Hz).
  • utterance_pitch_std: Standard deviation of the utterance's pitch in Hertz (Hz).
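
The columns above could be accessed with the Hugging Face `datasets` library roughly as follows. This is a minimal sketch under several assumptions: the repository ID is taken from the citation URL, the split name `"train"` is inferred from the "Train and Test" entry and may differ, and `load_medibeng` and `missing_columns` are hypothetical helper names introduced here for illustration.

```python
from typing import Any, List, Mapping

# Column names as listed in the section above.
EXPECTED_COLUMNS = (
    "audio",
    "text",
    "translation",
    "speaker_name",
    "utterance_pitch_mean",
    "utterance_pitch_std",
)


def missing_columns(row: Mapping[str, Any]) -> List[str]:
    """Return the expected column names that are absent from a row."""
    return [name for name in EXPECTED_COLUMNS if name not in row]


def load_medibeng(split: str = "train"):
    """Download one split from the Hugging Face Hub (requires network access).

    Assumption: the split names are "train" and "test"; check the dataset
    page if this raises an error.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("pr0mila-gh0sh/MediBeng", split=split)


if __name__ == "__main__":
    dataset = load_medibeng("train")
    # Inspect column names only, which avoids decoding any audio.
    present = {name: None for name in dataset.column_names}
    print("Missing columns:", missing_columns(present))
```

Reading the `audio` column itself may require additional audio decoding dependencies, depending on the installed `datasets` version.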

Dataset Creation:

  1. Audio Collection: Conversations in Bengali-English for healthcare.
  2. Transcription: Code-switched sentences.
  3. Translation: English translations of the code-switched sentences.
  4. Feature Engineering: Calculating pitch features.
  5. Storage: Available in Parquet format on Hugging Face.
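
The source does not say which tool or method was used for step 4. As one possible illustration, the sketch below reduces a per-frame fundamental-frequency (F0) track to the two stored pitch features. It assumes the F0 values were already produced by some pitch tracker, that unvoiced frames are marked as 0 or NaN, and that the population standard deviation is used; `pitch_stats` is a hypothetical helper, not the dataset author's code.

```python
import math
from statistics import fmean, pstdev
from typing import Iterable, Tuple


def pitch_stats(f0_hz: Iterable[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of pitch over voiced frames, in Hz.

    Frames with a value of 0, a negative value, or NaN are treated as
    unvoiced and ignored. If no voiced frame exists, both values are NaN.
    """
    voiced = [f for f in f0_hz if f is not None and not math.isnan(f) and f > 0]
    if not voiced:
        return float("nan"), float("nan")
    return fmean(voiced), pstdev(voiced)


# Toy F0 track: three voiced frames surrounded by unvoiced ones.
frames = [0.0, 200.0, 220.0, float("nan"), 240.0, 0.0]
mean_hz, std_hz = pitch_stats(frames)
print(f"utterance_pitch_mean={mean_hz:.1f} Hz, utterance_pitch_std={std_hz:.1f} Hz")
```

Whether the original pipeline excluded unvoiced frames or used a sample rather than population standard deviation is not stated, so values computed this way may not match the stored columns exactly.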

Citation (BibTeX):

@misc{promila_ghosh_2025,
    author       = {Promila Ghosh},
    title        = {MediBeng (Revision b05b594)},
    year         = 2025,
    url          = {https://huggingface.co/datasets/pr0mila-gh0sh/MediBeng},
    doi          = {10.57967/hf/5187},
    publisher    = {Hugging Face}
}
