Clotho

Dataset Information
Modalities: Texts, Audio
Languages: English
Introduced: 2019

Overview

Clotho is an audio captioning dataset of 4981 audio samples, each with five captions (24 905 captions in total). Audio samples are 15 to 30 s long, and captions are 8 to 20 words long.

Source: https://zenodo.org/record/3490684
Image Source: https://arxiv.org/abs/1910.09387
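
The figures in the overview can be turned into a small sanity check. The following is a minimal sketch, not the dataset's official loader: the `ClothoItem` record, its field names, and the `check_item` helper are hypothetical, and the word count assumes simple whitespace splitting. Only the numeric bounds (4981 samples, five captions each, 15 to 30 s, 8 to 20 words) come from the description above.

```python
from dataclasses import dataclass
from typing import List

# Figures taken from the dataset description.
NUM_SAMPLES = 4981
CAPTIONS_PER_SAMPLE = 5
MIN_DURATION_S, MAX_DURATION_S = 15.0, 30.0
MIN_WORDS, MAX_WORDS = 8, 20


@dataclass
class ClothoItem:
    """Hypothetical in-memory record for one audio sample (not an official schema)."""
    file_name: str
    duration_s: float
    captions: List[str]


def check_item(item: ClothoItem) -> List[str]:
    """Return a list of human-readable problems; an empty list means the item fits the description."""
    problems = []
    if len(item.captions) != CAPTIONS_PER_SAMPLE:
        problems.append(
            f"expected {CAPTIONS_PER_SAMPLE} captions, got {len(item.captions)}"
        )
    if not MIN_DURATION_S <= item.duration_s <= MAX_DURATION_S:
        problems.append(f"duration {item.duration_s} s is outside 15 to 30 s")
    for caption in item.captions:
        # Assumption: words are separated by whitespace.
        n_words = len(caption.split())
        if not MIN_WORDS <= n_words <= MAX_WORDS:
            problems.append(f"caption has {n_words} words: {caption!r}")
    return problems


# Total caption count implied by the description.
total_captions = NUM_SAMPLES * CAPTIONS_PER_SAMPLE
```

Running `check_item` over each record after loading would flag entries that fall outside the stated ranges, which can help catch parsing mistakes early.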

Variants: Clotho

Associated Benchmarks

This dataset is used in 2 benchmarks:

Recent Benchmark Submissions

Task | Model | Paper | Date
Audio captioning | MQ-Cap | Enhancing Retrieval-Augmented Audio Captioning with … | 2024-10-14
Audio captioning | SLAM-AAC | SLAM-AAC: Enhancing Audio Captioning with … | 2024-10-12
Text to Audio Retrieval | PaSST-RoBERTa & Estimated Audio–Caption Correspondences | Estimated Audio-Caption Correspondences Improve Language-Based … | 2024-08-21
Audio captioning | LOAE | Enhancing Automated Audio Captioning via … | 2024-06-19
Text to Audio Retrieval | InternVideo2-6B | InternVideo2: Scaling Foundation Models for … | 2024-03-22
Audio captioning | Audio Flamingo (Pengi trainset) | Audio Flamingo: A Novel Audio … | 2024-02-02
Audio captioning | Qwen-Audio | Qwen-Audio: Advancing Universal Audio Understanding … | 2023-11-14
Text to Audio Retrieval | PaSST–RoBERTa & GPT-augment | Advancing Natural-Language Based Audio Retrieval … | 2023-08-08
Audio captioning | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Text to Audio Retrieval | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Text to Audio Retrieval | ONE-PEACE | ONE-PEACE: Exploring One General Representation … | 2023-05-18
Text to Audio Retrieval | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Audio captioning | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Text to Audio Retrieval | CE (pretraining:SoundDescs) | Audio Retrieval with Natural Language … | 2021-12-17
Text to Audio Retrieval | MMT | Audio Retrieval with Natural Language … | 2021-12-17
Text to Audio Retrieval | MoEE | Audio Retrieval with Natural Language … | 2021-05-05
Text to Audio Retrieval | CE (pretraining:AudioCaps) | Audio Retrieval with Natural Language … | 2021-05-05
Text to Audio Retrieval | MoEE (pretraining:AudioCaps) | Audio Retrieval with Natural Language … | 2021-05-05
Text to Audio Retrieval | CE | Audio Retrieval with Natural Language … | 2021-05-05
Audio captioning | Ensemble | The NTT DCASE2020 Challenge Task … | 2020-07-01

Research Papers

Recent papers with results on this dataset: