AudioCaps is a dataset of sounds paired with event descriptions, introduced for the task of audio captioning. The sounds are sourced from the AudioSet dataset. Annotators were given the audio tracks together with category hints, plus video hints when needed.
Source: Audio Retrieval with Natural Language Queries
Image source: https://audiocaps.github.io/
Variants: AudioCaps
This dataset is used in 4 benchmarks.