MUSIC-AVQA

Dataset Information
Modalities: Videos, Audio
Languages: English
Introduced: 2022
License: MIT

Overview

MUSIC-AVQA is a large-scale dataset of musical performances. It contains 45,867 question-answer pairs spread across 9,288 videos, totaling over 150 hours. The QA pairs are divided into 3 modal scenarios, which together cover 9 question types and 33 question templates. Although the AVQA task is posed as an open-ended problem, answers are selected from a fixed set of 42 possible answers.
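Because answers come from a fixed set, answering can be treated as classification over that set. The following is a minimal sketch of that idea, not the dataset's official tooling: the record fields (`question`, `answer`), the sample values, and the helper names `build_answer_vocab` and `accuracy` are all hypothetical and chosen only for illustration.

```python
# Hypothetical QA records. The field names and values are illustrative only
# and do not reflect the dataset's actual annotation schema.
samples = [
    {"question": "How many instruments are sounding?", "answer": "two"},
    {"question": "Is the violin playing throughout?", "answer": "yes"},
    {"question": "Which instrument sounds first?", "answer": "piano"},
    {"question": "Is there a guitar in the video?", "answer": "yes"},
]


def build_answer_vocab(records):
    """Map each distinct answer string to a class index (sorted for determinism)."""
    answers = sorted({r["answer"] for r in records})
    return {answer: index for index, answer in enumerate(answers)}


def accuracy(predicted, gold):
    """Fraction of positions where the predicted class index matches the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Build the closed answer set and encode each gold answer as a class index.
answer_to_idx = build_answer_vocab(samples)
labels = [answer_to_idx[r["answer"]] for r in samples]

# Made-up model outputs, used only to exercise the metric.
predictions = [1, 2, 0, 0]
score = accuracy(predictions, labels)
```

On the full dataset, a vocabulary built this way would be expected to contain the 42 answers mentioned above, so a model's output layer would have 42 classes.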

Variants: MUSIC-AVQA

Associated Benchmarks

This dataset is used in 1 benchmark: Audio-visual Question Answering

Recent Benchmark Submissions

Task | Model | Paper | Date
Audio-visual Question Answering | CAD | CAD -- Contextual Multi-modal Alignment … | 2023-10-25
Audio-visual Question Answering | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Audio-visual Question Answering | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Audio-visual Question Answering | LAVISH | Vision Transformers are Parameter-Efficient Audio-Visual … | 2022-12-15
Audio-visual Question Answering | ST-AVQA | Learning to Answer Questions in … | 2022-03-26

Research Papers

Recent papers with results on this dataset: