MUSIC-AVQA is a large-scale dataset of musical performances containing 45,867 question-answer pairs spread across 9,288 videos that total over 150 hours. The QA pairs are divided into 3 modal scenarios, which together cover 9 question types and 33 question templates. Although the AVQA task is posed as an open-ended problem, answering is treated as selection from a fixed set made up of all 42 possible answers.
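Because answers come from a closed set, the task can be handled as classification over an answer vocabulary. The following is a minimal sketch of that idea, not the dataset's official tooling: the helper names `build_answer_vocab` and `accuracy` and the toy answer strings are assumptions made for illustration, and the real vocabulary would contain the dataset's 42 answers.

```python
def build_answer_vocab(answers):
    """Map each distinct answer string to a class index (sorted for determinism)."""
    return {ans: idx for idx, ans in enumerate(sorted(set(answers)))}


def accuracy(predicted, gold):
    """Fraction of questions whose predicted answer exactly matches the gold answer."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy answers, invented for this example only.
gold = ["yes", "no", "two", "violin", "yes"]
pred = ["yes", "no", "one", "violin", "no"]

vocab = build_answer_vocab(gold)   # answer string -> class index
score = accuracy(pred, gold)       # exact-match accuracy over the toy questions

print(vocab)
print(score)
```

With a fixed vocabulary like this, a model only needs to output one class index per question, and exact-match accuracy is a natural way to score it.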
Variants: MUSIC-AVQA
This dataset is used in 1 benchmark:
Task | Model | Paper | Date |
---|---|---|---|
Audio-visual Question Answering | CAD | CAD -- Contextual Multi-modal Alignment … | 2023-10-25 |
Audio-visual Question Answering | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29 |
Audio-visual Question Answering | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17 |
Audio-visual Question Answering | LAVISH | Vision Transformers are Parameter-Efficient Audio-Visual … | 2022-12-15 |
Audio-visual Question Answering | ST-AVQA | Learning to Answer Questions in … | 2022-03-26 |
Recent papers with results on this dataset: