MSRVTT-QA

Dataset Information
License
Unknown

Overview

MSRVTT-QA is a benchmark for Visual Question Answering (VQA) on videos from the MSR-VTT (Microsoft Research Video to Text) dataset. It evaluates how well models answer natural-language questions about the content of those videos. Question answering is one of several tasks the underlying MSR-VTT dataset is used for, alongside Video Retrieval, Video Captioning, Zero-Shot Video Question Answering, Zero-Shot Video Retrieval, and Text-to-Video Generation.
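
As a rough illustration of what "evaluating a model's ability to answer questions" can look like in practice, the sketch below scores predicted answers against reference answers by exact string match. It assumes accuracy is the reported metric; the function names and the normalization (lowercasing, whitespace collapsing) are illustrative assumptions, not the official evaluation protocol of this benchmark.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, illustrative normalization)."""
    return " ".join(answer.lower().split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference answer after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    # Hypothetical answers, made up for illustration only.
    preds = ["Dog", "two ", "car"]
    refs = ["dog", "two", "bike"]
    print(f"accuracy = {exact_match_accuracy(preds, refs):.3f}")
```

Published results may use a different answer vocabulary or matching rule, so numbers from a sketch like this are not directly comparable with the leaderboard entries.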

Variants: MSRVTT-QA

Associated Benchmarks

This dataset is used in 4 benchmarks:

Recent Benchmark Submissions

Task | Model | Paper | Date
Video Question Answering | MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model … | 2024-04-08
Visual Question Answering (VQA) | vid-TLDR (UMT-L) | vid-TLDR: Training Free Token merging … | 2024-03-20
Video Question Answering | Mirasol3B | Mirasol3B: A Multimodal Autoregressive model … | 2023-11-09
Visual Question Answering (VQA) | All-in-one+ | Open-vocabulary Video Question Answering: A … | 2023-08-18
Visual Question Answering (VQA) | FrozenBiLM+ | Open-vocabulary Video Question Answering: A … | 2023-08-18
Visual Question Answering (VQA) | JustAsk+ | Open-vocabulary Video Question Answering: A … | 2023-08-18
Visual Question Answering (VQA) | GIT+MDF | Self-Adaptive Sampling for Efficient Video … | 2023-07-09
Visual Question Answering (VQA) | AIO+MIF | Self-Adaptive Sampling for Efficient Video … | 2023-07-09
Visual Question Answering (VQA) | AIO+MDF | Self-Adaptive Sampling for Efficient Video … | 2023-07-09
Video Question Answering | COSA | COSA: Concatenated Sample Pretrained Vision-Language … | 2023-06-15
Video Question Answering | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation … | 2023-05-29
Visual Question Answering (VQA) | VLAB | VLAB: Enhancing Video Language Pre-training … | 2023-05-22
Video Question Answering | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model … | 2023-04-17
Visual Question Answering (VQA) | MaMMUT | MaMMUT: A Simple Architecture for … | 2023-03-29
Visual Question Answering (VQA) | UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Video … | 2023-03-28
Visual Question Answering (VQA) | HBI | Video-Text as Game Players: Hierarchical … | 2023-03-25
Video Question Answering | HBI | Video-Text as Game Players: Hierarchical … | 2023-03-25
Visual Question Answering (VQA) | MuLTI | MuLTI: Efficient Video-and-Language Understanding with … | 2023-03-10
Video Question Answering | mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation … | 2023-02-01
Visual Question Answering (VQA) | mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation … | 2023-02-01

Research Papers

Recent papers with results on this dataset: