
MSRVTT-QA

Visual Question Answering (VQA) Benchmark

MSRVTT-QA is an open-ended video question answering benchmark built on clips from the MSR-VTT dataset; models answer natural-language questions about video content and are scored by answer accuracy.

Performance Over Time

📊 33 results | 📏 Metric: Accuracy
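The metric is plain exact-match answer accuracy: MSRVTT-QA questions are open-ended with a single ground-truth answer, so a prediction counts as correct only if it matches the reference. Scores in the table below are fractions, so 0.50 corresponds to 50% accuracy. A minimal sketch of the computation (function and variable names are illustrative, not taken from any particular evaluation toolkit):

```python
from typing import Dict

def vqa_accuracy(predictions: Dict[str, str], references: Dict[str, str]) -> float:
    """Fraction of questions whose predicted answer exactly matches the
    ground-truth answer after simple normalization (lowercase, strip).

    MSRVTT-QA is open-ended with one ground-truth answer per question,
    so exact-match accuracy is the standard metric.
    """
    if not references:
        raise ValueError("no reference answers given")
    correct = 0
    for qid, gold in references.items():
        # Missing predictions count as wrong rather than raising.
        pred = predictions.get(qid, "").strip().lower()
        if pred == gold.strip().lower():
            correct += 1
    return correct / len(references)

# Example: 2 of 3 answers match -> accuracy ~0.67
preds = {"q1": "Dog", "q2": "running", "q3": "two"}
refs = {"q1": "dog", "q2": "walking", "q3": "two"}
print(f"{vqa_accuracy(preds, refs):.2f}")
```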

Top Performing Models

| Rank | Model | Paper | Accuracy | Date | Code |
|---|---|---|---|---|---|
| 1 | VLAB | 📚 VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | 0.50 | 2023-05-22 | - |
| 2 | MaMMUT | 📚 MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | 0.50 | 2023-03-29 | 📦 lucidrains/mammut-pytorch |
| 3 | mPLUG-2 | 📚 mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 0.48 | 2023-02-01 | 📦 modelscope/modelscope, 📦 x-plug/mplug-owl, 📦 alibaba/AliceMind, 📦 X-PLUG/mPLUG-2 |
| 4 | MuLTI | 📚 MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | 0.48 | 2023-03-10 | - |
| 5 | Flamingo | 📚 Flamingo: a Visual Language Model for Few-Shot Learning | 0.47 | 2022-04-29 | 📦 mlfoundations/open_flamingo, 📦 lucidrains/flamingo-pytorch, 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models, 📦 doc-doc/NExT-OE, 📦 happen2me/cross-gnn |
| 6 | InternVideo | 📚 InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 0.47 | 2022-12-06 | 📦 opengvlab/internvideo, 📦 yingsen1/unimd |
| 7 | UMT-L (ViT-L/16) | 📚 Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 0.47 | 2023-03-28 | 📦 opengvlab/unmasked_teacher |
| 8 | FrozenBiLM+ | 📚 Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 0.47 | 2023-08-18 | 📦 mlvlab/ovqa |
| 9 | vid-TLDR (UMT-L) | 📚 vid-TLDR: Training Free Token merging for Light-weight Video Transformer | 0.47 | 2024-03-20 | 📦 mlvlab/vid-tldr |
| 10 | VideoCoCa | 📚 VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 0.46 | 2022-12-09 | - |

All Papers (33)