MSRVTT-QA

Video Question Answering Benchmark

Showing 14 results | Metric: Accuracy

Top Performing Models

| Rank | Model | Paper | Accuracy | Date | Code |
|------|-------|-------|----------|------|------|
| 1 | Mirasol3B | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | 50.42 | 2023-11-09 | - |
| 2 | VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 50.10 | 2023-05-29 | TXH-mercury/VALOR, txh-mercury/vast |
| 3 | VALOR | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 49.20 | 2023-04-17 | TXH-mercury/VALOR |
| 4 | COSA | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 49.20 | 2023-06-15 | txh-mercury/cosa |
| 5 | MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | 48.50 | 2024-04-08 | boheumd/MA-LMM |
| 6 | mPLUG-2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | 48.00 | 2023-02-01 | modelscope/modelscope, x-plug/mplug-owl, alibaba/AliceMind, X-PLUG/mPLUG-2 |
| 7 | FrozenBiLM | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 47.00 | 2022-06-16 | antoyang/FrozenBiLM, klauscc/dam, sts-vlcc/sts-vlcc |
| 8 | HBI | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | 46.20 | 2023-03-25 | jpthu17/emcl, jpthu17/diffusionret, jpthu17/HBI, jpthu17/dicosa |
| 9 | EMCL-Net | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | 45.80 | 2022-11-21 | jpthu17/emcl, jpthu17/diffusionret, jpthu17/HBI, jpthu17/dicosa |
| 10 | VindLU | VindLU: A Recipe for Effective Video-and-Language Pretraining | 44.60 | 2022-12-09 | klauscc/vindlu |

All Papers (14)