MSRVTT-QA

Video Question Answering Benchmark

Performance Over Time

📊 Showing 14 results | 📏 Metric: Accuracy

Top Performing Models

Rank	Model	Paper	Accuracy	Date	Code
1	Mirasol3B	Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	50.42	2023-11-09	-
2	VAST 📚	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	50.10	2023-05-29	📦 TXH-mercury/VALOR 📦 txh-mercury/vast
3	VALOR 📚	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	49.20	2023-04-17	📦 TXH-mercury/VALOR
4	COSA 📚	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	49.20	2023-06-15	📦 txh-mercury/cosa
5	MA-LMM	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	48.50	2024-04-08	📦 boheumd/MA-LMM
6	mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	48.00	2023-02-01	📦 modelscope/modelscope 📦 x-plug/mplug-owl 📦 alibaba/AliceMind 📦 X-PLUG/mPLUG-2
7	FrozenBiLM 📚	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	47.00	2022-06-16	📦 antoyang/FrozenBiLM 📦 klauscc/dam 📦 sts-vlcc/sts-vlcc
8	HBI	Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	46.20	2023-03-25	📦 jpthu17/emcl 📦 jpthu17/diffusionret 📦 jpthu17/HBI 📦 jpthu17/dicosa
9	EMCL-Net	Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	45.80	2022-11-21	📦 jpthu17/emcl 📦 jpthu17/diffusionret 📦 jpthu17/HBI 📦 jpthu17/dicosa
10	VindLU 📚	VindLU: A Recipe for Effective Video-and-Language Pretraining	44.60	2022-12-09	📦 klauscc/vindlu

All Papers (14)

Paper	Year	Model	Code
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	2023	Mirasol3B	-
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023	VAST	TXH-mercury/VALOR txh-mercury/vast
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023	VALOR	TXH-mercury/VALOR
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	2023	COSA	txh-mercury/cosa
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	2024	MA-LMM	boheumd/MA-LMM
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video	2023	mPLUG-2	modelscope/modelscope x-plug/mplug-owl
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022	FrozenBiLM	antoyang/FrozenBiLM klauscc/dam sts-vlcc/sts-vlcc
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning	2023	HBI	jpthu17/emcl jpthu17/diffusionret
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	2022	EMCL-Net	jpthu17/emcl jpthu17/diffusionret
VindLU: A Recipe for Effective Video-and-Language Pretraining	2022	VindLU	klauscc/vindlu
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	2022	VIOLETv2	tsujuifu/pytorch_empirical-mvm
Revealing Single Frame Bias for Video-and-Language Learning	2022	Singularity-temporal	jayleicn/ClipBERT jayleicn/singularity
Revealing Single Frame Bias for Video-and-Language Learning	2022	Singularity	jayleicn/ClipBERT jayleicn/singularity
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022	FrozenBiLM (0-shot)	antoyang/FrozenBiLM klauscc/dam sts-vlcc/sts-vlcc

All Papers (14)

Model	Paper	Accuracy	Date
Mirasol3B	Mirasol3B: A Multimodal Autoregressive model for …	50.42	2023-11-09
VAST	VAST: A Vision-Audio-Subtitle-Text Omni-Modality …	50.10	2023-05-29
VALOR	VALOR: Vision-Audio-Language Omni-Perception Pret…	49.20	2023-04-17
COSA	COSA: Concatenated Sample Pretrained Vision-Langu…	49.20	2023-06-15
MA-LMM	MA-LMM: Memory-Augmented Large Multimodal Model f…	48.50	2024-04-08
mPLUG-2	mPLUG-2: A Modularized Multi-modal Foundation Mod…	48.00	2023-02-01
FrozenBiLM	Zero-Shot Video Question Answering via Frozen Bid…	47.00	2022-06-16
HBI	Video-Text as Game Players: Hierarchical Banzhaf …	46.20	2023-03-25
EMCL-Net	Expectation-Maximization Contrastive Learning for…	45.80	2022-11-21
VindLU	VindLU: A Recipe for Effective Video-and-Language…	44.60	2022-12-09
VIOLETv2	An Empirical Study of End-to-End Video-Language T…	44.50	2022-09-04
Singularity-temporal	Revealing Single Frame Bias for Video-and-Languag…	43.90	2022-06-07
Singularity	Revealing Single Frame Bias for Video-and-Languag…	43.50	2022-06-07
FrozenBiLM (0-shot)	Zero-Shot Video Question Answering via Frozen Bid…	16.70	2022-06-16