| Model | Paper | Accuracy (%) | Date |
|---|---|---|---|
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | Composing Ensembles of Pre-trained Models via Ite… | 61.20 | 2022-10-20 |
| GPT-2 + CLIP-32 (Zero-Shot) | Composing Ensembles of Pre-trained Models via Ite… | 58.40 | 2022-10-20 |
| VideoCoCa | VideoCoCa: Video-Text Modeling with Zero-Shot Tra… | 56.10 | 2022-12-09 |
| Mirasol3B | Mirasol3B: A Multimodal Autoregressive model for … | 51.13 | 2023-11-09 |
| VAST | VAST: A Vision-Audio-Subtitle-Text Omni-Modality … | 50.40 | 2023-05-29 |
| COSA | COSA: Concatenated Sample Pretrained Vision-Langu… | 49.90 | 2023-06-15 |
| MA-LMM | MA-LMM: Memory-Augmented Large Multimodal Model f… | 49.80 | 2024-04-08 |
| VideoChat2 | MVBench: A Comprehensive Multi-modal Video Unders… | 49.10 | 2023-11-28 |
| VALOR | VALOR: Vision-Audio-Language Omni-Perception Pret… | 48.60 | 2023-04-17 |
| UMT-L (ViT-L/16) | Unmasked Teacher: Towards Training-Efficient Vide… | 47.90 | 2023-03-28 |
| LLaMA-VID-13B (2 Token) | LLaMA-VID: An Image is Worth 2 Tokens in Large La… | 47.50 | 2023-11-28 |
| LLaMA-VID-7B (2 Token) | LLaMA-VID: An Image is Worth 2 Tokens in Large La… | 47.40 | 2023-11-28 |
| Chat-UniVi-13B | Chat-UniVi: Unified Visual Representation Empower… | 46.40 | 2023-11-14 |
| BT-Adapter (zero-shot) | BT-Adapter: Video Conversation is Feasible Withou… | 46.10 | 2023-09-27 |
| MovieChat | MovieChat: From Dense Token to Sparse Memory for … | 45.70 | 2023-07-31 |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representatio… | 45.30 | 2023-11-16 |
| TESTA (ViT-B/16) | TESTA: Temporal-Spatial Token Aggregation for Lon… | 45.00 | 2023-10-29 |
| FrozenBiLM+ | Open-vocabulary Video Question Answering: A New B… | 44.80 | 2023-08-18 |
| VindLU | VindLU: A Recipe for Effective Video-and-Language… | 44.70 | 2022-12-09 |
| Singularity-temporal | Revealing Single Frame Bias for Video-and-Languag… | 44.10 | 2022-06-07 |
| FrozenBiLM | Zero-Shot Video Question Answering via Frozen Bid… | 43.20 | 2022-06-16 |
| Singularity | Revealing Single Frame Bias for Video-and-Languag… | 43.10 | 2022-06-07 |
| Text + Text (no Multimodal Pretext Training) | Towards Fast Adaptation of Pretrained Contrastive… | 41.40 | 2022-06-05 |
| All-in-one+ | Open-vocabulary Video Question Answering: A New B… | 40.00 | 2023-08-18 |
| VIOLET+ | Open-vocabulary Video Question Answering: A New B… | 39.70 | 2023-08-18 |
| Just Ask (fine-tune) | Just Ask: Learning to Answer Questions from Milli… | 38.90 | 2020-12-01 |
| LocVLM-Vid-B+ | Learning to Localize Objects Improves Spatial Rea… | 38.20 | 2024-04-11 |
| LocVLM-Vid-B | Learning to Localize Objects Improves Spatial Rea… | 37.40 | 2024-04-11 |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understandi… | 35.20 | 2023-06-08 |
| LLaMA Adapter V2 | LLaMA-Adapter V2: Parameter-Efficient Visual Inst… | 34.20 | 2023-04-28 |
| E-SA | ActivityNet-QA: A Dataset for Understanding Compl… | 31.80 | 2019-06-06 |
| E-MN | ActivityNet-QA: A Dataset for Understanding Compl… | 27.10 | 2019-06-06 |
| Video Chat | VideoChat: Chat-Centric Video Understanding | 26.50 | 2023-05-10 |
| FrozenBiLM (0-shot) | Zero-Shot Video Question Answering via Frozen Bid… | 25.90 | 2022-06-16 |
| E-VQA | ActivityNet-QA: A Dataset for Understanding Compl… | 25.10 | 2019-06-06 |
| Just Ask (0-shot) | Just Ask: Learning to Answer Questions from Milli… | 12.20 | 2020-12-01 |