How2QA

Video Question Answering Benchmark

Showing 7 results | Metric: Accuracy

Top Performing Models

Rank	Model	Paper	Accuracy	Date	Code
1	Text + Text (no Multimodal Pretext Training)	Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval	93.20	2022-06-05	xudonglinthu/upgradable-multimodal-intelligence
2	FrozenBiLM	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	86.70	2022-06-16	antoyang/FrozenBiLM, klauscc/dam, sts-vlcc/sts-vlcc
3	Just Ask	Just Ask: Learning to Answer Questions from Millions of Narrated Videos	84.40	2020-12-01	antoyang/just-ask
4	Hero w/ pre-training	HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training	77.75	2020-05-01	linjieli222/HERO, linjieli222/hero_video_feature_extractor, grounded-sport-convai/goal-baselines
5	ATP	Revisiting the "Video" in Video-Language Understanding	65.10	2022-06-03	stanfordvl/atp-video-language
6	FrozenBiLM (0-shot)	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	58.40	2022-06-16	antoyang/FrozenBiLM, klauscc/dam, sts-vlcc/sts-vlcc
7	Just Ask (0-shot)	Just Ask: Learning to Answer Questions from Millions of Narrated Videos	51.10	2020-12-01	antoyang/just-ask

All Papers (7)

Paper	Year	Model	Code
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval	2022	Text + Text (no Multimodal Pretext Training)	xudonglinthu/upgradable-multimodal-intelligence
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022	FrozenBiLM	antoyang/FrozenBiLM, klauscc/dam, sts-vlcc/sts-vlcc
Just Ask: Learning to Answer Questions from Millions of Narrated Videos	2020	Just Ask	antoyang/just-ask
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training	2020	Hero w/ pre-training	linjieli222/HERO, linjieli222/hero_video_feature_extractor, grounded-sport-convai/goal-baselines
Revisiting the "Video" in Video-Language Understanding	2022	ATP	stanfordvl/atp-video-language
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022	FrozenBiLM (0-shot)	antoyang/FrozenBiLM, klauscc/dam, sts-vlcc/sts-vlcc
Just Ask: Learning to Answer Questions from Millions of Narrated Videos	2020	Just Ask (0-shot)	antoyang/just-ask
