NExT-QA

Video Question Answering Benchmark

[Chart: Performance Over Time]

📊 Showing 47 results | 📏 Metric: Accuracy

Top Performing Models

Rank	Model	Paper	Accuracy (%)	Date	Code
1	LinVT-Qwen2-VL (7B)	LinVT: Empower Your Image-level Large Language Model to Understand Videos	85.50	2024-12-06	📦 gls0425/linvt
2	InternVL-2.5(8B)	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	85.50	2024-12-06	📦 opengvlab/internvl
3	VideoLLaMA3(7B)	VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding	84.50	2025-01-22	📦 damo-nlp-sg/videollama3
4	PLM-8B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	84.10	2025-04-17	📦 facebookresearch/perception_models
5	BIMBA-LLaVA-Qwen2-7B	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering	83.73	2025-03-12	📦 md-mohaiminul/BIMBA
6	PLM-3B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	83.40	2025-04-17	📦 facebookresearch/perception_models
7	LLaVA-Video	Video Instruction Tuning With Synthetic Data	83.20	2024-10-03	-
8	NVILA(8B)	NVILA: Efficient Frontier Visual Language Models	82.20	2024-12-05	📦 nvlabs/vila 📦 efficient-large-model/vila
9	Oryx-1.5(7B)	Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	81.80	2024-09-19	📦 oryx-mllm/oryx
10	Qwen2-VL(7B)	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	81.20	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl 📦 juruobenruo/DexVLA

All Papers (47)

Model	Paper	Accuracy (%)	Date	Code
LinVT-Qwen2-VL (7B)	LinVT: Empower Your Image-level Large Language Model to Understand Videos	85.50	2024-12-06	📦 gls0425/linvt
InternVL-2.5(8B)	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	85.50	2024-12-06	📦 opengvlab/internvl
VideoLLaMA3(7B)	VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding	84.50	2025-01-22	📦 damo-nlp-sg/videollama3
PLM-8B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	84.10	2025-04-17	📦 facebookresearch/perception_models
BIMBA-LLaVA-Qwen2-7B	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering	83.73	2025-03-12	📦 md-mohaiminul/BIMBA
PLM-3B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	83.40	2025-04-17	📦 facebookresearch/perception_models
LLaVA-Video	Video Instruction Tuning With Synthetic Data	83.20	2024-10-03	-
NVILA(8B)	NVILA: Efficient Frontier Visual Language Models	82.20	2024-12-05	📦 nvlabs/vila 📦 efficient-large-model/vila
Oryx-1.5(7B)	Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	81.80	2024-09-19	📦 oryx-mllm/oryx
Qwen2-VL(7B)	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	81.20	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl
LongVILA(7B)	LongVILA: Scaling Long-Context Visual Language Models for Long Videos	80.70	2024-08-19	📦 nvlabs/vila
PLM-1B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	80.30	2025-04-17	📦 facebookresearch/perception_models
LLaVA-OV(72B)	LLaVA-OneVision: Easy Visual Task Transfer	80.20	2024-08-06	📦 evolvinglmms-lab/lmms-eval 📦 MindSpore-scientific-2/code-14
VideoChat2_HD_mistral	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	79.50	2023-11-28	📦 opengvlab/ask-anything 📦 magic-research/PLLaVA 📦 bytedance/tarsier
LLaVA-OV(7B)	LLaVA-OneVision: Easy Visual Task Transfer	79.40	2024-08-06	📦 evolvinglmms-lab/lmms-eval 📦 MindSpore-scientific-2/code-14
LLaVA-NeXT-Interleave(14B)	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	79.10	2024-07-10	📦 LLaVA-VL/LLaVA-NeXT 📦 pwc-1/Paper-9 📦 dinhvietcuong1996/icme25-inova
VideoChat2_mistral	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	78.60	2023-11-28	📦 opengvlab/ask-anything 📦 magic-research/PLLaVA 📦 bytedance/tarsier
mPLUG-Owl3(8B)	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	78.60	2024-08-09	📦 x-plug/mplug-owl
LLaVA-NeXT-Interleave(7B)	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	78.20	2024-07-10	📦 LLaVA-VL/LLaVA-NeXT 📦 pwc-1/Paper-9 📦 dinhvietcuong1996/icme25-inova
LLaVA-NeXT-Interleave(DPO)	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	77.90	2024-07-10	📦 LLaVA-VL/LLaVA-NeXT 📦 pwc-1/Paper-9 📦 dinhvietcuong1996/icme25-inova
Vamos	Vamos: Versatile Action Models for Video Understanding	77.30	2023-11-22	📦 brown-palm/Vamos
ViLA (3B)	ViLA: Efficient Video-Language Alignment for Video Question Answering	75.60	2023-12-13	📦 xijun-cs/vila
VideoLLaMA2.1(7B)	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	75.60	2024-06-11	📦 damo-nlp-sg/videollama2 📦 damo-nlp-sg/videollama3 📦 damo-nlp-sg/inf-clip
LLaMA-VQA (33B)	Large Language Models are Temporal and Causal Reasoners for Video Question Answering	75.50	2023-10-24	📦 mlvlab/Flipped-VQA
ViLA (3B, 4 frames)	ViLA: Efficient Video-Language Alignment for Video Question Answering	74.40	2023-12-13	📦 xijun-cs/vila
CREMA	CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion	73.90	2024-02-08	📦 Yui010206/CREMA
SeViLA	Self-Chained Image-Language Model for Video Localization and Question Answering	73.80	2023-05-11	📦 yui010206/sevila
TCR	Text-Conditioned Resampler For Long Form Video Understanding	73.50	2023-12-19	-
LSTP	Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge	72.10	2024-02-25	📦 bigai-nlco/videotgb 📦 bigai-nlco/lstp-chat
Mirasol3B	Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	72.00	2023-11-09	-
VideoChat2	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	68.60	2023-11-28	📦 opengvlab/ask-anything 📦 magic-research/PLLaVA 📦 bytedance/tarsier
RTQ	RTQ: Rethinking Video-language Understanding Based on Image-text Model	63.20	2023-12-01	📦 SCZwangxiao/RTQ-MM2023 📦 sczwangxiao/tsgvs-mm2023
HiTeA	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	63.10	2022-12-30	-
CoVGT(PT)	Contrastive Video Question Answering via Video Graph Transformer	60.70	2023-02-27	📦 doc-doc/covgt
SeViT	Semi-Parametric Video-Grounded Text Generation	60.60	2023-01-27	-
ViperGPT(0-shot)	ViperGPT: Visual Inference via Python Execution for Reasoning	60.00	2023-03-14	📦 cvlab-columbia/viper
CoVGT	Contrastive Video Question Answering via Video Graph Transformer	60.00	2023-02-27	📦 doc-doc/covgt
GF	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering	58.83	2024-01-03	📦 byz0e/glance-focus
VFC	Verbs in Action: Improving verb understanding in video-language models	58.60	2023-04-13	📦 google-research/scenic
ATM	ATM: Action Temporality Modeling for Video Question Answering	58.30	2023-09-05	-
MIST	MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering	57.20	2022-12-19	📦 showlab/mist
VGT(PT)	Video Graph Transformer for Video Question Answering	56.90	2022-07-12	📦 sail-sg/vgt
PAXION	Paxion: Patching Action Knowledge in Video-Language Foundation Models	56.90	2023-05-18	📦 mikewangwzhl/paxion
VGT	Video Graph Transformer for Video Question Answering	55.00	2022-07-12	📦 sail-sg/vgt
ATP	Revisiting the "Video" in Video-Language Understanding	54.30	2022-06-03	📦 stanfordvl/atp-video-language
P3D-G	(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering	53.40	2022-02-18	-
HQGA	Video as Conditional Graph Hierarchy for Multi-Granular Question Answering	51.40	2021-12-12	📦 doc-doc/hqga
