NExT-QA

Video Question Answering Benchmark

Showing 47 results. Metric: Accuracy

Top Performing Models

| Rank | Model | Paper | Accuracy (%) | Date | Code |
|------|-------|-------|--------------|------|------|
| 1 | LinVT-Qwen2-VL (7B) | LinVT: Empower Your Image-level Large Language Model to Understand Videos | 85.50 | 2024-12-06 | gls0425/linvt |
| 2 | InternVL-2.5(8B) | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | 85.50 | 2024-12-06 | opengvlab/internvl |
| 3 | VideoLLaMA3(7B) | VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | 84.50 | 2025-01-22 | damo-nlp-sg/videollama3 |
| 4 | PLM-8B | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 84.10 | 2025-04-17 | facebookresearch/perception_models |
| 5 | BIMBA-LLaVA-Qwen2-7B | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | 83.73 | 2025-03-12 | md-mohaiminul/BIMBA |
| 6 | PLM-3B | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 83.40 | 2025-04-17 | facebookresearch/perception_models |
| 7 | LLaVA-Video | Video Instruction Tuning With Synthetic Data | 83.20 | 2024-10-03 | - |
| 8 | NVILA(8B) | NVILA: Efficient Frontier Visual Language Models | 82.20 | 2024-12-05 | nvlabs/vila, efficient-large-model/vila |
| 9 | Oryx-1.5(7B) | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | 81.80 | 2024-09-19 | oryx-mllm/oryx |
| 10 | Qwen2-VL(7B) | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 81.20 | 2024-09-18 | qwenlm/qwen2-vl, qwenlm/qwen2.5-vl, juruobenruo/DexVLA |

All Papers (47)