NExT-QA

Dataset Information

Modalities

Videos, Texts, Actions

Languages

English

Introduced

2021

License

MIT

Homepage

Official Website

Contents

Overview
Associated Benchmarks
Recent Benchmark Submissions
Research Papers

Overview

NExT-QA is a video question answering (VideoQA) benchmark that targets the explanation of video content. It challenges QA models to reason about the causal and temporal structure of actions and to understand the rich object interactions in daily activities, e.g., "Why is the boy crying?" and "How does the lady react after the boy falls backward?". It supports both multiple-choice and open-ended (generative) QA tasks. The videos are untrimmed, and answering a question usually requires only a local segment of the video.

Variants: NExT-QA, NExT-GQA, NExT-QA (Open-ended VideoQA)

Associated Benchmarks

This dataset is used in 1 benchmark:

Video Question Answering - Metrics: Accuracy
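
For the multiple-choice task, accuracy is conventionally the fraction of questions for which the predicted option matches the ground-truth option. The following is a minimal sketch of that computation, not the official evaluation script; the function name and the integer option-index encoding are assumptions made for illustration.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted option index
    equals the ground-truth option index.

    Both arguments are hypothetical encodings: one integer option index
    per question, in the same question order.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Made-up example: four questions, three answered correctly.
preds = [0, 3, 2, 4]
gold = [0, 1, 2, 4]
accuracy = multiple_choice_accuracy(preds, gold)
print(f"Accuracy: {accuracy:.2%}")
```

Leaderboards usually report this value as a percentage. Open-ended answers cannot be scored by exact index matching in this way and need a different measure.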

Recent Benchmark Submissions

Task	Model	Paper	Date
Video Question Answering	PLM-8B	PerceptionLM: Open-Access Data and Models …	2025-04-17
Video Question Answering	PLM-3B	PerceptionLM: Open-Access Data and Models …	2025-04-17
Video Question Answering	PLM-1B	PerceptionLM: Open-Access Data and Models …	2025-04-17
Video Question Answering	BIMBA-LLaVA-Qwen2-7B	BIMBA: Selective-Scan Compression for Long-Range …	2025-03-12
Video Question Answering	VideoLLaMA3(7B)	VideoLLaMA 3: Frontier Multimodal Foundation …	2025-01-22
Video Question Answering	LinVT-Qwen2-VL (7B)	LinVT: Empower Your Image-level Large …	2024-12-06
Video Question Answering	InternVL-2.5(8B)	Expanding Performance Boundaries of Open-Source …	2024-12-06
Video Question Answering	NVILA(8B)	NVILA: Efficient Frontier Visual Language …	2024-12-05
Video Question Answering	LLaVA-Video	Video Instruction Tuning With Synthetic …	2024-10-03
Video Question Answering	Oryx-1.5(7B)	Oryx MLLM: On-Demand Spatial-Temporal Understanding …	2024-09-19
Video Question Answering	Qwen2-VL(7B)	Qwen2-VL: Enhancing Vision-Language Model's Perception …	2024-09-18
Video Question Answering	LongVILA(7B)	LongVILA: Scaling Long-Context Visual Language …	2024-08-19
Video Question Answering	mPLUG-Owl3(8B)	mPLUG-Owl3: Towards Long Image-Sequence Understanding …	2024-08-09
Video Question Answering	LLaVA-OV(7B)	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06
Video Question Answering	LLaVA-OV(72B)	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06
Video Question Answering	LLaVA-NeXT-Interleave(DPO)	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and …	2024-07-10
Video Question Answering	LLaVA-NeXT-Interleave(7B)	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and …	2024-07-10
Video Question Answering	LLaVA-NeXT-Interleave(14B)	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and …	2024-07-10
Video Question Answering	VideoLLaMA2.1(7B)	VideoLLaMA 2: Advancing Spatial-Temporal Modeling …	2024-06-11
Video Question Answering	LSTP	Efficient Temporal Extrapolation of Multimodal …	2024-02-25

Research Papers

Recent papers with results on this dataset:

External Links:

NExT-QA
