NExT-QA

Dataset Information
Modalities: Videos, Texts, Actions
Languages: English
Introduced: 2021
License: MIT

Overview

NExT-QA is a VideoQA benchmark that targets the explanation of video content. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities, e.g., "why is the boy crying?" and "How does the lady react after the boy fall backward?". It supports both multiple-choice and generative open-ended QA tasks. The videos are untrimmed, and answering a question usually depends on a local segment of the video rather than the whole clip.
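As a rough illustration of the multiple-choice task described above, the sketch below scores predictions by plain accuracy. The `MultiChoiceSample` schema, its field names, the video IDs, and the candidate options are all hypothetical placeholders invented for this example; they are not the official NExT-QA file format or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MultiChoiceSample:
    """One multiple-choice item (hypothetical schema, not the official format)."""
    video_id: str
    question: str
    options: Sequence[str]  # candidate answers shown to the model
    answer_index: int       # position of the correct option in `options`


def multi_choice_accuracy(samples: Sequence[MultiChoiceSample],
                          predictions: Sequence[int]) -> float:
    """Return the fraction of samples whose predicted option index is correct."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    correct = sum(1 for s, p in zip(samples, predictions) if p == s.answer_index)
    return correct / len(samples)


# Placeholder items: the questions come from the overview text,
# while the IDs and options are made up for illustration only.
samples = [
    MultiChoiceSample(
        video_id="video_a",
        question="why is the boy crying?",
        options=["option 0", "option 1", "option 2", "option 3", "option 4"],
        answer_index=0,
    ),
    MultiChoiceSample(
        video_id="video_b",
        question="How does the lady react after the boy fall backward?",
        options=["option 0", "option 1", "option 2", "option 3", "option 4"],
        answer_index=3,
    ),
]

# One correct and one incorrect prediction.
accuracy = multi_choice_accuracy(samples, [0, 1])
print(f"accuracy = {accuracy:.2f}")
```

Open-ended answers cannot be scored by index matching like this; they would need a text-similarity or judged metric, which this sketch does not cover.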

Variants: NExT-QA, NExT-GQA, NExT-QA (Open-ended VideoQA)

Associated Benchmarks

This dataset is used in 1 benchmark:

Recent Benchmark Submissions

Task | Model | Paper | Date
Video Question Answering | PLM-8B | PerceptionLM: Open-Access Data and Models … | 2025-04-17
Video Question Answering | PLM-3B | PerceptionLM: Open-Access Data and Models … | 2025-04-17
Video Question Answering | PLM-1B | PerceptionLM: Open-Access Data and Models … | 2025-04-17
Video Question Answering | BIMBA-LLaVA-Qwen2-7B | BIMBA: Selective-Scan Compression for Long-Range … | 2025-03-12
Video Question Answering | VideoLLaMA3(7B) | VideoLLaMA 3: Frontier Multimodal Foundation … | 2025-01-22
Video Question Answering | LinVT-Qwen2-VL (7B) | LinVT: Empower Your Image-level Large … | 2024-12-06
Video Question Answering | InternVL-2.5(8B) | Expanding Performance Boundaries of Open-Source … | 2024-12-06
Video Question Answering | NVILA(8B) | NVILA: Efficient Frontier Visual Language … | 2024-12-05
Video Question Answering | LLaVA-Video | Video Instruction Tuning With Synthetic … | 2024-10-03
Video Question Answering | Oryx-1.5(7B) | Oryx MLLM: On-Demand Spatial-Temporal Understanding … | 2024-09-19
Video Question Answering | Qwen2-VL(7B) | Qwen2-VL: Enhancing Vision-Language Model's Perception … | 2024-09-18
Video Question Answering | LongVILA(7B) | LongVILA: Scaling Long-Context Visual Language … | 2024-08-19
Video Question Answering | mPLUG-Owl3(8B) | mPLUG-Owl3: Towards Long Image-Sequence Understanding … | 2024-08-09
Video Question Answering | LLaVA-OV(7B) | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06
Video Question Answering | LLaVA-OV(72B) | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06
Video Question Answering | LLaVA-NeXT-Interleave(DPO) | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and … | 2024-07-10
Video Question Answering | LLaVA-NeXT-Interleave(7B) | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and … | 2024-07-10
Video Question Answering | LLaVA-NeXT-Interleave(14B) | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and … | 2024-07-10
Video Question Answering | VideoLLaMA2.1(7B) | VideoLLaMA 2: Advancing Spatial-Temporal Modeling … | 2024-06-11
Video Question Answering | LSTP | Efficient Temporal Extrapolation of Multimodal … | 2024-02-25

Research Papers

Recent papers with results on this dataset: