TVBench

Dataset Information
Modalities: Videos, Texts
Introduced: 2024

Overview

TVBench is a new benchmark specifically created to evaluate temporal understanding in video QA. We identified three main issues in existing datasets: (i) static information from a single frame is often sufficient to solve the task; (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input; and (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues, while automatic evaluation with LLMs is unreliable, making it an unsuitable alternative.
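
To make the second issue concrete, the sketch below shows the shape of a multiple-choice evaluation loop together with a blind baseline that never sees the video, which gives the chance-level floor to compare against. The sample schema and the random-choice baseline are illustrative assumptions, not part of the TVBench release; the point is that if a model answering from the question text alone scores far above this floor, the questions leak the answer without any visual input.

```python
# Minimal sketch of a blind (text-only) reference check for multiple-choice
# video QA. The sample fields below are hypothetical, not the TVBench schema.
import random

samples = [
    {"video": "clip_0001.mp4",
     "question": "What did the person do after opening the door?",
     "candidates": ["sat down", "stood up"],
     "answer": "sat down"},
    # ... more multiple-choice samples ...
]

def blind_baseline(question, candidates):
    """A "model" that never looks at the video: picks a candidate at random."""
    return random.choice(candidates)

def accuracy(answer_fn, samples):
    correct = sum(answer_fn(s["question"], s["candidates"]) == s["answer"]
                  for s in samples)
    return correct / len(samples)

print(f"chance-level (blind) accuracy: {accuracy(blind_baseline, samples):.1%}")
```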

We defined 10 temporally challenging tasks that require repetition counting (Action Count), reasoning about the properties of moving objects (Object Shuffle, Object Count, Moving Direction), temporal localization (Action Localization, Unexpected Action), temporal sequential ordering (Action Sequence, Scene Transition, Egocentric Sequence), or distinguishing between temporally hard Action Antonyms such as "Standing up" and "Sitting down".
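
Since every task is multiple-choice, a per-task accuracy breakdown over these ten categories is the natural way to report results. The sketch below assumes a simple list of prediction records with hypothetical field names (task, prediction, answer); it is not the official TVBench evaluation script.

```python
# Hedged sketch: per-task accuracy for a multiple-choice video-QA benchmark.
# The record fields ("task", "prediction", "answer") are hypothetical and
# not taken from the official TVBench evaluation code.
from collections import defaultdict

def per_task_accuracy(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["prediction"] == r["answer"])
    return {task: hits[task] / totals[task] for task in totals}

records = [
    {"task": "Action Count", "prediction": "3", "answer": "3"},
    {"task": "Action Antonym", "prediction": "Standing up", "answer": "Sitting down"},
    {"task": "Object Shuffle", "prediction": "left cup", "answer": "left cup"},
]

for task, acc in sorted(per_task_accuracy(records).items()):
    print(f"{task:<20s} {acc:.1%}")
```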

Variants: TVBench

Associated Benchmarks

This dataset is used in 1 benchmark: Video Question Answering on TVBench.

Recent Benchmark Submissions

Task | Model | Paper | Date
Video Question Answering | V-JEPA 2 ViT-g 8B | V-JEPA 2: Self-Supervised Video Models … | 2025-06-11
Video Question Answering | Seed1.5-VL thinking | Seed1.5-VL Technical Report | 2025-05-11
Video Question Answering | Seed1.5-VL | Seed1.5-VL Technical Report | 2025-05-11
Video Question Answering | PLM-1B | PerceptionLM: Open-Access Data and Models … | 2025-04-17
Video Question Answering | PLM-3B | PerceptionLM: Open-Access Data and Models … | 2025-04-17
Video Question Answering | PLM-8B | PerceptionLM: Open-Access Data and Models … | 2025-04-17
Video Question Answering | RRPO | Self-alignment of Large Video Language … | 2025-04-16
Video Question Answering | Tarsier2-7B | Tarsier2: Advancing Large Vision-Language Models … | 2025-01-14
Video Question Answering | GPT-4o (8 frames) | GPT-4o System Card | 2024-10-25
Video Question Answering | Aria | Aria: An Open Multimodal Native … | 2024-10-08
Video Question Answering | LLaVA-Video 72B | Video Instruction Tuning With Synthetic … | 2024-10-03
Video Question Answering | LLaVA-Video 7B | Video Instruction Tuning With Synthetic … | 2024-10-03
Video Question Answering | Qwen2-VL-7B | Qwen2-VL: Enhancing Vision-Language Model's Perception … | 2024-09-18
Video Question Answering | Qwen2-VL-72B | Qwen2-VL: Enhancing Vision-Language Model's Perception … | 2024-09-18
Video Question Answering | mPLUG-Owl3 | mPLUG-Owl3: Towards Long Image-Sequence Understanding … | 2024-08-09
Video Question Answering | IXC-2.5 7B | InternLM-XComposer-2.5: A Versatile Large Vision … | 2024-07-03
Video Question Answering | Tarsier-7B | Tarsier: Recipes for Training and … | 2024-06-30
Video Question Answering | Tarsier-34B | Tarsier: Recipes for Training and … | 2024-06-30
Video Question Answering | VideoGPT+ | VideoGPT+: Integrating Image and Video … | 2024-06-13
Video Question Answering | VideoLLaMA2 7B | VideoLLaMA 2: Advancing Spatial-Temporal Modeling … | 2024-06-11

Research Papers

Recent papers with results on this dataset: