TVBench

Video Question Answering Benchmark

Performance Over Time

Showing 28 results | Metric: Average Accuracy

Top Performing Models

Rank	Model	Paper	Average Accuracy	Date	Code
1	Seed1.5-VL thinking	Seed1.5-VL Technical Report	63.60	2025-05-11	-
2	PLM-8B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	63.50	2025-04-17	📦 facebookresearch/perception_models
3	Seed1.5-VL	Seed1.5-VL Technical Report	61.50	2025-05-11	-
4	V-JEPA 2 ViT-g 8B	V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning	60.60	2025-06-11	📦 facebookresearch/vjepa2
5	PLM-3B	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	58.90	2025-04-17	📦 facebookresearch/perception_models
6	RRPO	Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization	56.50	2025-04-16	-
7	Tarsier-34B	Tarsier: Recipes for Training and Evaluating Large Video Description Models	55.50	2024-06-30	📦 bytedance/tarsier
8	Tarsier2-7B	Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding	54.70	2025-01-14	📦 bytedance/tarsier
9	Qwen2-VL-72B	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	52.70	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl 📦 juruobenruo/DexVLA
10	IXC-2.5 7B	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	51.60	2024-07-03	📦 internlm/internlm-xcomposer

All Papers (28)

Paper	Year	Model	Code
Seed1.5-VL Technical Report	2025	Seed1.5-VL thinking	-
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025	PLM-8B	facebookresearch/perception_models
Seed1.5-VL Technical Report	2025	Seed1.5-VL	-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning	2025	V-JEPA 2 ViT-g 8B	facebookresearch/vjepa2
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025	PLM-3B	facebookresearch/perception_models
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization	2025	RRPO	-
Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024	Tarsier-34B	bytedance/tarsier
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding	2025	Tarsier2-7B	bytedance/tarsier
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024	Qwen2-VL-72B	qwenlm/qwen2-vl qwenlm/qwen2.5-vl
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	2024	IXC-2.5 7B	internlm/internlm-xcomposer
Aria: An Open Multimodal Native Mixture-of-Experts Model	2024	Aria	rhymes-ai/aria
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025	PLM-1B	facebookresearch/perception_models
Video Instruction Tuning With Synthetic Data	2024	LLaVA-Video 72B	-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024	VideoLLaMA2 72B	damo-nlp-sg/videollama2 damo-nlp-sg/videollama3 damo-nlp-sg/inf-clip
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024	Gemini 1.5 Pro	dlvuldet/primevul
Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024	Tarsier-7B	bytedance/tarsier
Video Instruction Tuning With Synthetic Data	2024	LLaVA-Video 7B	-
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024	Qwen2-VL-7B	qwenlm/qwen2-vl qwenlm/qwen2.5-vl
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024	VideoLLaMA2 7B	damo-nlp-sg/videollama2 damo-nlp-sg/videollama3 damo-nlp-sg/inf-clip
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024	PLLaVA-34B	magic-research/PLLaVA
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	2024	mPLUG-Owl3	x-plug/mplug-owl
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024	VideoLLaMA2.1	damo-nlp-sg/videollama2 damo-nlp-sg/videollama3 damo-nlp-sg/inf-clip
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024	VideoGPT+	mbzuai-oryx/videogpt-plus
GPT-4o System Card	2024	GPT4o 8 frames	-
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024	PLLaVA-13B	magic-research/PLLaVA
ST-LLM: Large Language Models Are Effective Temporal Learners	2024	ST-LLM	TencentARC/ST-LLM
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023	VideoChat2	opengvlab/ask-anything magic-research/PLLaVA bytedance/tarsier
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024	PLLaVA-7B	magic-research/PLLaVA

Model	Paper	Average Accuracy	Date
Seed1.5-VL thinking	Seed1.5-VL Technical Report	63.60	2025-05-11
PLM-8B	PerceptionLM: Open-Access Data and Models for Det…	63.50	2025-04-17
Seed1.5-VL	Seed1.5-VL Technical Report	61.50	2025-05-11
V-JEPA 2 ViT-g 8B	V-JEPA 2: Self-Supervised Video Models Enable Und…	60.60	2025-06-11
PLM-3B	PerceptionLM: Open-Access Data and Models for Det…	58.90	2025-04-17
RRPO	Self-alignment of Large Video Language Models wit…	56.50	2025-04-16
Tarsier-34B	Tarsier: Recipes for Training and Evaluating Larg…	55.50	2024-06-30
Tarsier2-7B	Tarsier2: Advancing Large Vision-Language Models …	54.70	2025-01-14
Qwen2-VL-72B	Qwen2-VL: Enhancing Vision-Language Model's Perce…	52.70	2024-09-18
IXC-2.5 7B	InternLM-XComposer-2.5: A Versatile Large Vision …	51.60	2024-07-03
Aria	Aria: An Open Multimodal Native Mixture-of-Expert…	51.00	2024-10-08
PLM-1B	PerceptionLM: Open-Access Data and Models for Det…	50.40	2025-04-17
LLaVA-Video 72B	Video Instruction Tuning With Synthetic Data	50.00	2024-10-03
VideoLLaMA2 72B	VideoLLaMA 2: Advancing Spatial-Temporal Modeling…	48.40	2024-06-11
Gemini 1.5 Pro	Gemini 1.5: Unlocking multimodal understanding ac…	47.60	2024-03-08
Tarsier-7B	Tarsier: Recipes for Training and Evaluating Larg…	46.90	2024-06-30
LLaVA-Video 7B	Video Instruction Tuning With Synthetic Data	45.60	2024-10-03
Qwen2-VL-7B	Qwen2-VL: Enhancing Vision-Language Model's Perce…	43.80	2024-09-18
VideoLLaMA2 7B	VideoLLaMA 2: Advancing Spatial-Temporal Modeling…	42.90	2024-06-11
PLLaVA-34B	PLLaVA : Parameter-free LLaVA Extension from Imag…	42.30	2024-04-25
mPLUG-Owl3	mPLUG-Owl3: Towards Long Image-Sequence Understan…	42.20	2024-08-09
VideoLLaMA2.1	VideoLLaMA 2: Advancing Spatial-Temporal Modeling…	42.10	2024-06-11
VideoGPT+	VideoGPT+: Integrating Image and Video Encoders f…	41.70	2024-06-13
GPT4o 8 frames	GPT-4o System Card	39.90	2024-10-25
PLLaVA-13B	PLLaVA : Parameter-free LLaVA Extension from Imag…	36.40	2024-04-25
ST-LLM	ST-LLM: Large Language Models Are Effective Tempo…	35.70	2024-03-30
VideoChat2	MVBench: A Comprehensive Multi-modal Video Unders…	35.00	2023-11-28
PLLaVA-7B	PLLaVA : Parameter-free LLaVA Extension from Imag…	34.90	2024-04-25