MVBench

Video Question Answering Benchmark

Showing 22 results. Metric: Avg.

Top Performing Models

Rank	Model	Paper	Avg.	Date	Code
1	LinVT-Qwen2-VL (7B)	LinVT: Empower Your Image-level Large Language Model to Understand Videos	69.30	2024-12-06	📦 gls0425/linvt
2	Tarsier (34B)	Tarsier: Recipes for Training and Evaluating Large Video Description Models	67.60	2024-06-30	📦 bytedance/tarsier
3	InternVideo2	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	67.20	2024-03-22	📦 opengvlab/internvideo 📦 opengvlab/internvideo2
4	LongVU (7B)	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	66.90	2024-10-22	📦 Vision-CAIR/LongVU
5	Oryx (34B)	Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	64.70	2024-09-19	📦 oryx-mllm/oryx
6	VideoLLaMA2 (72B)	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	62.00	2024-06-11	📦 damo-nlp-sg/videollama2 📦 damo-nlp-sg/videollama3 📦 damo-nlp-sg/inf-clip
7	VideoChat-T (7B)	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	59.90	2024-10-25	📦 OpenGVLab/TimeSuite
8	mPLUG-Owl3 (7B)	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	59.50	2024-08-09	📦 x-plug/mplug-owl
9	PPLLaVA (7B)	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	59.20	2024-11-04	📦 farewellthree/ppllava
10	VideoGPT+	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	58.70	2024-06-13	📦 mbzuai-oryx/videogpt-plus

All Papers (22)

LinVT: Empower Your Image-level Large Language Model to Understand Videos

2024

LinVT-Qwen2-VL (7B)

gls0425/linvt

Tarsier: Recipes for Training and Evaluating Large Video Description Models

2024

Tarsier (34B)

bytedance/tarsier

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

2024

InternVideo2

opengvlab/internvideo opengvlab/internvideo2

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

2024

LongVU (7B)

Vision-CAIR/LongVU

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

2024

Oryx (34B)

oryx-mllm/oryx

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

2024

VideoLLaMA2 (72B)

damo-nlp-sg/videollama2 damo-nlp-sg/videollama3 damo-nlp-sg/inf-clip

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

2024

VideoChat-T (7B)

OpenGVLab/TimeSuite

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

2024

mPLUG-Owl3 (7B)

x-plug/mplug-owl

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

2024

PPLLaVA (7B)

farewellthree/ppllava

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

2024

VideoGPT+

mbzuai-oryx/videogpt-plus

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

2024

PLLaVA

magic-research/PLLaVA

ST-LLM: Large Language Models Are Effective Temporal Learners

2024

ST-LLM

TencentARC/ST-LLM

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

2023

VideoChat2

opengvlab/ask-anything magic-research/PLLaVA bytedance/tarsier

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

2024

HawkEye

yellow-binary-tree/hawkeye

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

2024

SPHINX-Plus

alpha-vllm/llama2-accessory

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

2023

TimeChat

renshuhuai-andy/timechat lntzm/cvpr24track-longvideo

Visual Instruction Tuning

2023

LLaVA

huggingface/transformers haotian-liu/LLaVA

VideoChat: Chat-Centric Video Understanding

2023

VideoChat

opengvlab/ask-anything

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

2023

VideoLLaMA

damo-nlp-sg/video-llama damo-nlp-sg/videollama2

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

2023

Video-ChatGPT

mbzuai-oryx/video-chatgpt qiujihao19/artemis

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

2023

InstructBLIP

salesforce/lavis tabtoyou/kollava

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

2023

MiniGPT4

vision-cair/minigpt-4 zyang1580/binllm

All Papers (22)

Model	Paper	Avg.	Date
LinVT-Qwen2-VL (7B)	LinVT: Empower Your Image-level Large Language Mo…	69.30	2024-12-06
Tarsier (34B)	Tarsier: Recipes for Training and Evaluating Larg…	67.60	2024-06-30
InternVideo2	InternVideo2: Scaling Foundation Models for Multi…	67.20	2024-03-22
LongVU (7B)	LongVU: Spatiotemporal Adaptive Compression for L…	66.90	2024-10-22
Oryx (34B)	Oryx MLLM: On-Demand Spatial-Temporal Understandi…	64.70	2024-09-19
VideoLLaMA2 (72B)	VideoLLaMA 2: Advancing Spatial-Temporal Modeling…	62.00	2024-06-11
VideoChat-T (7B)	TimeSuite: Improving MLLMs for Long Video Underst…	59.90	2024-10-25
mPLUG-Owl3 (7B)	mPLUG-Owl3: Towards Long Image-Sequence Understan…	59.50	2024-08-09
PPLLaVA (7B)	PPLLaVA: Varied Video Sequence Understanding With…	59.20	2024-11-04
VideoGPT+	VideoGPT+: Integrating Image and Video Encoders f…	58.70	2024-06-13
PLLaVA	PLLaVA: Parameter-free LLaVA Extension from Imag…	58.10	2024-04-25
ST-LLM	ST-LLM: Large Language Models Are Effective Tempo…	54.90	2024-03-30
VideoChat2	MVBench: A Comprehensive Multi-modal Video Unders…	51.90	2023-11-28
HawkEye	HawkEye: Training Video-Text LLMs for Grounding T…	47.55	2024-03-15
SPHINX-Plus	SPHINX-X: Scaling Data and Parameters for a Famil…	39.70	2024-02-08
TimeChat	TimeChat: A Time-sensitive Multimodal Large Langu…	38.50	2023-12-04
LLaVA	Visual Instruction Tuning	36.00	2023-04-17
VideoChat	VideoChat: Chat-Centric Video Understanding	35.50	2023-05-10
VideoLLaMA	Video-LLaMA: An Instruction-tuned Audio-Visual La…	34.10	2023-06-05
Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understandi…	32.70	2023-06-08
InstructBLIP	InstructBLIP: Towards General-purpose Vision-Lang…	32.50	2023-05-11
MiniGPT4	MiniGPT-4: Enhancing Vision-Language Understandin…	18.80	2023-04-20