OVBench

Video Question Answering Benchmark

📊 Showing 15 results | 📏 Metric: AVG

Top Performing Models

Rank	Model	Paper	AVG	Date	Code
1	Seed1.5-VL	Seed1.5-VL Technical Report	60.00	2025-05-11	-
2	VideoChat-Online (4B)	Online Video Understanding: OVBench and VideoChat-Online	54.90	2024-12-31	📦 MCG-NJU/VideoChat-Online
3	Gemini-1.5-Flash	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	50.70	2024-03-08	📦 dlvuldet/primevul
4	Qwen2-VL (7B)	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	49.70	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl 📦 juruobenruo/DexVLA
5	LLaVA-OneVision (7B)	LLaVA-OneVision: Easy Visual Task Transfer	49.50	2024-08-06	📦 evolvinglmms-lab/lmms-eval 📦 MindSpore-scientific-2/code-14
6	InternVL2 (7B)	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	48.70	2024-12-06	📦 opengvlab/internvl
7	InternVL2 (4B)	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	44.10	2024-12-06	📦 opengvlab/internvl
8	LongVA (7B)	Long Context Transfer from Language to Vision	43.60	2024-06-24	📦 jzhang38/EasyContext 📦 evolvinglmms-lab/longva
9	LLaMA-VID (7B)	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	41.90	2023-11-28	📦 lastmile-ai/aiconfig 📦 dvlab-research/llama-vid
10	VTimeLLM (7B)	VTimeLLM: Empower LLM to Grasp Video Moments	33.10	2023-11-30	📦 huangb23/vtimellm

All Papers (15)

Seed1.5-VL Technical Report

2025

Seed1.5-VL

Online Video Understanding: OVBench and VideoChat-Online

2024

VideoChat-Online (4B)

MCG-NJU/VideoChat-Online

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

2024

Gemini-1.5-Flash

dlvuldet/primevul

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

2024

Qwen2-VL (7B)

qwenlm/qwen2-vl qwenlm/qwen2.5-vl

LLaVA-OneVision: Easy Visual Task Transfer

2024

LLaVA-OneVision (7B)

evolvinglmms-lab/lmms-eval MindSpore-scientific-2/code-14

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

2024

InternVL2 (7B)

opengvlab/internvl

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

2024

InternVL2 (4B)

opengvlab/internvl

Long Context Transfer from Language to Vision

2024

LongVA (7B)

jzhang38/EasyContext evolvinglmms-lab/longva

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

2023

LLaMA-VID (7B)

lastmile-ai/aiconfig dvlab-research/llama-vid

VTimeLLM: Empower LLM to Grasp Video Moments

2023

VTimeLLM (7B)

huangb23/vtimellm

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

2024

Flash-VStream (7B)

IVGSZ/Flash-VStream

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

2023

MovieChat (7B)

rese1f/MovieChat

LITA: Language Instructed Temporal-Localization Assistant

2024

LITA (7B)

nvlabs/lita

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

2023

TimeChat (7B)

renshuhuai-andy/timechat lntzm/cvpr24track-longvideo

VideoLLM-online: Online Video Large Language Model for Streaming Video

2024

VideoLLM-Online (7B)

Model	Paper	AVG	Date
Seed1.5-VL	Seed1.5-VL Technical Report	60.00	2025-05-11
VideoChat-Online (4B)	Online Video Understanding: OVBench and VideoChat…	54.90	2024-12-31
Gemini-1.5-Flash	Gemini 1.5: Unlocking multimodal understanding ac…	50.70	2024-03-08
Qwen2-VL (7B)	Qwen2-VL: Enhancing Vision-Language Model's Perce…	49.70	2024-09-18
LLaVA-OneVision (7B)	LLaVA-OneVision: Easy Visual Task Transfer	49.50	2024-08-06
InternVL2 (7B)	Expanding Performance Boundaries of Open-Source M…	48.70	2024-12-06
InternVL2 (4B)	Expanding Performance Boundaries of Open-Source M…	44.10	2024-12-06
LongVA (7B)	Long Context Transfer from Language to Vision	43.60	2024-06-24
LLaMA-VID (7B)	LLaMA-VID: An Image is Worth 2 Tokens in Large La…	41.90	2023-11-28
VTimeLLM (7B)	VTimeLLM: Empower LLM to Grasp Video Moments	33.10	2023-11-30
Flash-VStream (7B)	Flash-VStream: Memory-Based Real-Time Understandi…	31.20	2024-06-12
MovieChat (7B)	MovieChat: From Dense Token to Sparse Memory for …	30.90	2023-07-31
LITA (7B)	LITA: Language Instructed Temporal-Localization A…	20.40	2024-03-27
TimeChat (7B)	TimeChat: A Time-sensitive Multimodal Large Langu…	12.80	2023-12-04
VideoLLM-Online (7B)	VideoLLM-online: Online Video Large Language Mode…	9.60	2024-06-17