VLM2-Bench

Visual Question Answering (VQA) Benchmark

Performance Over Time

Showing 9 results | Metric: GC-mat

Top Performing Models

| Rank | Model | Paper | GC-mat | Date | Code |
|------|-------|-------|--------|------|------|
| 1 | GPT-4o | GPT-4o System Card | 37.45 | 2024-10-25 | - |
| 2 | Qwen2.5-VL-7B | Qwen2.5-VL Technical Report | 35.91 | 2025-02-19 | qwenlm/qwen2-vl, qwenlm/qwen2.5-vl, likaixin2000/screenspot-pro-gui-grounding, princeton-nlp/CharXiv |
| 3 | InternVL2.5-26B | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | 30.50 | 2024-12-06 | opengvlab/internvl |
| 4 | Qwen2-VL-7B | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 27.80 | 2024-09-18 | qwenlm/qwen2-vl, qwenlm/qwen2.5-vl, juruobenruo/DexVLA |
| 5 | InternVL2.5-8B | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | 21.24 | 2024-12-06 | opengvlab/internvl |
| 6 | LLaVA-Video-7B | Video Instruction Tuning With Synthetic Data | 18.53 | 2024-10-03 | - |
| 7 | mPLUG-Owl3-7B | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | 17.37 | 2024-08-09 | x-plug/mplug-owl |
| 8 | LLaVA-OneVision-7B | LLaVA-OneVision: Easy Visual Task Transfer | 16.60 | 2024-08-06 | evolvinglmms-lab/lmms-eval, MindSpore-scientific-2/code-14 |
| 9 | LongVA-7B | Long Context Transfer from Language to Vision | 14.29 | 2024-06-24 | jzhang38/EasyContext, evolvinglmms-lab/longva |

All Papers (9)