MM-Vet v2

Visual Question Answering Benchmark

Performance Over Time

📊 Showing 17 results | 📏 Metric: GPT-4 score

Top Performing Models

Rank	Model	Paper	Paper Date	Code
1	GPT-4o (gpt-4o-2024-11-20)	GPT-4 Technical Report	2023-03-15	📦 openai/evals 📦 shmsw25/factscore 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models
2	GPT-4o (gpt-4o-2024-05-13)	GPT-4 Technical Report	2023-03-15	📦 openai/evals 📦 shmsw25/factscore 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models
3	Gemini 1.5 Pro	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024-03-08	📦 dlvuldet/primevul
4	Qwen2-VL-72B (qwen-vl-max-0809)	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	📦 qwenlm/qwen2-vl 📦 qwenlm/qwen2.5-vl 📦 juruobenruo/DexVLA
5	gpt-4o-mini-2024-07-18	GPT-4 Technical Report	2023-03-15	📦 openai/evals 📦 shmsw25/factscore 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models
6	GPT-4 Turbo (gpt-4-0125-preview)	GPT-4 Technical Report	2023-03-15	📦 openai/evals 📦 shmsw25/factscore 📦 unispac/visual-adversarial-examples-jailbreak-large-language-models
7	Gemini Pro Vision	Gemini: A Family of Highly Capable Multimodal Models	2023-12-19	📦 valdecy/pybibx
8	Qwen-VL-Max	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	📦 qwenlm/qwen-vl 📦 brandon3964/multimodal-task-vector
9	InternVL-Chat-V1-5	How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites	2024-04-25	📦 opengvlab/internvl
10	CogVLM-Chat	CogVLM: Visual Expert for Pretrained Language Models	2023-11-06	📦 thudm/cogvlm 📦 THUDM/CogAgent 📦 2024-MindSpore-1/Code2 📦 MS-P3/code5

All Papers (17)

Paper	Year	Model	Code
GPT-4 Technical Report	2023	GPT-4o (gpt-4o-2024-11-20)	openai/evals shmsw25/factscore
GPT-4 Technical Report	2023	GPT-4o (gpt-4o-2024-05-13)	openai/evals shmsw25/factscore
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024	Gemini 1.5 Pro	dlvuldet/primevul
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024	Qwen2-VL-72B (qwen-vl-max-0809)	qwenlm/qwen2-vl qwenlm/qwen2.5-vl
GPT-4 Technical Report	2023	gpt-4o-mini-2024-07-18	openai/evals shmsw25/factscore
GPT-4 Technical Report	2023	GPT-4 Turbo (gpt-4-0125-preview)	openai/evals shmsw25/factscore
Gemini: A Family of Highly Capable Multimodal Models	2023	Gemini Pro Vision	valdecy/pybibx
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023	Qwen-VL-Max	qwenlm/qwen-vl brandon3964/multimodal-task-vector
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites	2024	InternVL-Chat-V1-5	opengvlab/internvl
CogVLM: Visual Expert for Pretrained Language Models	2023	CogVLM-Chat	thudm/cogvlm THUDM/CogAgent
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model	2024	IXC2-VL-7B	internlm/internlm-xcomposer
Generative Multimodal Models are In-Context Learners	2023	Emu2-Chat	baaivision/emu
CogAgent: A Visual Language Model for GUI Agents	2023	CogAgent-Chat	thudm/cogvlm THUDM/CogAgent digirl-agent/digirl
Improved Baselines with Visual Instruction Tuning	2023	LLaVA-v1.5-13B	huggingface/transformers haotian-liu/LLaVA
Improved Baselines with Visual Instruction Tuning	2023	LLaVA-v1.5-7B	huggingface/transformers haotian-liu/LLaVA
MIMIC-IT: Multi-Modal In-Context Instruction Tuning	2023	Otter-9B	luodian/otter One-2-3-45/One-2-3-45
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models	2023	OpenFlamingo-9B	mlfoundations/open_flamingo luodian/otter

Model	Paper	Paper Date
GPT-4o (gpt-4o-2024-11-20)	GPT-4 Technical Report	2023-03-15
GPT-4o (gpt-4o-2024-05-13)	GPT-4 Technical Report	2023-03-15
Gemini 1.5 Pro	Gemini 1.5: Unlocking multimodal understanding ac…	2024-03-08
Qwen2-VL-72B (qwen-vl-max-0809)	Qwen2-VL: Enhancing Vision-Language Model's Perce…	2024-09-18
gpt-4o-mini-2024-07-18	GPT-4 Technical Report	2023-03-15
GPT-4 Turbo (gpt-4-0125-preview)	GPT-4 Technical Report	2023-03-15
Gemini Pro Vision	Gemini: A Family of Highly Capable Multimodal Mod…	2023-12-19
Qwen-VL-Max	Qwen-VL: A Versatile Vision-Language Model for Un…	2023-08-24
InternVL-Chat-V1-5	How Far Are We to GPT-4V? Closing the Gap to Comm…	2024-04-25
CogVLM-Chat	CogVLM: Visual Expert for Pretrained Language Mod…	2023-11-06
IXC2-VL-7B	InternLM-XComposer2: Mastering Free-form Text-Ima…	2024-01-29
Emu2-Chat	Generative Multimodal Models are In-Context Learn…	2023-12-20
CogAgent-Chat	CogAgent: A Visual Language Model for GUI Agents	2023-12-14
LLaVA-v1.5-13B	Improved Baselines with Visual Instruction Tuning	2023-10-05
LLaVA-v1.5-7B	Improved Baselines with Visual Instruction Tuning	2023-10-05
Otter-9B	MIMIC-IT: Multi-Modal In-Context Instruction Tuni…	2023-06-08
OpenFlamingo-9B	OpenFlamingo: An Open-Source Framework for Traini…	2023-08-02