BenchLMM

Visual Question Answering Benchmark

Metric: GPT-3.5 score (10 results)

Top Performing Models

Rank	Model	Paper	GPT-3.5 score	Date	Code
1	GPT-4V	GPT-4 Technical Report	58.37	2023-03-15	openai/evals, shmsw25/factscore, unispac/visual-adversarial-examples-jailbreak-large-language-models
2	Sphinx-V2-1K	SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models	57.43	2023-11-13	alpha-vllm/llama2-accessory
3	LLaVA-1.5-13B	Improved Baselines with Visual Instruction Tuning	55.53	2023-10-05	huggingface/transformers, haotian-liu/LLaVA, LLaVA-VL/LLaVA-NeXT
4	LLaVA-1.5-7B	Improved Baselines with Visual Instruction Tuning	46.83	2023-10-05	huggingface/transformers, haotian-liu/LLaVA, LLaVA-VL/LLaVA-NeXT
5	InstructBLIP-13B	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	45.03	2023-05-11	salesforce/lavis, tabtoyou/kollava, pwc-1/Paper-9, MS-P3/code3
6	InstructBLIP-7B	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	44.63	2023-05-11	salesforce/lavis, tabtoyou/kollava, pwc-1/Paper-9, MS-P3/code3
7	LLaVA-1-13B	Visual Instruction Tuning	43.50	2023-04-17	huggingface/transformers, haotian-liu/LLaVA, LLaVA-VL/LLaVA-NeXT
8	Otter-7B	Otter: A Multi-Modal Model with In-Context Instruction Tuning	39.13	2023-05-05	luodian/otter
9	MiniGPT4-13B	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	34.93	2023-04-20	vision-cair/minigpt-4, zyang1580/binllm, 2024-MindSpore-1/Code6
10	MiniGPTv2-7B	MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	30.10	2023-10-14	vision-cair/minigpt-4, zebangcheng/emotion-llama
