InfiMM-Eval

Visual Question Answering (VQA) Benchmark

Performance Over Time

[Chart: Overall score over time, 14 results]

Top Performing Models

| Rank | Model | Paper | Overall score | Date | Code |
|------|-------|-------|---------------|------|------|
| 1 | GPT-4V | GPT-4 Technical Report | 77.88 | 2023-03-15 | openai/evals, shmsw25/factscore, unispac/visual-adversarial-examples-jailbreak-large-language-models |
| 2 | SPHINX v2 | SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | 49.85 | 2023-11-13 | alpha-vllm/llama2-accessory |
| 3 | LLaVA-1.5 | Improved Baselines with Visual Instruction Tuning | 47.91 | 2023-10-05 | huggingface/transformers, haotian-liu/LLaVA, LLaVA-VL/LLaVA-NeXT |
| 4 | CogVLM-Chat | CogVLM: Visual Expert for Pretrained Language Models | 47.88 | 2023-11-06 | thudm/cogvlm, THUDM/CogAgent, 2024-MindSpore-1/Code2, MS-P3/code5 |
| 5 | LLaMA-Adapter V2 | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 46.12 | 2023-04-28 | opengvlab/llama-adapter, zrrskywalker/llama-adapter, Mind23-2/MindCode-140 |
| 6 | Qwen-VL-Chat | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | 44.39 | 2023-08-24 | qwenlm/qwen-vl, brandon3964/multimodal-task-vector |
| 7 | InstructBLIP | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | 37.76 | 2023-05-11 | salesforce/lavis, tabtoyou/kollava, pwc-1/Paper-9, MS-P3/code3 |
| 8 | Emu | Emu: Generative Pretraining in Multimodality | 36.57 | 2023-07-11 | baaivision/emu, doc-doc/NExT-OE |
| 9 | InternLM-XComposer-VL | InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | 35.97 | 2023-09-26 | internlm/internlm-xcomposer, MindSpore-scientific-2/code-14, MS-P3/code3 |
| 10 | Otter | Otter: A Multi-Modal Model with In-Context Instruction Tuning | 33.64 | 2023-05-05 | luodian/otter |
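As a minimal sketch using only values taken from the table above (the tuple layout is an assumption, not a published data format), the top-10 rows can be represented programmatically and ordered either chronologically, as in the performance-over-time chart, or by Overall score, as in the ranking:

```python
from datetime import date

# Top-10 InfiMM-Eval leaderboard rows: (rank, model, overall_score, paper_date)
leaderboard = [
    (1, "GPT-4V", 77.88, date(2023, 3, 15)),
    (2, "SPHINX v2", 49.85, date(2023, 11, 13)),
    (3, "LLaVA-1.5", 47.91, date(2023, 10, 5)),
    (4, "CogVLM-Chat", 47.88, date(2023, 11, 6)),
    (5, "LLaMA-Adapter V2", 46.12, date(2023, 4, 28)),
    (6, "Qwen-VL-Chat", 44.39, date(2023, 8, 24)),
    (7, "InstructBLIP", 37.76, date(2023, 5, 11)),
    (8, "Emu", 36.57, date(2023, 7, 11)),
    (9, "InternLM-XComposer-VL", 35.97, date(2023, 9, 26)),
    (10, "Otter", 33.64, date(2023, 5, 5)),
]

# Chronological order, as plotted in the performance-over-time chart.
by_date = sorted(leaderboard, key=lambda row: row[3])

# Descending Overall score reproduces the published ranks.
by_score = sorted(leaderboard, key=lambda row: row[2], reverse=True)
assert [row[0] for row in by_score] == list(range(1, 11))

print(by_date[0][1])   # → GPT-4V (earliest paper date)
print(by_score[0][1])  # → GPT-4V (highest Overall score)
```

Note that rank order and date order differ: the top-scoring entry, GPT-4V, is also the oldest in this top 10.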
