| Model | Paper | Score | Date |
| --- | --- | --- | --- |
| MMCTAgent (GPT-4 + GPT-4V) | MMCTAgent: Multi-modal Critical Thinking Agent Fr… | 74.24 | 2024-05-28 |
| Qwen2-VL-72B | Qwen2-VL: Enhancing Vision-Language Model's Perce… | 74.00 | 2024-09-18 |
| InternVL2.5-78B | Expanding Performance Boundaries of Open-Source M… | 72.30 | 2024-12-06 |
| GPT-4o +text rationale +IoT | Image-of-Thought Prompting for Visual Reasoning R… | 72.20 | 2024-05-22 |
| Lyra-Pro | Lyra: An Efficient and Speech-Centric Framework f… | 71.40 | 2024-12-12 |
| GLM-4V-Plus | CogVLM2: Visual Language Models for Image and Vid… | 71.10 | 2024-08-29 |
| Phantom-7B | Phantom of Latent for Large Language and Vision M… | 70.80 | 2024-09-23 |
| InternVL2.5-38B | Expanding Performance Boundaries of Open-Source M… | 68.80 | 2024-12-06 |
| InternVL2-26B (SGP, token ratio 64%) | A Stitch in Time Saves Nine: Small VLM is a Preci… | 65.60 | 2024-12-04 |
| Baichuan-Omni (7B) | Baichuan-Omni Technical Report | 65.40 | 2024-10-11 |
| InternVL2.5-26B | Expanding Performance Boundaries of Open-Source M… | 65.00 | 2024-12-06 |
| Qwen2-VL-7B (finetuned on GAP-VQA train) | Gamified crowd-sourcing of high-quality data for … | 64.95 | 2024-10-05 |
| GLM4 Vision | CogVLM: Visual Expert for Pretrained Language Mod… | 63.90 | 2023-11-06 |
| LLaVA-OneVision-72B | LLaVA-OneVision: Easy Visual Task Transfer | 63.70 | 2024-08-06 |
| Lyra-Base | Lyra: An Efficient and Speech-Centric Framework f… | 63.50 | 2024-12-12 |
| InternVL2-26B (SGP, token ratio 35%) | A Stitch in Time Saves Nine: Small VLM is a Preci… | 63.20 | 2024-12-04 |
| InternVL 1.5 | How Far Are We to GPT-4V? Closing the Gap to Comm… | 62.80 | 2024-04-25 |
| InternVL2.5-8B | Expanding Performance Boundaries of Open-Source M… | 62.80 | 2024-12-06 |
| MAmmoTH-VL-8B | MAmmoTH-VL: Eliciting Multimodal Reasoning with I… | 62.30 | 2024-12-06 |
| Qwen2-VL-7B | Qwen2-VL: Enhancing Vision-Language Model's Perce… | 62.00 | 2024-09-18 |
| Mini-Gemini-HD-BS | Mini-Gemini: Mining the Potential of Multi-modali… | 60.80 | 2024-03-27 |
| InternVL2.5-2B | Expanding Performance Boundaries of Open-Source M… | 60.80 | 2024-12-06 |
| MAmmoTH-VL-8B (SI) | MAmmoTH-VL: Eliciting Multimodal Reasoning with I… | 60.60 | 2024-12-06 |
| InternVL2.5-4B | Expanding Performance Boundaries of Open-Source M… | 60.60 | 2024-12-06 |
| Mini-Gemini-HD | Mini-Gemini: Mining the Potential of Multi-modali… | 59.30 | 2024-03-27 |
| GLM-4V-9B | CogVLM2: Visual Language Models for Image and Vid… | 58.00 | 2024-08-29 |
| LLaVA-OneVision-7B | LLaVA-OneVision: Easy Visual Task Transfer | 57.50 | 2024-08-06 |
| Meteor | Meteor: Mamba-based Traversal of Rationale for La… | 57.30 | 2024-05-24 |
| CROME (Vicuna-13B) | CROME: Cross-Modal Adapters for Efficient Multimo… | 55.10 | 2024-08-13 |
| IXC2-4KHD | InternLM-XComposer2-4KHD: A Pioneering Large Visi… | 54.90 | 2024-04-09 |
| TroL-7B | TroL: Traversal of Layers for Large Language and … | 54.70 | 2024-06-18 |
| Mini-Gemini | Mini-Gemini: Mining the Potential of Multi-modali… | 53.00 | 2024-03-27 |
| CogVLM (Vicuna-7B) | CogVLM: Visual Expert for Pretrained Language Mod… | 52.80 | 2023-11-06 |
| CogAgent | CogAgent: A Visual Language Model for GUI Agents | 52.80 | 2023-12-14 |
| Qwen2-VL-2B (finetuned on GAP-VQA train) | Gamified crowd-sourcing of high-quality data for … | 52.43 | 2024-10-05 |
| InternVL2-26B (SGP, token ratio 9%) | A Stitch in Time Saves Nine: Small VLM is a Preci… | 52.10 | 2024-12-04 |
| MM1.5-30B | MM1.5: Methods, Analysis & Insights from Multimod… | 52.00 | 2024-09-30 |
| MiniCPM-Llama3-V-2.5-8B (finetuned on GAP-VQA train) | Gamified crowd-sourcing of high-quality data for … | 51.79 | 2024-10-05 |
| IXC-2.5-7B | InternLM-XComposer-2.5: A Versatile Large Vision … | 51.70 | 2024-07-03 |
| InternLM-XComposer2 | InternLM-XComposer2: Mastering Free-form Text-Ima… | 51.20 | 2024-01-29 |
| Lyra-Mini | Lyra: An Efficient and Speech-Centric Framework f… | 51.20 | 2024-12-12 |
| CuMo-7B | CuMo: Scaling Multimodal LLM with Co-Upcycled Mix… | 51.00 | 2024-05-09 |
| TACO (Qwen2-7B / SigLIP) | TACO: Learning Multi-modal Action Models with Syn… | 50.90 | 2024-12-07 |
| Qwen-VL-Chat (+ SFT (GPT-4V in VLFeedback)) | VLFeedback: A Large-Scale AI Feedback Dataset for… | 50.70 | 2024-10-12 |
| POINTS-9B | POINTS: Improving Your Vision-language Model with… | 50.00 | 2024-09-07 |
| VILA^2-8B | VILA$^2$: VILA Augmented VILA | 50.00 | 2024-07-24 |
| Janus-Pro-7B | Janus-Pro: Unified Multimodal Understanding and G… | 50.00 | 2025-01-29 |
| Silkie | Silkie: Preference Distillation for Large Visual … | 49.90 | 2023-12-17 |
| Silkie (Qwen-VL-Chat + DPO w/ VLFeedback) | VLFeedback: A Large-Scale AI Feedback Dataset for… | 49.90 | 2024-10-12 |
| Qwen2-VL-2B | Qwen2-VL: Enhancing Vision-Language Model's Perce… | 49.50 | 2024-09-18 |
| FlashSloth-HD | FlashSloth: Lightning Multimodal Large Language M… | 49.00 | 2024-12-05 |
| InternVL 1.2 | How Far Are We to GPT-4V? Closing the Gap to Comm… | 48.90 | 2024-04-25 |
| SEA-PRIME (Vicuna-13B) | SEA: Supervised Embedding Alignment for Token-Lev… | 48.80 | 2024-08-21 |
| InternVL2.5-1B | Expanding Performance Boundaries of Open-Source M… | 48.80 | 2024-12-06 |
| MM1-30B-Chat | MM1: Methods, Analysis & Insights from Multimodal… | 48.70 | 2024-03-14 |
| SETOKIM (13B) | Towards Semantic Equivalence of Tokenization in M… | 48.70 | 2024-06-07 |
| Emu2-Chat | Generative Multimodal Models are In-Context Learn… | 48.50 | 2023-12-20 |
| MG-LLaVA (34B) | MG-LLaVA: Towards Multi-Granularity Visual Instru… | 48.50 | 2024-06-25 |
| SPHINX-Plus | SPHINX-X: Scaling Data and Parameters for a Famil… | 47.90 | 2024-02-08 |
| ConvLLaVA | ConvLLaVA: Hierarchical Backbones as Visual Encod… | 45.90 | 2024-05-24 |
| VILA-13B | VILA: On Pre-training for Visual Language Models | 45.70 | 2023-12-12 |
| TACO (LLaMA3-8B / SigLIP) | TACO: Learning Multi-modal Action Models with Syn… | 45.70 | 2024-12-07 |
| TACO (LLaMA3-8B / CLIP) | TACO: Learning Multi-modal Action Models with Syn… | 45.20 | 2024-12-07 |
| LLaVA-v1.6 (7B, w/ STIC) | Enhancing Large Vision Language Models with Self-… | 45.00 | 2024-05-30 |
| H2OVL-Mississippi-2B | H2OVL-Mississippi Vision Language Models Technica… | 44.70 | 2024-10-17 |
| PIIP-LLaVA (Vicuna-7B, ConvNeXt-L, CLIP-L) | Parameter-Inverted Image Pyramid Networks for Vis… | 44.70 | 2025-01-14 |
| Imp-4B | Imp: Highly Capable Large Multimodal Models for M… | 44.60 | 2024-05-20 |
| LLaVA-Next-Mistral-7b (+ DPO w/ VLFeedback) | VLFeedback: A Large-Scale AI Feedback Dataset for… | 44.20 | 2024-10-12 |
| MGM-7B+RP | Img-Diff: Contrastive Data Synthesis for Multimod… | 44.10 | 2024-08-08 |
| LLaVA-Next-Vicuna-7b (+ DPO w/ VLFeedback) | VLFeedback: A Large-Scale AI Feedback Dataset for… | 44.10 | 2024-10-12 |
| VW-LMM | Multi-modal Auto-regressive Modeling via Visual W… | 44.00 | 2024-03-12 |
| MoAI | MoAI: Mixture of All Intelligence for Large Langu… | 43.70 | 2024-03-12 |
| MM1-3B-Chat | MM1: Methods, Analysis & Insights from Multimodal… | 43.70 | 2024-03-14 |
| MM1.5-3B-MoE | MM1.5: Methods, Analysis & Insights from Multimod… | 43.70 | 2024-09-30 |
| Imp-3B | Imp: Highly Capable Large Multimodal Models for M… | 43.30 | 2024-05-20 |
| ShareGPT4V-13B | ShareGPT4V: Improving Large Multi-Modal Models wi… | 43.10 | 2023-11-21 |
| Mini-Gemini (+MoCa) | Deciphering Cross-Modal Alignment in Large Vision… | 42.90 | 2024-10-09 |
| MM1.5-7B | MM1.5: Methods, Analysis & Insights from Multimod… | 42.20 | 2024-09-30 |
| MM1-7B-Chat | MM1: Methods, Analysis & Insights from Multimodal… | 42.10 | 2024-03-14 |
| FlashSloth | FlashSloth: Lightning Multimodal Large Language M… | 41.90 | 2024-12-05 |
| DeepSeek-VL | DeepSeek-VL: Towards Real-World Vision-Language U… | 41.50 | 2024-03-08 |
| LaVA1.5-13B-BPO | Strengthening Multimodal Large Language Model wit… | 41.40 | 2024-03-13 |
| ASMv2 | The All-Seeing Project V2: Towards General Relati… | 41.30 | 2024-02-29 |
| FocusLLaVA | FocusLLaVA: A Coarse-to-Fine Approach for Efficie… | 41.30 | 2024-11-21 |
| SeVa-13B | Self-Supervised Visual Preference Alignment | 41.00 | 2024-04-16 |
| MM1.5-3B | MM1.5: Methods, Analysis & Insights from Multimod… | 41.00 | 2024-09-30 |
| LLaVA-1.5-7B (VG-S) | ProVision: Programmatically Scaling Vision-centri… | 40.40 | 2024-12-09 |
| CoLLaVO | CoLLaVO: Crayon Large Language and Vision mOdel | 40.30 | 2024-02-17 |
| SPHINX-2k | SPHINX: The Joint Mixing of Weights, Tasks, and V… | 40.20 | 2023-11-13 |
| LLaVA-1.5 (LVIS-Instrcut4V) | To See is to Believe: Prompting GPT-4V for Better… | 40.20 | 2023-11-13 |
| mPLUG-Owl3 | mPLUG-Owl3: Towards Long Image-Sequence Understan… | 40.10 | 2024-08-09 |
| Mono-InternVL-2B | Mono-InternVL: Pushing the Boundaries of Monolith… | 40.10 | 2024-10-10 |
| LLaVA1.5-13B-MDA | Looking Beyond Text: Reducing Language bias in La… | 39.90 | 2024-11-21 |
| LLaVA-VT (Vicuna-13B) | Beyond Embeddings: The Promise of Visual Table in… | 39.80 | 2024-03-27 |
| MM1.5-1B-MoE | MM1.5: Methods, Analysis & Insights from Multimod… | 39.80 | 2024-09-30 |
| Janus-Pro-1B | Janus-Pro: Unified Multimodal Understanding and G… | 39.80 | 2025-01-29 |
| SQ-LLaVA∗ | SQ-LLaVA: Self-Questioning for Large Vision-Langu… | 39.70 | 2024-03-17 |
| OmniFusion (grid split + ruDocVQA) | OmniFusion Technical Report | 39.40 | 2024-04-09 |
| DeepStack-L-HD (Vicuna-13B) | DeepStack: Deeply Stacking Visual Tokens is Surpr… | 39.30 | 2024-06-06 |
| LAF-13B | From Training-Free to Adaptive: Empirical Insight… | 38.90 | 2024-01-31 |
| InfiMM-HD | InfiMM-HD: A Leap Forward in High-Resolution Mult… | 38.90 | 2024-03-03 |
| InternLM-XC2 + MMDU-45k | MMDU: A Multi-Turn Multi-Image Dialog Understandi… | 38.80 | 2024-06-17 |
| LLaVA-1.5-7B (DC-S) | ProVision: Programmatically Scaling Vision-centri… | 38.50 | 2024-12-09 |
| LayoutLMv3+ConvNeXt+CLIP | MouSi: Poly-Visual-Expert Vision-Language Models | 38.40 | 2024-01-30 |
| VOLCANO 13B | Volcano: Mitigating Multimodal Hallucination thro… | 38.00 | 2023-11-13 |
| LLaVA-1.5+MMInstruct (Vicuna-13B) | MMInstruct: A High-Quality Multi-Modal Instructio… | 37.90 | 2024-07-22 |
| LLaVA-1.5-13B (+CSR) | Calibrated Self-Rewarding Vision Language Models | 37.80 | 2024-05-23 |
| LLaVA-1.5-LLaMA3-8B | What If We Recaption Billions of Web Images with … | 37.80 | 2024-06-12 |
| LLaVA-1.5 + DenseFusion-1M (Vicuna-7B) | DenseFusion-1M: Merging Vision Experts for Compre… | 37.80 | 2024-07-11 |
| ShareGPT4V-7B | ShareGPT4V: Improving Large Multi-Modal Models wi… | 37.60 | 2023-11-21 |
| LLaVA-1.5+CoS | Chain-of-Spot: Interactive Reasoning Improves Lar… | 37.60 | 2024-03-19 |
| LLaVA-COCO-13B | COCO is "ALL" You Need for Visual Instruction Fi… | 37.50 | 2024-01-17 |
| LLaVA-S^2 + DenseFusion-1M (Vicuna-7B) | DenseFusion-1M: Merging Vision Experts for Compre… | 37.50 | 2024-07-11 |
| MM1.5-1B | MM1.5: Methods, Analysis & Insights from Multimod… | 37.40 | 2024-09-30 |
| Dynamic-LLaVA-13B | Dynamic-LLaVA: Efficient Multimodal Large Languag… | 37.30 | 2024-12-01 |
| SeVa-7B | Self-Supervised Visual Preference Alignment | 37.20 | 2024-04-16 |
| SoM-LLaVA-1.5-T | List Items One by One: A New Data Source and Lear… | 37.20 | 2024-04-25 |
| Emu3 | Emu3: Next-Token Prediction is All You Need | 37.20 | 2024-09-27 |
| LLaVA-Instruct (Vicuna-1.5-13B) | MM-Instruct: Generated Visual Instructions for La… | 37.10 | 2024-06-28 |
| ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and … | 37.00 | 2024-12-09 |
| LLaVA1.5-7B-BPO | Strengthening Multimodal Large Language Model wit… | 36.80 | 2024-03-13 |
| LLaVA-1.5-13B (+ MMFuser) | MMFuser: Multimodal Multi-Layer Feature Fuser for… | 36.60 | 2024-10-15 |
| CaMML-13B | CaMML: Context-Aware Multimodal Learner for Large… | 36.40 | 2024-01-06 |
| LLaVA-65B (Data Mixing) | An Empirical Study of Scaling Instruct-Tuned Larg… | 36.40 | 2023-09-18 |
| Vary-base | Vary: Scaling up the Vision Vocabulary for Large … | 36.20 | 2023-12-11 |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning w… | 36.10 | 2023-08-20 |
| DreamLLM-7B | DreamLLM: Synergistic Multimodal Comprehension an… | 35.90 | 2023-09-20 |
| MoE-LLaVA-2.7B×4-Top2 | MoE-LLaVA: Mixture of Experts for Large Vision-La… | 35.90 | 2024-01-29 |
| SoM-LLaVA-1.5 | List Items One by One: A New Data Source and Lear… | 35.90 | 2024-04-25 |
| Dragonfly (Llama3-8B) | Dragonfly: Multi-Resolution Zoom-In Encoding Enha… | 35.90 | 2024-06-03 |
| Ferret-v2-13B | Ferret-v2: An Improved Baseline for Referring and… | 35.70 | 2024-04-11 |
| AlignGPT (Vicuna-13B) | AlignGPT: Multi-modal Large Language Models with … | 35.60 | 2024-05-23 |
| LLaVA-HR-X | Feast Your Eyes: Mixture-of-Resolution Adaptation… | 35.50 | 2024-03-05 |
| SQ-LLaVA | SQ-LLaVA: Self-Questioning for Large Vision-Langu… | 35.50 | 2024-03-17 |
| LOVA$^3$ | LOVA3: Learning to Visual Question Answering, Ask… | 35.20 | 2024-05-23 |
| LLaVA-InternLM2-7B-ViT + MoSLoRA | Mixture-of-Subspaces in Low-Rank Adaptation | 35.20 | 2024-06-16 |
| InternLM2+ViT (QMoSLoRA) | Mixture-of-Subspaces in Low-Rank Adaptation | 35.20 | 2024-06-16 |
| LLaVA1.5-7B-MDA | Looking Beyond Text: Reducing Language bias in La… | 35.20 | 2024-11-21 |
| Mipha-3B+ | Rethinking Visual Prompting for Multimodal Large … | 35.10 | 2024-07-05 |
| Merlin | Merlin: Empowering Multimodal LLMs with Foresight … | 34.90 | 2023-11-30 |
| Arcana | Improving Multi-modal Large Language Model throug… | 34.80 | 2024-10-17 |
| INF-LLaVA | INF-LLaVA: Dual-perspective Perception for High-R… | 34.50 | 2024-07-23 |
| LLaVA-1.5+MMInstruct (Vicuna-7B) | MMInstruct: A High-Quality Multi-Modal Instructio… | 34.40 | 2024-07-22 |
| Janus | Janus: Decoupling Visual Encoding for Unified Mul… | 34.30 | 2024-10-17 |
| LLaVA-TokenPacker (Vicuna-13B) | TokenPacker: Efficient Visual Projector for Multi… | 34.10 | 2024-07-02 |
| γ-MoD-LLaVA-HR | $γ-$MoD: Exploring Mixture-of-Depth Adaptation fo… | 34.00 | 2024-10-17 |
| LLaVA-1.5-7B (CSR) | Calibrated Self-Rewarding Vision Language Models | 33.90 | 2024-05-23 |
| DynMOE-LLaVA | Dynamic Mixture of Experts: An Auto-Tuning Approa… | 33.60 | 2024-05-23 |
| Imp-2B | Imp: Highly Capable Large Multimodal Models for M… | 33.50 | 2024-05-20 |
| InfMLLM-7B-Chat | InfMLLM: A Unified Framework for Visual-Language … | 33.40 | 2023-11-12 |
| Video-LaVIT | Video-LaVIT: Unified Video-Language Pre-training … | 33.20 | 2024-02-05 |
| LLaVA-Instruct (Vicuna-1.5-7B) | MM-Instruct: Generated Visual Instructions for La… | 32.90 | 2024-06-28 |
| VisionZip (Retain 128 Tokens, fine-tuning) | VisionZip: Longer is Better but Not Necessary in … | 32.90 | 2024-12-05 |
| Uni-MoE | Uni-MoE: Scaling Unified Multimodal LLMs with Mix… | 32.80 | 2024-05-18 |
| VL-Mamba (Mamba LLM-2.8B) | VL-Mamba: Exploring State Space Models for Multim… | 32.60 | 2024-03-20 |
| LLaVA-v1.5 (7B, w/ STIC) | Enhancing Large Vision Language Models with Self-… | 32.60 | 2024-05-30 |
| VisionZip (Retain 192 Tokens, fine-tuning) | VisionZip: Longer is Better but Not Necessary in … | 32.60 | 2024-12-05 |
| VisionZip (Retain 128 Tokens) | VisionZip: Longer is Better but Not Necessary in … | 32.60 | 2024-12-05 |
| LLaVA-v1.5 (+MoCa) | Deciphering Cross-Modal Alignment in Large Vision… | 32.20 | 2024-10-09 |
| Dynamic-LLaVA-7B | Dynamic-LLaVA: Efficient Multimodal Large Languag… | 32.20 | 2024-12-01 |
| Mipha-3B | Mipha: A Comprehensive Overhaul of Multimodal Ass… | 32.10 | 2024-03-10 |
| VOLCANO 7B | Volcano: Mitigating Multimodal Hallucination thro… | 32.00 | 2023-11-13 |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representatio… | 32.00 | 2023-11-16 |
| TinyLLaVA-share-Sig-Ph | TinyLLaVA: A Framework of Small-scale Large Multi… | 32.00 | 2024-02-22 |
| LLaVA-VT (Vicuna-7B) | Beyond Embeddings: The Promise of Visual Table in… | 31.80 | 2024-03-27 |
| VisionZip (Retain 192 Tokens) | VisionZip: Longer is Better but Not Necessary in … | 31.70 | 2024-12-05 |
| VisionZip (Retain 64 Tokens) | VisionZip: Longer is Better but Not Necessary in … | 31.70 | 2024-12-05 |
| LLaVA-1.5-7B (+ SIMA) | Enhancing Visual-Language Modality Alignment in L… | 31.60 | 2024-05-24 |
| MiCo-Chat-7B | Explore the Limits of Omni-modal Pretraining at S… | 31.40 | 2024-06-13 |
| LLaVA-1.5-7B + TeamLoRA | TeamLoRA: Boosting Low-Rank Adaptation with Exper… | 31.20 | 2024-08-19 |
| RoboCodeX-13B | RoboCodeX: Multimodal Code Generation for Robotic… | 31.00 | 2024-02-25 |
| HyperLLaVA | HyperLLaVA: Dynamic Visual and Language Expert Tu… | 31.00 | 2024-03-20 |
| FAST (Vicuna-7B) | Visual Agents as Fast and Slow Thinkers | 31.00 | 2024-08-16 |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectifi… | 30.90 | 2024-11-12 |
| AlignGPT (Vicuna-7B) | AlignGPT: Multi-modal Large Language Models with … | 30.80 | 2024-05-23 |
| LLaVolta | Efficient Large Multi-modal Models via Visual Con… | 30.70 | 2024-06-28 |
| LLaVA-AlignedVQ | Aligned Vector Quantization for Edge-Cloud Collab… | 30.70 | 2024-11-08 |
| LLaVA-1.5-HACL | Hallucination Augmented Contrastive Learning for … | 30.40 | 2023-12-12 |
| MaVEn | MaVEn: An Effective Multi-granularity Hybrid Visu… | 30.40 | 2024-08-22 |
| VisionZip (Retain 64 Tokens, fine-tuning) | VisionZip: Longer is Better but Not Necessary in … | 30.20 | 2024-12-05 |
| H2OVL-Mississippi-0.8B | H2OVL-Mississippi Vision Language Models Technica… | 30.00 | 2024-10-17 |
| RoboMamba | RoboMamba: Efficient Vision-Language-Action Model… | 29.70 | 2024-06-06 |
| LLaVA-TokenPacker (Vicuna-7B) | TokenPacker: Efficient Visual Projector for Multi… | 29.60 | 2024-07-02 |
| OneLLM-7B | OneLLM: One Framework to Align All Modalities wit… | 29.10 | 2023-12-06 |
| LLaVA-OneVision-0.5B | LLaVA-OneVision: Easy Visual Task Transfer | 29.10 | 2024-08-06 |
| Vary-toy | Small Language Model Meets with Reinforced Vision… | 29.00 | 2024-01-23 |
| LLaVA-Phi | LLaVA-Phi: Efficient Multi-Modal Assistant with S… | 28.90 | 2024-01-04 |
| MMAR-7B | MMAR: Towards Lossless Multi-Modal Auto-Regressiv… | 27.80 | 2024-10-14 |
| SEAL (7B) | V*: Guided Visual Search as a Core Mechanism in M… | 27.70 | 2023-12-21 |
| OtterHD-8B | OtterHD: A High-Resolution Multi-modality Model | 26.30 | 2023-11-07 |
| TGA-7B | Cross-Modal Safety Mechanism Transfer in Large Vi… | 25.60 | 2024-10-16 |
| LinVT | LinVT: Empower Your Image-level Large Language Mo… | 23.50 | 2024-12-06 |
| Xmodel-VLM (Xmodel-LM 1.1B) | Xmodel-VLM: A Simple Baseline for Multimodal Visi… | 21.80 | 2024-05-15 |
| TextBind | TextBind: Multi-turn Interleaved Multimodal Instr… | 19.40 | 2023-09-14 |
| MMAR-0.5B | MMAR: Towards Lossless Multi-Modal Auto-Regressiv… | 18.49 | 2024-10-14 |
| Gemini 1.5 Pro (gemini-1.5-pro) | Gemini 1.5: Unlocking multimodal understanding ac… | | 2024-03-08 |
| Gemini 1.5 Pro (gemini-1.5-pro-002) | Gemini 1.5: Unlocking multimodal understanding ac… | | 2024-03-08 |
| GPT-4o (gpt-4o-2024-05-13) | GPT-4 Technical Report | | 2023-03-15 |
| gpt-4o-mini-2024-07-18 | GPT-4 Technical Report | | 2023-03-15 |
| GPT-4V | GPT-4 Technical Report | | 2023-03-15 |
| GPT-4V-Turbo-detail:high | GPT-4 Technical Report | | 2023-03-15 |
| Qwen-VL-Max | Qwen-VL: A Versatile Vision-Language Model for Un… | | 2023-08-24 |
| Gemini 1.0 Pro Vision (gemini-pro-vision) | Gemini: A Family of Highly Capable Multimodal Mod… | | 2023-12-19 |
| Qwen-VL-Plus | Qwen-VL: A Versatile Vision-Language Model for Un… | | 2023-08-24 |
| GPT-4V-Turbo-detail:low | GPT-4 Technical Report | | 2023-03-15 |
| MM-ReAct-GPT-4 | MM-REACT: Prompting ChatGPT for Multimodal Reason… | | 2023-03-20 |
| LLaVA-1.5-13B | Improved Baselines with Visual Instruction Tuning | | 2023-10-05 |
| Emu-14B | Emu: Generative Pretraining in Multimodality | | 2023-07-11 |
| mPLUG-Owl2 | mPLUG-Owl2: Revolutionizing Multi-modal Large Lan… | | 2023-11-07 |
| LLaVA-Plus-13B (All Tools, V1.3, 336px) | LLaVA-Plus: Learning to Use Tools for Creating Mu… | | 2023-11-09 |
| LRV-Instruction-7B | Mitigating Hallucination in Large Multi-Modal Mod… | | 2023-06-26 |
| LLaMA-Adapter v2-7B | LLaMA-Adapter V2: Parameter-Efficient Visual Inst… | | 2023-04-28 |
| LLaVA-1.5-7B | Improved Baselines with Visual Instruction Tuning | | 2023-10-05 |
| MM-ReAct-GPT-3.5 | MM-REACT: Prompting ChatGPT for Multimodal Reason… | | 2023-03-20 |
| LLaVA-Plus-7B (All Tools) | LLaVA-Plus: Learning to Use Tools for Creating Mu… | | 2023-11-09 |
| OpenFlamingo-9B (MPT-7B) | OpenFlamingo: An Open-Source Framework for Traini… | | 2023-08-02 |
| Otter-9B (MPT-7B) | MIMIC-IT: Multi-Modal In-Context Instruction Tuni… | | 2023-06-08 |
| Otter-9B (LLaMA) | MIMIC-IT: Multi-Modal In-Context Instruction Tuni… | | 2023-06-08 |
| MiniGPT-4-14B | MiniGPT-4: Enhancing Vision-Language Understandin… | | 2023-04-20 |
| BLIP-2-12B | BLIP-2: Bootstrapping Language-Image Pre-training… | | 2023-01-30 |
| MiniGPT-4-8B | MiniGPT-4: Enhancing Vision-Language Understandin… | | 2023-04-20 |
| OpenFlamingo-9B (LLaMA-7B) | OpenFlamingo: An Open-Source Framework for Traini… | | 2023-08-02 |