Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia (The Chinese University of Hong Kong, 2024)

Paper Information
  • arXiv ID: 2403.18814
  • Venue: arXiv.org
  • Domain: computer vision, natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 6/10

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models.

Summary

Mini-Gemini introduces a framework for enhancing multi-modality Vision Language Models (VLMs), addressing the performance gap relative to advanced models such as GPT-4 and Gemini. The approach mines the potential of VLMs along three axes: high-resolution visual tokens, high-quality data, and VLM-guided generation. The model employs a dual-encoder system for efficient high-resolution image processing and supports visual comprehension, reasoning, and generation within an any-to-any workflow paradigm. Empirical results demonstrate leading performance across several zero-shot benchmarks, outperforming previous models on complex tasks.
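
The dual-encoder design pairs a standard low-resolution visual encoder, whose tokens go to the LLM, with a high-resolution encoder whose dense features are "mined" into those tokens. Below is a minimal PyTorch sketch of this patch info mining step, assuming the low-resolution branch yields N visual tokens and the high-resolution features have already been grouped per corresponding patch region; the class, argument, and dimension names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    """Each low-resolution visual token attends only to the high-resolution
    features of its own patch region, so the token count fed to the LLM is
    unchanged (illustrative sketch, not the official implementation)."""

    def __init__(self, dim_low: int, dim_high: int, dim_out: int):
        super().__init__()
        self.q = nn.Linear(dim_low, dim_out)   # queries from low-res tokens
        self.k = nn.Linear(dim_high, dim_out)  # keys from high-res features
        self.v = nn.Linear(dim_high, dim_out)  # values from high-res features

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        # x_low:  (B, N, dim_low)      N tokens from the low-resolution encoder
        # x_high: (B, N, M, dim_high)  M high-res features per low-res patch
        q = self.q(x_low).unsqueeze(2)                        # (B, N, 1, d)
        k, v = self.k(x_high), self.v(x_high)                 # (B, N, M, d)
        attn = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
        mined = (attn.softmax(dim=-1) @ v).squeeze(2)         # (B, N, d)
        return mined  # token count stays N
```

The mined features then serve as the enhanced visual tokens handed to the LLM; a residual combination with the low-resolution tokens (an assumption here, not necessarily the paper's exact fusion) is one simple way to realize that enhancement.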

Methods

This paper employs the following methods:

  • VLM-guided generation (see the routing sketch after this list)
  • Patch info mining
  • Dual Vision Encoders
  • Any-to-any paradigm
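
For VLM-guided generation within the any-to-any paradigm, the language model either answers in text or emits a generation prompt that is handed to an off-the-shelf text-to-image diffusion model. The snippet below is a minimal sketch of how such routing could look; the `<gen>` tag, `route_output`, and `text_to_image` names are illustrative assumptions, not the released interface.

```python
import re
from typing import Callable

# Illustrative tag format; the actual prompt format used by the model may differ.
GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)


def route_output(vlm_text: str, text_to_image: Callable[[str], object]) -> dict:
    """Route a VLM response in an any-to-any workflow: plain text is returned
    as the answer, while an embedded generation prompt is forwarded to a
    text-to-image model (sketch only, not the official pipeline)."""
    match = GEN_TAG.search(vlm_text)
    if match is None:
        return {"type": "text", "content": vlm_text}
    prompt = match.group(1).strip()
    return {"type": "image", "prompt": prompt, "image": text_to_image(prompt)}
```

Because the VLM only needs to produce a text prompt rather than image embeddings, a capable text-to-image backend can be swapped in without retraining the language model, which matches the paper's observation that embedding-based approaches brought no clear gains for reasoning-based generation.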

Models Used

  • Mini-Gemini
  • GPT-4
  • Gemini
  • LLaVA-Next
  • Otter-HD
  • MobileVLM
  • InstructBLIP
  • Qwen-VL
  • Qwen-VL-Chat
  • IDEFICS-80B
  • LLaMA-VID
  • LLaVA-1.5
  • Mixtral-8x7B
  • Hermes-2-Yi-34B

Datasets

The following datasets were used in this research:

  • LLaVA-filtered CC3M
  • ALLaVA
  • TextCaps
  • DocVQA
  • ChartQA
  • DVQA
  • AI2D
  • LAION-GPT-4V

Evaluation Metrics

  • VQA^T (TextVQA)
  • MMB
  • MME
  • MM-Vet
  • MMMU
  • MathVista
  • Accuracy

Results

  • Achieves leading performance in several zero-shot benchmarks
  • Surpasses models such as Gemini Pro, Qwen-VL-Plus, and GPT-4V on the complex MMB and MMMU benchmarks

Limitations

The authors identified the following limitations:

  • Counting and complex visual reasoning abilities still need improvement
  • Insufficient data during the pretraining stage limits model performance
  • Embedding-based approaches showed no significant gains for reasoning-based generation

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A800

Keywords

visual language models, multi-modality, image understanding, reasoning, generation

External Resources