Domain
computer vision, natural language processing
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models.
Mini-Gemini introduces a framework enhancing multi-modality Vision Language Models (VLMs) by addressing the performance gap relative to advanced models like GPT-4. The approach mines the potential of VLMs from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. The model employs a dual-encoder system for efficient high-resolution image processing and aims to enhance visual comprehension and reasoning within an any-to-any workflow paradigm. Empirical results demonstrate leading performance across several zero-shot benchmarks, outperforming previous models on complex tasks.
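To make the dual-encoder idea concrete, the sketch below illustrates how high-resolution refinement can be applied without increasing the visual token count: each low-resolution visual token acts as a query over the high-resolution features covering the same image region, so the number of tokens passed to the LLM stays fixed. This is a minimal PyTorch sketch; the module name, tensor shapes, and the per-token candidate grouping are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of dual-encoder, patch-info-mining-style refinement.
# Shapes, names, and the 2x2 candidate grouping are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchInfoMining(nn.Module):
    """Refine N low-resolution visual tokens with high-resolution features
    without increasing the token count: each low-res token (query) attends
    only to the high-res candidates covering the same spatial region."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projection of queries
        self.k_proj = nn.Linear(dim, dim)   # projection of keys
        self.v_proj = nn.Linear(dim, dim)   # projection of values
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        # x_low:  (B, N, C)      tokens from the low-resolution encoder
        # x_high: (B, N, M2, C)  M2 high-res candidates per low-res token
        q = self.q_proj(x_low).unsqueeze(2)          # (B, N, 1, C)
        k = self.k_proj(x_high)                      # (B, N, M2, C)
        v = self.v_proj(x_high)                      # (B, N, M2, C)
        attn = F.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        mined = (attn @ v).squeeze(2)                # (B, N, C), token count unchanged
        return self.mlp(x_low + mined)               # residual refinement


# Example: 576 low-res tokens, each paired with a 2x2 grid of high-res candidates.
tokens = PatchInfoMining(dim=1024)(torch.randn(1, 576, 1024), torch.randn(1, 576, 4, 1024))
print(tokens.shape)  # torch.Size([1, 576, 1024])
```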
This paper employs the following methods:
- VLM-guided generation
- Patch info mining
- Dual Vision Encoders
- Any-to-any paradigm
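The VLM-guided generation and any-to-any paradigm listed above can be viewed as a routing step: the VLM always replies in text, and when the reply carries a generation prompt, that prompt conditions a text-to-image model (the paper pairs the VLM with SDXL). The `<gen>` marker, function name, and wiring below are illustrative assumptions, not the authors' exact interface.

```python
# Hedged sketch of the any-to-any routing idea: plain text is passed through,
# while a tagged prompt triggers reasoning-based image generation via SDXL.
import re
from diffusers import StableDiffusionXLPipeline

GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)

def route_vlm_output(vlm_reply: str, sdxl: StableDiffusionXLPipeline):
    """Return (text, image); image is None for understanding-only turns."""
    match = GEN_TAG.search(vlm_reply)
    if match is None:
        return vlm_reply, None                  # pure understanding / reasoning turn
    prompt = match.group(1).strip()             # VLM-written generation prompt
    image = sdxl(prompt=prompt).images[0]       # delegate pixel synthesis to SDXL
    return GEN_TAG.sub("", vlm_reply).strip(), image

# Usage (hypothetical VLM reply):
# sdxl = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
# text, image = route_vlm_output("Here it is. <gen>a watercolor fox in the snow</gen>", sdxl)
```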
The following models are referenced in this paper:
- Mini-Gemini
- GPT-4
- Gemini
- LLaVA-Next
- Otter-HD
- MobileVLM
- InstructBLIP
- Qwen-VL
- Qwen-VL-Chat
- IDEFICS-80B
- LLaMA-VID
- LLaVA-1.5
- Mixtral-8x7B
- Hermes-2-Yi-34B
The following datasets were used in this research:
- LLaVA-filtered CC3M
- ALLaVA
- TextCaps
- DocVQA
- ChartQA
- DVQA
- AI2D
- LAION-GPT-4V
The model is evaluated on the following benchmarks:
- TextVQA (VQA^T)
- MMB
- MME
- MM-Vet
- MMMU
- MathVista
The primary evaluation metric is:
- Accuracy
The paper reports the following results:
- Achieves leading performance in several zero-shot benchmarks
- Surpasses models such as Gemini Pro, Qwen-VL-Plus, and GPT-4V on the complex MMB and MMMU benchmarks
The authors identified the following limitations:
- Counting and complex visual reasoning abilities need improvement
- Insufficient data during the pretraining stage affecting model performance
- Embedding-based approaches have shown no significant gains for reasoning-based generation
The experiments used the following compute resources:
- Number of GPUs: 8
- GPU Type: A800
Keywords
visual language models, multi-modality, image understanding, reasoning, generation