Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia (The Chinese University of Hong Kong, 2024)

Paper Information
  • arXiv ID: 2403.18814
  • Venue: arXiv.org
  • Domain: computer vision, natural language processing
  • SOTA Claim: Yes
  • Reproducibility: 6/10

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models.

Summary

Mini-Gemini introduces a framework for enhancing multi-modality Vision Language Models (VLMs), addressing the performance gap relative to advanced models such as GPT-4 and Gemini. The approach mines the potential of VLMs along three axes: high-resolution visual tokens, high-quality data, and VLM-guided generation. The model employs a dual-encoder system for efficient high-resolution image processing and supports visual comprehension, reasoning, and generation within an any-to-any workflow paradigm. Empirical results demonstrate leading performance across several zero-shot benchmarks, outperforming previous models on complex tasks.
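
The dual-encoder design pairs a standard low-resolution visual encoder, whose tokens go to the LLM, with a high-resolution encoder whose dense features are "mined" into those tokens. Below is a minimal PyTorch sketch of this patch info mining step, assuming the low-resolution branch yields N visual tokens and the high-resolution features have already been grouped per corresponding patch region; the class, argument, and dimension names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    """Each low-resolution visual token attends only to the high-resolution
    features of its own patch region, so the token count fed to the LLM is
    unchanged (illustrative sketch, not the official implementation)."""

    def __init__(self, dim_low: int, dim_high: int, dim_out: int):
        super().__init__()
        self.q = nn.Linear(dim_low, dim_out)   # queries from low-res tokens
        self.k = nn.Linear(dim_high, dim_out)  # keys from high-res features
        self.v = nn.Linear(dim_high, dim_out)  # values from high-res features

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        # x_low:  (B, N, dim_low)      N tokens from the low-resolution encoder
        # x_high: (B, N, M, dim_high)  M high-res features per low-res patch
        q = self.q(x_low).unsqueeze(2)                        # (B, N, 1, d)
        k, v = self.k(x_high), self.v(x_high)                 # (B, N, M, d)
        attn = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
        mined = (attn.softmax(dim=-1) @ v).squeeze(2)         # (B, N, d)
        return mined  # token count stays N
```

The mined features then serve as the enhanced visual tokens handed to the LLM; a residual combination with the low-resolution tokens (an assumption here, not necessarily the paper's exact fusion) is one simple way to realize that enhancement.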

Methods

This paper employs the following methods:

  • VLM-guided generation (see the routing sketch after this list)
  • Patch info mining
  • Dual Vision Encoders
  • Any-to-any paradigm
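
For VLM-guided generation within the any-to-any paradigm, the language model either answers in text or emits a generation prompt that is handed to an off-the-shelf text-to-image diffusion model. The snippet below is a minimal sketch of how such routing could look; the `<gen>` tag, `route_output`, and `text_to_image` names are illustrative assumptions, not the released interface.

```python
import re
from typing import Callable

# Illustrative tag format; the actual prompt format used by the model may differ.
GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)


def route_output(vlm_text: str, text_to_image: Callable[[str], object]) -> dict:
    """Route a VLM response in an any-to-any workflow: plain text is returned
    as the answer, while an embedded generation prompt is forwarded to a
    text-to-image model (sketch only, not the official pipeline)."""
    match = GEN_TAG.search(vlm_text)
    if match is None:
        return {"type": "text", "content": vlm_text}
    prompt = match.group(1).strip()
    return {"type": "image", "prompt": prompt, "image": text_to_image(prompt)}
```

Because the VLM only needs to produce a text prompt rather than image embeddings, a capable text-to-image backend can be swapped in without retraining the language model, which matches the paper's observation that embedding-based approaches brought no clear gains for reasoning-based generation.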

Models Used

  • Mini-Gemini
  • GPT-4
  • Gemini
  • LLaVA-Next
  • Otter-HD
  • MobileVLM
  • InstructBLIP
  • Qwen-VL
  • Qwen-VL-Chat
  • IDEFICS-80B
  • LLaMA-VID
  • LLaVA-1.5
  • Mixtral-8x7B
  • Hermes-2-Yi-34B

Datasets

The following datasets were used in this research:

  • LLaVA-filtered CC3M
  • ALLaVA
  • TextCaps
  • DocVQA
  • ChartQA
  • DVQA
  • AI2D
  • LAION-GPT-4V

Evaluation Metrics

  • VQA^T (TextVQA)
  • MMB
  • MME
  • MM-Vet
  • MMMU
  • MathVista
  • Accuracy

Results

  • Achieves leading performance in several zero-shot benchmarks
  • Surpasses models such as Gemini Pro, Qwen-VL-Plus, and GPT-4V on the complex MMB and MMMU benchmarks

Limitations

The authors identified the following limitations:

  • Counting and complex visual reasoning abilities still need improvement
  • Insufficient data during the pretraining stage limits model performance
  • Embedding-based approaches showed no significant gains for reasoning-based generation

Technical Requirements

  • Number of GPUs: 8
  • GPU Type: A800

Keywords

visual language models, multi-modality, image understanding, reasoning, generation

External Resources