mPLUG-Owl : Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou — DAMO Academy, Alibaba Group (2023)

Paper Information
arXiv ID
2304.14178
Venue
arXiv.org
Domain
artificial intelligence, natural language processing, computer vision
SOTA Claim
Yes
Code
https://github.com/X-PLUG/mPLUG-Owl
Reproducibility
8/10

Abstract

Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining, and even improving, the generation abilities of the LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module while freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set, OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which make it possible to leverage the model for harder real-world scenarios such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
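
The modular design described above can be pictured as a vision encoder (e.g., ViT-L/14) feeding a small visual abstractor, whose compact output is consumed by the LLM as soft visual tokens alongside the text embeddings. Below is a minimal PyTorch-style sketch of that forward path; the query-based abstractor design, the dimensions, and names such as `VisualAbstractor` and `num_queries` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Illustrative query-based abstractor: a fixed set of learnable queries
    cross-attends to image patch features and yields a short sequence of
    visual tokens projected into the LLM embedding space."""
    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):                       # (B, N_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        summarized, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.proj(summarized)                       # (B, num_queries, llm_dim)

def multimodal_forward(vision_encoder, abstractor, llm, pixel_values, input_ids):
    """Prepend abstracted visual tokens to the text embeddings and run the LLM."""
    patch_feats = vision_encoder(pixel_values)              # e.g. ViT-L/14 patch features
    visual_tokens = abstractor(patch_feats)
    text_embeds = llm.get_input_embeddings()(input_ids)     # assumes a HF-style causal LM
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)
```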

Summary

This paper presents mPLUG-Owl, a novel training paradigm that enhances the multimodal capabilities of large language models (LLMs) through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. The two-stage training method aligns image and text while maintaining, and even improving, the generation abilities of the LLM, effectively supporting multiple modalities and enabling diverse unimodal and multimodal functionalities. The work also introduces OwlEval, an evaluation set for visually related instruction tasks. Experimental results show that mPLUG-Owl outperforms existing multimodal models in instruction understanding, knowledge reasoning, and multi-turn conversation, and exhibits unexpected abilities such as multi-image correlation and scene text understanding. The paper concludes by highlighting practical applications and directions for further refinement of multimodal generation.
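
The two-stage recipe amounts to controlling which parameters receive gradients in each stage: in stage 1 the LLM is frozen while the visual knowledge module and abstractor are trained; in stage 2 the vision encoder is frozen while the abstractor and a LoRA adapter on the LLM are tuned. The sketch below illustrates this schedule with Hugging Face PEFT; the LoRA rank, target modules, and the helper name `configure_stage` are assumptions for illustration, not values from the paper.

```python
from peft import LoraConfig, get_peft_model

def configure_stage(vision_encoder, abstractor, llm, stage):
    """Toggle trainable parameters to match the two-stage training paradigm."""
    if stage == 1:
        # Stage 1: image-text alignment. Train the visual knowledge module
        # (vision encoder) and the visual abstractor; keep the LLM frozen.
        for p in vision_encoder.parameters():
            p.requires_grad = True
        for p in abstractor.parameters():
            p.requires_grad = True
        for p in llm.parameters():
            p.requires_grad = False
        return llm
    # Stage 2: joint instruction tuning. Freeze the vision encoder, keep the
    # abstractor trainable, and train a LoRA adapter on the LLM.
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in abstractor.parameters():
        p.requires_grad = True
    lora_cfg = LoraConfig(r=8, lora_alpha=32,
                          target_modules=["q_proj", "v_proj"],  # assumed targets
                          task_type="CAUSAL_LM")
    return get_peft_model(llm, lora_cfg)  # only the LoRA weights remain trainable in the LLM
```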

Methods

This paper employs the following methods:

  • Modularization
  • Joint Instruction Tuning (see the sketch after this list)
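
Joint instruction tuning, as used here, mixes language-only and multi-modal supervised examples into one fine-tuning stream so that text-only abilities are preserved while visual instruction following is learned. The following is a minimal sketch of such a mixed sampler; the function name, batch size, and sampling ratio are placeholders, not details from the paper.

```python
import random

def mixed_instruction_batches(text_only_data, multimodal_data,
                              batch_size=8, p_multimodal=0.5):
    """Yield batches drawn from either the language-only or the multi-modal
    instruction pool, so both kinds of supervision appear during fine-tuning."""
    while True:
        pool = multimodal_data if random.random() < p_multimodal else text_only_data
        yield random.sample(pool, k=min(batch_size, len(pool)))
```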

Models Used

  • GPT-3
  • BLOOM
  • LLaMA
  • GPT-4
  • BLIP-2
  • LLaVA
  • MiniGPT-4
  • ViT-L/14
  • LLaMA-7B

Datasets

The following datasets were used in this research:

  • LAION-400M
  • COYO-700M
  • Conceptual Captions
  • MSCOCO
  • LLaVA (visual instruction-tuning data)

Evaluation Metrics

  • Instruction Understanding
  • Visual Understanding
  • Optical Character Recognition
  • Knowledge Transfer Ability
  • Reasoning Ability
  • Multi-turn Dialogue Ability

Results

  • Outperforms existing multimodal models (e.g., MiniGPT-4 and LLaVA) on the OwlEval evaluation set
  • Improved instruction-following and visual understanding ability
  • Enhanced multi-turn conversation and knowledge reasoning abilities

Limitations

The authors identified the following limitations:

  • Lack of multilingual training
  • Limited performance in complex scene OCR

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large language models, multimodal learning, modular training paradigm, visual knowledge module, visual abstractor

Papers Using Similar Methods

External Resources

  • Code: https://github.com/X-PLUG/mPLUG-Owl
  • Online demo: https://www.modelscope.cn/studios/damo/mPLUG-Owl