← ML Research Wiki / 2401.16420

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models

Xiaoyi Dong (Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong), Pan Zhang (Shanghai Artificial Intelligence Laboratory), Yuhang Zang (Shanghai Artificial Intelligence Laboratory), Yuhang Cao (Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong), Bin Wang (Shanghai Artificial Intelligence Laboratory), Linke Ouyang (Shanghai Artificial Intelligence Laboratory), Xilin Wei (Shanghai Artificial Intelligence Laboratory), Songyang Zhang, Haodong Duan (Shanghai Artificial Intelligence Laboratory), Maosong Cao, Wenwei Zhang (Shanghai Artificial Intelligence Laboratory), Yining Li (Shanghai Artificial Intelligence Laboratory), Hang Yan (Shanghai Artificial Intelligence Laboratory), Yang Gao (Shanghai Artificial Intelligence Laboratory), Xinyue Zhang (Shanghai Artificial Intelligence Laboratory), Wei Li (Shanghai Artificial Intelligence Laboratory), Jingwen Li (Shanghai Artificial Intelligence Laboratory), Kai Chen (Shanghai Artificial Intelligence Laboratory; SenseTime Group), Conghui He (Shanghai Artificial Intelligence Laboratory), Xingcheng Zhang (Shanghai Artificial Intelligence Laboratory; SenseTime Group), Yu Qiao (Shanghai Artificial Intelligence Laboratory), Dahua Lin (Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong), Jiaqi Wang (Shanghai Artificial Intelligence Laboratory). 2024.

Paper Information
arXiv ID: 2401.16420
Venue: arXiv.org
Domain: artificial intelligence, computer vision, natural language processing
SOTA Claim: Yes

Abstract

Free-form Text-Image Composition
Free-form Comprehension

Chinese New Year, also known as the Spring Festival, is a major traditional festival in China. It is celebrated on the first day of the lunar calendar and usually falls between late January and mid-February. This year, it will be on February 1st. The Spring Festival has a history of more than 4,000 years, and during this time, many customs have emerged and been passed down through generations. These customs include pasting couplets, setting off firecrackers, paying respects to ancestors, eating dumplings, etc.

In addition to firecrackers, other noise-making devices such as dragon dancers and lion dancers are also used to create a festive atmosphere. These performances are often accompanied by drums, cymbals, and other musical instruments. While setting off firecrackers is still widely practiced today, some cities have banned the use of certain types of firecrackers due to safety concerns. In these areas, people may opt for electronic replacements or simply enjoy the sounds of drums and cymbals instead.

Paying Respects to Ancestors

The Spring Festival is also a time to honor one's ancestors. Families typically gather together to perform ancestral rituals, including burning incense, offering sacrifices, and praying for blessings. Many families will also visit their ancestors' gravesites to clean them up and pay their respects. Some will even stay overnight at the cemetery, bringing bedding and food with them. The act of honoring one's ancestors is a reminder of one's roots and history. It also helps to foster a sense of unity and belonging among family members.

Eating Dumplings

Dumplings are a popular food item during the Spring Festival. They are believed to symbolize wealth and prosperity, as they resemble ancient Chinese silver ingots.

Please write an article based on the given requirements with the title 'Introduction to Chinese New Year Customs,' using the provided materials:
1. Chinese New Year customs include pasting couplets, setting off firecrackers, and paying respects to ancestors.
2. New Year delicacies include eating dumplings and glutinous rice balls, expressing the symbolic meaning of family reunion.

Summary

This paper introduces InternLM-XComposer2, a vision-language model for free-form text-image composition and comprehension. It builds on its predecessor, InternLM-XComposer, and handles more complex composition requirements. The key innovation is Partial LoRA (PLoRA), which applies additional LoRA parameters only to image tokens so that the pre-trained language knowledge is preserved. Trained on diverse, high-quality data, the model outperforms existing multimodal models across various benchmarks and matches or surpasses GPT-4V and Gemini Pro in certain assessments. The experiments support its gains in both multimodal understanding and creative content generation.

Methods

This paper employs the following methods:

  • Partial LoRA
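The idea behind Partial LoRA can be illustrated with a minimal NumPy sketch. It assumes a single linear layer with frozen weights, a low-rank update, and a boolean mask marking which tokens are image tokens. The function name `partial_lora_linear`, the argument names, and the toy shapes are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def partial_lora_linear(x, is_image, W0, b0, A, B):
    """Illustrative Partial LoRA forward pass for one linear layer.

    x        : (seq_len, d_in) token features (image and text tokens mixed)
    is_image : (seq_len,) boolean mask, True for image tokens
    W0, b0   : frozen base weight (d_in, d_out) and bias (d_out,)
    A, B     : low-rank factors, shapes (d_in, r) and (r, d_out)
    """
    base = x @ W0 + b0            # frozen path, applied to every token
    delta = (x @ A) @ B           # low-rank LoRA update
    # Add the update only where the mask is True; text tokens keep the base output.
    return base + delta * is_image[:, None]

# Toy example (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
seq_len, d_in, d_out, r = 6, 8, 8, 2
x = rng.normal(size=(seq_len, d_in))
is_image = np.array([True, True, True, False, False, False])
W0 = rng.normal(size=(d_in, d_out))
b0 = rng.normal(size=d_out)
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))

y = partial_lora_linear(x, is_image, W0, b0, A, B)
frozen = x @ W0 + b0
```

In this sketch the text-token rows of `y` equal the frozen layer's output exactly, which is the sense in which the language model's behavior on text is left untouched; only the image-token rows receive the low-rank adjustment.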

Models Used

  • InternLM-XComposer2
  • InternLM2-7B
  • GPT-4V
  • Gemini Pro

Datasets

The following datasets were used in this research:

  • ShareGPT4V-PT
  • COCO
  • Nocaps
  • TextCaps
  • LAION400M
  • SBU
  • CC3M
  • WanJuan
  • Flickr
  • MMC-Instruction

Evaluation Metrics

  • MathVista
  • MMMU
  • AI2D
  • MME
  • MMBench
  • MMBench-Chinese
  • SEED-Bench (Image)
  • LLaVA-Bench (In-the-Wild)
  • Q-Bench
  • MM-Vet
  • HallusionBench
  • ChartQA
  • POPE
  • CreationBench

Results

  • InternLM-XComposer2 outperforms existing multimodal models on both composition and comprehension benchmarks.
  • It demonstrates superior quality in producing long-text multimodal content.
  • It matches or surpasses GPT-4V and Gemini Pro in certain assessments.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

vision-language models, text-image composition, multimodal understanding, large language models, benchmark evaluation