Xiaoyi Dong¹ ², Pan Zhang¹, Yuhang Zang¹, Yuhang Cao¹ ², Bin Wang¹, Linke Ouyang¹, Xilin Wei¹, Songyang Zhang, Haodong Duan¹, Maosong Cao, Wenwei Zhang¹, Yining Li¹, Hang Yan¹, Yang Gao¹, Xinyue Zhang¹, Wei Li¹, Jingwen Li¹, Kai Chen¹ ³, Conghui He¹, Xingcheng Zhang¹ ³, Yu Qiao¹, Dahua Lin¹ ², Jiaqi Wang¹ (2024)
¹ Shanghai Artificial Intelligence Laboratory  ² The Chinese University of Hong Kong  ³ SenseTime Group
This paper introduces InternLM-XComposer2, an advanced vision-language model for free-form text-image composition and comprehension. Building on its predecessor, InternLM-XComposer, it handles more complex composition requirements and demonstrates stronger vision-language understanding and composition capabilities. The key innovation is the Partial LoRA (PLoRA) method, which applies additional low-rank adaptation parameters exclusively to image tokens, enhancing visual processing while preserving the integrity of the pre-trained language model. Trained on diverse, high-quality datasets, the model exceeds existing open-source multimodal models across various benchmarks and matches advanced models such as GPT-4V and Gemini Pro. Extensive experiments confirm its progress in both multimodal understanding and creative content generation.
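The PLoRA idea described above can be illustrated with a minimal PyTorch-style sketch: the frozen base projection is applied to every token, while a low-rank update is added only at image-token positions, leaving pure-text behaviour untouched. The class name, rank value, and boolean image-mask convention below are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class PartialLoRALinear(nn.Module):
    """Sketch of a Partial LoRA (PLoRA) projection layer (hypothetical naming).

    The frozen base weight is applied to every token, while the low-rank
    update (lora_a, lora_b) is added only at image-token positions, so the
    language model's behaviour on pure-text input is left unchanged.
    """

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # keep the LLM weights frozen
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # low-rank update starts at zero

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features); image_mask: (batch, seq_len), True for image tokens
        out = self.base(x)
        lora_update = self.lora_b(self.lora_a(x))
        # add the low-rank update only where the token comes from the image
        return out + lora_update * image_mask.unsqueeze(-1).to(out.dtype)


# Example usage (illustrative shapes): first 4 positions are image tokens.
layer = PartialLoRALinear(in_features=4096, out_features=4096, rank=256)
x = torch.randn(2, 16, 4096)
image_mask = torch.zeros(2, 16, dtype=torch.bool)
image_mask[:, :4] = True
y = layer(x, image_mask)                          # (2, 16, 4096)
```

Because the new parameters are only ever activated on image tokens, text-only inputs pass through the frozen base weights exactly as in the original language model, which is how PLoRA preserves language capability while adapting to visual input.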
This paper employs the following methods: Partial LoRA (PLoRA).