Keqin Chen ([email protected]), Zhao Zhang ([email protected]), Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao
SenseTime Research; SKLSDE, Beihang University; Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University (2023)
The paper introduces Shikra, a Multimodal Large Language Model (MLLM) designed to enable referential dialogue in human-AI interactions. Unlike existing MLLMs, Shikra handles spatial coordinates expressed directly in natural language, so users can refer to specific regions of an image during a conversation. The architecture is simple, consisting of a vision encoder, an alignment layer, and a language model, with no extra vocabularies or pre-/post-detection modules. Shikra achieves promising performance on a range of vision-language tasks, including Referring Expression Comprehension (REC), PointQA, Visual Question Answering (VQA), and image captioning, without task-specific fine-tuning. The paper also highlights potential applications, such as use in mixed-reality environments and interaction with visual robots. The authors acknowledge limitations of the current model, including language support restricted to English and difficulty with dense object detection tasks.
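Because the central idea is that coordinates are ordinary text rather than special tokens or detection outputs, the following minimal Python sketch illustrates one plausible way such a format could be serialized and parsed. The normalized [x_min, y_min, x_max, y_max] notation, the helper names, and the example prompt wording are illustrative assumptions, not the paper's exact specification.

```python
import re

def box_to_text(box, precision=3):
    """Serialize a normalized bounding box (x_min, y_min, x_max, y_max)
    as a plain-text coordinate string the language model can read or emit.
    Illustrative format; the paper's exact notation may differ."""
    return "[" + ",".join(f"{v:.{precision}f}" for v in box) + "]"

def text_to_boxes(text):
    """Recover bounding boxes from model output by parsing coordinate
    strings back into float tuples -- no post-detection module needed."""
    pattern = r"\[([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)\]"
    return [tuple(float(v) for v in m) for m in re.findall(pattern, text)]

# Hypothetical referential-dialogue turn: the user refers to a region,
# and the model's textual answer itself contains grounded boxes.
user_box = box_to_text((0.412, 0.305, 0.687, 0.559))
prompt = f"What is the animal in this region {user_box} doing?"
model_output = "The dog [0.410,0.301,0.690,0.562] is catching a frisbee [0.455,0.120,0.540,0.198]."
print(prompt)
print(text_to_boxes(model_output))
```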
This paper employs the following methods:
- Referential dialogue via natural-language coordinates: spatial positions (e.g., bounding boxes) are written and read as plain text, so no extra vocabularies or pre-/post-detection modules are required.
- A simple architecture composed of a vision encoder, an alignment layer, and a language model (see the sketch after this list).
- Evaluation on conventional vision-language tasks such as REC, PointQA, VQA, and image captioning.
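The sketch below shows how the three described components could be wired together: visual features are projected through a single alignment layer into the language model's embedding space and simply prepended to the text tokens. All module choices, dimensions, and names here are toy stand-ins chosen for a runnable example, not the paper's actual components.

```python
import torch
import torch.nn as nn

class ToyShikraStyleModel(nn.Module):
    """Minimal sketch of the vision encoder + alignment layer + LLM design
    described above. Every size and submodule is an illustrative stand-in."""

    def __init__(self, vis_dim=1024, llm_dim=256, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT) that maps an
        # image to a sequence of patch features of width vis_dim.
        self.vision_encoder = nn.Identity()
        # Alignment layer: projects visual features to the LLM embedding width.
        self.alignment = nn.Linear(vis_dim, llm_dim)
        # Stand-ins for the LLM; a real system would use a pretrained
        # decoder-only transformer here.
        self.llm_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm_body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_token_ids):
        vis_feats = self.alignment(self.vision_encoder(image_patches))
        txt_embeds = self.llm_embed(text_token_ids)
        # Visual tokens are prepended to the text tokens; coordinates in the
        # text need no special handling because they are ordinary words.
        seq = torch.cat([vis_feats, txt_embeds], dim=1)
        return self.lm_head(self.llm_body(seq))

model = ToyShikraStyleModel()
logits = model(torch.randn(1, 16, 1024), torch.randint(0, 32000, (1, 32)))
print(logits.shape)  # (1, 16 + 32, vocab_size)
```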
The following datasets were used in this research:
The authors identified the following limitations:
- Language support is currently restricted to English.
- The model struggles with dense object detection tasks.