
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao (SenseTime Research; SKLSDE, Beihang University; Qing Yuan Research Institute and SEIEE, Shanghai Jiao Tong University), 2023

Paper Information
arXiv ID
2306.15195
Venue
arXiv.org
Domain
Artificial Intelligence, Computer Vision, Natural Language Processing
Code
https://github.com/shikras/shikra
Reproducibility
8/10

Abstract

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, such as providing the coordinates of mentioned objects in chains of thought and comparing the similarity of user-pointed regions. Our code and model are available at https://github.com/shikras/shikra.
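Because Shikra expresses coordinates directly as text, a box is simply a bracketed number string embedded in the prompt or the answer, with no special tokens or detection heads. The sketch below illustrates this idea under stated assumptions: normalized (x1, y1, x2, y2) coordinates, a fixed decimal precision, and the helper names `box_to_text` / `text_to_boxes` are all illustrative choices, not the paper's exact serialization format.

```python
import re

def box_to_text(box, precision=3):
    """Serialize a bounding box as plain text, e.g. "[0.120,0.340,0.560,0.780]".

    Assumes coordinates are already normalized to [0, 1] in (x1, y1, x2, y2)
    order; the bracket style and precision are illustrative assumptions.
    """
    return "[" + ",".join(f"{v:.{precision}f}" for v in box) + "]"

def text_to_boxes(text):
    """Recover every bracketed coordinate group mentioned in a model answer."""
    pattern = r"\[([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)\]"
    return [tuple(float(v) for v in match) for match in re.findall(pattern, text)]

# A referential-dialogue style exchange: the box is ordinary text inside both
# the question and the answer, so the LLM needs no extra vocabulary.
question = f"What is the person {box_to_text((0.32, 0.10, 0.55, 0.71))} holding?"
answer = "The person [0.320,0.100,0.550,0.710] is holding an umbrella [0.300,0.050,0.620,0.300]."
print(text_to_boxes(answer))
```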

Summary

The paper introduces Shikra, a Multimodal Large Language Model (MLLM) designed to enhance referential dialogue capabilities in human-AI interactions. Unlike existing MLLMs, Shikra accepts and produces spatial coordinates directly as natural-language text, enabling users to reference specific regions in images during conversations. The architecture is simple, consisting of a vision encoder, an alignment layer, and a language model, without the need for extra vocabularies or pre-/post-detection modules. Shikra performs well on various vision-language tasks such as Referring Expression Comprehension (REC), PointQA, Visual Question Answering (VQA), and image captioning, demonstrating promising performance without task-specific fine-tuning. The paper highlights multiple applications for Shikra, including its potential use in mixed-reality environments and its role in facilitating interactions with visual robots. The authors also acknowledge limitations of the current model, such as language support restricted to English and challenges in dense object detection tasks.
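To make the three-part architecture concrete, here is a minimal sketch of the forward pass: patch features from a vision encoder are projected by a single linear alignment layer and prepended to the text embeddings before the language model. The class name, hidden sizes, and the Hugging-Face-style `inputs_embeds` call are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ShikraLikeModel(nn.Module):
    """Sketch of a vision encoder + alignment layer + LLM pipeline (assumed details)."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a CLIP-style ViT
        self.align = nn.Linear(vision_dim, llm_dim)   # alignment layer
        self.llm = llm                                # decoder-only language model

    def forward(self, pixel_values, input_embeds, attention_mask):
        # 1. Encode the image into patch features: (B, N, vision_dim).
        patch_feats = self.vision_encoder(pixel_values)
        # 2. Project patch features into the LLM embedding space: (B, N, llm_dim).
        visual_tokens = self.align(patch_feats)
        # 3. Prepend visual tokens to the text embeddings and run the LLM;
        #    coordinates inside the text are handled as ordinary tokens.
        inputs = torch.cat([visual_tokens, input_embeds], dim=1)
        visual_mask = torch.ones(visual_tokens.shape[:2],
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs, attention_mask=mask)
```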

Methods

This paper employs the following methods:

  • Multimodal Large Language Model
  • Referential Dialogue

Models Used

  • Shikra-7B
  • Shikra-13B
  • LLaVA-13B

Datasets

The following datasets were used in this research:

  • Flickr30K Entities
  • LVIS
  • Visual Genome
  • RefCOCO
  • Visual7W

Evaluation Metrics

  • Accuracy (see the REC accuracy sketch after this list)
  • Precision
  • Recall
  • F1-score
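REC results are conventionally reported as accuracy at an IoU threshold: a predicted box counts as correct when it overlaps the ground-truth box with IoU ≥ 0.5. The sketch below computes that metric, assuming this standard protocol applies here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes matching the ground truth at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: one correct and one incorrect prediction -> accuracy 0.5
preds = [(0.10, 0.10, 0.50, 0.50), (0.60, 0.60, 0.90, 0.90)]
gts   = [(0.12, 0.08, 0.52, 0.48), (0.10, 0.10, 0.40, 0.40)]
print(rec_accuracy(preds, gts))  # 0.5
```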

Results

  • Shikra demonstrates promising performance on REC, PointQA, VQA, and image captioning tasks without task-specific fine-tuning.
  • Shikra achieves state-of-the-art performance in various settings.

Limitations

The authors identified the following limitations:

  • Shikra only supports English and may not be user-friendly for non-English speakers.
  • Shikra is unsuitable for dense object detection and segmentation tasks.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Multimodal Large Language Models, referential dialogue, spatial coordinate inputs and outputs, vision-language tasks

Papers Using Similar Methods

External Resources