
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao (SenseTime Research; SKLSDE, Beihang University; Qing Yuan Research Institute and SEIEE, Shanghai Jiao Tong University), 2023

Paper Information
arXiv ID
2306.15195
Venue
arXiv.org
Domain
Artificial Intelligence, Computer Vision, Natural Language Processing
Code
https://github.com/shikras/shikra
Reproducibility
8/10

Abstract

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, such as providing the coordinates of mentioned objects in chains of thought and comparing the similarity of user-pointed regions. Our code and model are available at https://github.com/shikras/shikra.
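Because Shikra expresses coordinates directly as text, a box is simply a bracketed number string embedded in the prompt or the answer, with no special tokens or detection heads. The sketch below illustrates this idea under stated assumptions: normalized (x1, y1, x2, y2) coordinates, a fixed decimal precision, and the helper names `box_to_text` / `text_to_boxes` are all illustrative choices, not the paper's exact serialization format.

```python
import re

def box_to_text(box, precision=3):
    """Serialize a bounding box as plain text, e.g. "[0.120,0.340,0.560,0.780]".

    Assumes coordinates are already normalized to [0, 1] in (x1, y1, x2, y2)
    order; the bracket style and precision are illustrative assumptions.
    """
    return "[" + ",".join(f"{v:.{precision}f}" for v in box) + "]"

def text_to_boxes(text):
    """Recover every bracketed coordinate group mentioned in a model answer."""
    pattern = r"\[([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)\]"
    return [tuple(float(v) for v in match) for match in re.findall(pattern, text)]

# A referential-dialogue style exchange: the box is ordinary text inside both
# the question and the answer, so the LLM needs no extra vocabulary.
question = f"What is the person {box_to_text((0.32, 0.10, 0.55, 0.71))} holding?"
answer = "The person [0.320,0.100,0.550,0.710] is holding an umbrella [0.300,0.050,0.620,0.300]."
print(text_to_boxes(answer))
```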

Summary

The paper introduces Shikra, a Multimodal Large Language Model (MLLM) designed to enhance referential dialogue capabilities in human-AI interactions. Unlike existing MLLMs, Shikra accepts and produces spatial coordinates directly as natural-language text, enabling users to reference specific regions in images during conversations. The architecture is simple, consisting of a vision encoder, an alignment layer, and a language model, without the need for extra vocabularies or pre-/post-detection modules. Shikra performs well on various vision-language tasks such as Referring Expression Comprehension (REC), PointQA, Visual Question Answering (VQA), and image captioning, demonstrating promising performance without task-specific fine-tuning. The paper highlights multiple applications for Shikra, including its potential use in mixed-reality environments and its role in facilitating interactions with visual robots. The authors also acknowledge limitations of the current model, such as language support restricted to English and challenges in dense object detection tasks.
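To make the three-part architecture concrete, here is a minimal sketch of the forward pass: patch features from a vision encoder are projected by a single linear alignment layer and prepended to the text embeddings before the language model. The class name, hidden sizes, and the Hugging-Face-style `inputs_embeds` call are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ShikraLikeModel(nn.Module):
    """Sketch of a vision encoder + alignment layer + LLM pipeline (assumed details)."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a CLIP-style ViT
        self.align = nn.Linear(vision_dim, llm_dim)   # alignment layer
        self.llm = llm                                # decoder-only language model

    def forward(self, pixel_values, input_embeds, attention_mask):
        # 1. Encode the image into patch features: (B, N, vision_dim).
        patch_feats = self.vision_encoder(pixel_values)
        # 2. Project patch features into the LLM embedding space: (B, N, llm_dim).
        visual_tokens = self.align(patch_feats)
        # 3. Prepend visual tokens to the text embeddings and run the LLM;
        #    coordinates inside the text are handled as ordinary tokens.
        inputs = torch.cat([visual_tokens, input_embeds], dim=1)
        visual_mask = torch.ones(visual_tokens.shape[:2],
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs, attention_mask=mask)
```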

Methods

This paper employs the following methods:

  • Multimodal Large Language Model
  • Referential Dialogue

Models Used

  • Shikra-7B
  • Shikra-13B
  • LLaVA-13B

Datasets

The following datasets were used in this research:

  • Flickr30K Entities
  • LVIS
  • Visual Genome
  • RefCOCO
  • Visual7W

Evaluation Metrics

  • Accuracy (see the REC accuracy sketch after this list)
  • Precision
  • Recall
  • F1-score
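REC results are conventionally reported as accuracy at an IoU threshold: a predicted box counts as correct when it overlaps the ground-truth box with IoU ≥ 0.5. The sketch below computes that metric, assuming this standard protocol applies here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes matching the ground truth at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: one correct and one incorrect prediction -> accuracy 0.5
preds = [(0.10, 0.10, 0.50, 0.50), (0.60, 0.60, 0.90, 0.90)]
gts   = [(0.12, 0.08, 0.52, 0.48), (0.10, 0.10, 0.40, 0.40)]
print(rec_accuracy(preds, gts))  # 0.5
```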

Results

  • Shikra demonstrates promising performance on REC, PointQA, VQA, and image captioning tasks without task-specific fine-tuning.
  • Shikra achieves state-of-the-art performance in various settings.

Limitations

The authors identified the following limitations:

  • Shikra only supports English and may not be user-friendly for non-English speakers.
  • Shikra is unsuitable for dense object detection and segmentation tasks.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Multimodal Large Language Models, referential dialogue, spatial coordinate inputs and outputs, vision-language tasks

Papers Using Similar Methods

External Resources