
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan (Microsoft Research Asia, 2023)

Paper Information
arXiv ID: 2303.04671
Venue: arXiv.org
Domain: Artificial Intelligence, Computer Vision, Natural Language Processing
SOTA Claim: Yes
Reproducibility: 7/10

Abstract

ChatGPT is attracting cross-field interest because it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.

Summary

The paper presents Visual ChatGPT, a system that combines ChatGPT with various Visual Foundation Models (VFMs) to enable multimodal interactions, including talking, drawing, and editing images. A Prompt Manager dynamically converts visual information into language, allowing ChatGPT to handle complex visual tasks through iterative prompts and responses. The experimental results demonstrate its effectiveness in addressing intricate visual inquiries while managing the collaboration among the VFMs. However, limitations such as dependency on existing models and real-time processing constraints are acknowledged.
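
The iterative prompt-and-response loop can be pictured with a minimal Python sketch. This is our own illustration, not the released implementation; the helper names (`chatgpt_complete`, `tools`, `tool_specs`) and the `Action:` reply convention are assumptions.

```python
# Hypothetical sketch of the Visual ChatGPT control loop: the Prompt Manager
# renders tool descriptions and chat history into a text prompt, ChatGPT either
# answers directly or names a Visual Foundation Model to call, and the tool's
# output is appended as an observation for the next round.

def build_prompt(tool_specs: dict, history: str, user_message: str) -> str:
    tools_text = "\n".join(f"- {name}: {desc}" for name, desc in tool_specs.items())
    return (
        "You can use these visual tools:\n"
        f"{tools_text}\n"
        "Reply with a final answer, or with 'Action: <tool> | <input>'.\n"
        f"History:\n{history}\nHuman: {user_message}\n"
    )

def run_dialogue(user_message, history, tools, tool_specs, chatgpt_complete, max_steps=5):
    """Iterate ChatGPT calls, executing VFM tools on demand until a final answer."""
    prompt = build_prompt(tool_specs, history, user_message)
    reply = ""
    for _ in range(max_steps):
        reply = chatgpt_complete(prompt)                  # assumed LLM API wrapper
        if reply.startswith("Action:"):
            name, tool_input = (s.strip() for s in reply[len("Action:"):].split("|", 1))
            observation = tools[name](tool_input)         # run the chosen VFM
            prompt += f"{reply}\nObservation: {observation}\n"
        else:
            break                                         # ChatGPT answered the user
    return reply
```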

Methods

This paper employs the following methods:

  • Prompt Manager
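
As a rough illustration of what the Prompt Manager contributes, the template below shows how a single VFM could be described to ChatGPT. The field names follow the paper's description (tool name, when to use it, input and output formats), but the wording and the example tool are our assumptions, not the system's actual prompts.

```python
# Hypothetical Prompt Manager fragment: a template that turns a VFM's metadata
# into the natural-language specification injected into ChatGPT's prompt.

VFM_SPEC_TEMPLATE = (
    "Tool name: {name}\n"
    "When to use: {usage}\n"
    "Input: {inputs}\n"
    "Output: {outputs}\n"
)

# Example with illustrative values: describing an image-captioning tool.
image_captioning_spec = VFM_SPEC_TEMPLATE.format(
    name="Get Photo Description",
    usage="useful when you want to know what is inside an image",
    inputs="the path of the image file",
    outputs="a natural-language caption of the image",
)
```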

Models Used

  • ChatGPT
  • BLIP Model
  • Stable Diffusion
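
For context, the two visual models listed above could be wrapped as callable tools roughly as follows. This sketch uses the Hugging Face transformers and diffusers libraries; the specific checkpoints and function signatures are our assumptions, not the project's actual code.

```python
# Hedged sketch: exposing BLIP (image -> caption) and Stable Diffusion
# (text -> image) as plain Python callables that the dialogue loop can invoke.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP: answers "what is in this image?" style queries with a caption.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption_image(image_path: str) -> str:
    """Return a natural-language description of the image at image_path."""
    image = Image.open(image_path).convert("RGB")
    inputs = blip_processor(image, return_tensors="pt").to(device)
    out = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(out[0], skip_special_tokens=True)

# Stable Diffusion: handles "draw me ..." style instructions.
sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def generate_image(prompt: str, out_path: str = "generated.png") -> str:
    """Generate an image from a text prompt and return the saved file path."""
    image = sd_pipe(prompt).images[0]
    image.save(out_path)
    return out_path
```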

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • None specified

Results

  • Visual ChatGPT effectively combines language and visual input handling
  • Extensive zero-shot experiments verify its visual understanding and generation capabilities

Limitations

The authors identified the following limitations:

  • Dependence on ChatGPT and VFMs
  • Heavy prompt engineering required
  • Limited real-time capabilities
  • Token length limitation in ChatGPT
  • Security and privacy concerns

Technical Requirements

  • Number of GPUs: 4
  • GPU Type: NVIDIA V100

Keywords

Visual ChatGPT, multimodal interaction, Visual Foundation Models, large language models, prompt engineering
