Domain
Artificial Intelligence, Computer Vision, Natural Language Processing
ChatGPT is attracting cross-field interest because it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it cannot currently process or generate images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, show great visual understanding and generation capabilities but are experts only on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT that incorporates different Visual Foundation Models to enable users to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject visual model information into ChatGPT, considering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.
The paper presents Visual ChatGPT, a system that combines ChatGPT with various Visual Foundation Models (VFMs) to enable multimodal interaction: talking about, drawing, and editing images. A Prompt Manager dynamically converts visual information into language, allowing ChatGPT to handle complex visual tasks through iterative rounds of prompts and responses. Experimental results demonstrate its effectiveness in addressing intricate visual inquiries while managing collaboration among the VFMs. The authors acknowledge limitations such as dependence on existing models and real-time processing constraints.
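The central mechanism is the Prompt Manager, which injects natural-language descriptions of the available visual tools into ChatGPT's prompt and routes the model's replies to the right tool. Below is a minimal, self-contained sketch of that idea; all names here (register_tool, dispatch, the stubbed tool bodies, and the `Action: <tool>[<input>]` syntax) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a Prompt-Manager-style tool loop (illustrative names only):
# tools carry natural-language descriptions, the descriptions are injected into
# the system prompt, and the LLM's replies are parsed for tool invocations.
import re
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {}
TOOL_DESCRIPTIONS: Dict[str, str] = {}

def register_tool(name: str, description: str):
    """Register a visual tool so its description can be injected into the prompt."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        TOOL_DESCRIPTIONS[name] = description
        return fn
    return decorator

@register_tool("ImageCaptioning", "Describe the content of an image file.")
def image_captioning(image_path: str) -> str:
    # A real system would call a captioning model such as BLIP here.
    return f"(caption of {image_path})"

@register_tool("Text2Image", "Generate an image from a text description.")
def text2image(prompt: str) -> str:
    # A real system would call a generator such as Stable Diffusion here.
    return "image/generated_0.png"

def build_system_prompt() -> str:
    """The Prompt Manager step: convert tool metadata into plain language."""
    lines = ["You can use the following visual tools:"]
    for name, desc in TOOL_DESCRIPTIONS.items():
        lines.append(f"- {name}: {desc}")
    lines.append("To use a tool, reply exactly with: Action: <tool>[<input>]")
    return "\n".join(lines)

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def dispatch(llm_reply: str) -> str:
    """Parse an LLM reply; run the requested tool, else pass the text through."""
    match = ACTION_RE.search(llm_reply)
    if match is None or match.group(1) not in TOOLS:
        return llm_reply
    return TOOLS[match.group(1)](match.group(2))

if __name__ == "__main__":
    print(build_system_prompt())
    print(dispatch("Action: ImageCaptioning[image/cat.png]"))
```

Iterating this dispatch step, with each tool's output fed back into the conversation history, gives the multi-step collaboration between ChatGPT and the VFMs that the paper describes.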
This paper employs the following methods (a usage sketch follows the list):
- ChatGPT
- BLIP Model
- Stable Diffusion
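As a hedged illustration of how two of these models can be called in practice, the sketch below uses the Hugging Face transformers and diffusers libraries; the checkpoint names are common public ones chosen for illustration, and in Visual ChatGPT such calls sit behind the Prompt Manager rather than being invoked directly.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline

# BLIP: image -> text. Translates visual inputs into language for ChatGPT.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image_path: str) -> str:
    """Return a natural-language caption for an image file."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Stable Diffusion: text -> image. Generates or redraws images on request.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def generate_image(prompt: str, out_path: str = "generated.png") -> str:
    """Generate an image from a text prompt and return its file path."""
    image = pipe(prompt).images[0]
    image.save(out_path)
    return out_path
```

In the full system, each such function would be registered as a tool with a description, as in the Prompt Manager sketch above.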
No datasets are listed for this research; evaluation is through case studies. The key findings are:
- Visual ChatGPT effectively combines language and visual input handling
- Extensive zero-shot experiments verify its understanding and generation capabilities
The authors identified the following limitations:
- Dependence on ChatGPT and VFMs
- Heavy prompt engineering required
- Limited real-time capabilities
- Token length limitation in ChatGPT
- Security and privacy concerns
The experiments used the following computational resources:
- Number of GPUs: 4
- GPU Type: NVIDIA V100
Keywords: Visual ChatGPT, multimodal interaction, Visual Foundation Models, large language models, prompt engineering