HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang
Zhejiang University; Microsoft Research Asia (2023)

Paper Information
arXiv ID
2303.17580
Venue
Neural Information Processing Systems
Domain
Artificial Intelligence, Machine Learning, Natural Language Processing, Computer Vision, Speech
SOTA Claim
Yes
Reproducibility
8/10

Abstract

Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.

Summary

This paper presents HuggingGPT, an agent that utilizes large language models (LLMs), particularly ChatGPT, to manage and execute complex AI tasks by leveraging expert models from machine learning communities like Hugging Face. HuggingGPT acts as a controller to perform task planning, model selection, task execution, and response generation, thus integrating multimodal capabilities in language, vision, and speech. The authors explore the challenges of LLMs when coordinating multiple expert models and propose HuggingGPT as a solution to address these challenges while demonstrating its effectiveness through extensive experiments on various tasks across different domains.
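
The four stages named above (task planning, model selection, task execution, response generation) can be pictured as a small controller loop. The following is a minimal sketch and not the authors' implementation: every name in it (`Subtask`, `plan_tasks`, `select_model`, `execute`, `summarize`, `hugginggpt_sketch`, `REGISTRY`, the `"<prev>"` placeholder) is hypothetical, and the LLM calls and Hugging Face models are replaced by plain Python stand-ins so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Expert = Callable[[str], str]


@dataclass
class Subtask:
    task: str  # task type, e.g. "image-to-text"
    args: str  # literal input, or "<prev>" to reuse the previous result


def plan_tasks(request: str) -> List[Subtask]:
    # Stage 1 stand-in: a real controller would prompt the LLM for this plan.
    return [Subtask("image-to-text", "cat.png"), Subtask("text-to-speech", "<prev>")]


def select_model(sub: Subtask, registry: Dict[str, List[Tuple[str, Expert]]]) -> Expert:
    # Stage 2 stand-in: the LLM would choose among model descriptions;
    # here we simply take the first registered candidate for the task type.
    candidates = registry.get(sub.task, [])
    if not candidates:
        raise KeyError(f"no expert model registered for task {sub.task!r}")
    _description, expert = candidates[0]
    return expert


def execute(plan: List[Subtask], registry: Dict[str, List[Tuple[str, Expert]]]) -> List[Tuple[str, str]]:
    # Stage 3: run each subtask in order, feeding "<prev>" from the last output.
    results = []
    prev = ""
    for sub in plan:
        expert = select_model(sub, registry)
        prev = expert(prev if sub.args == "<prev>" else sub.args)
        results.append((sub.task, prev))
    return results


def summarize(request: str, results: List[Tuple[str, str]]) -> str:
    # Stage 4 stand-in: the LLM would write a natural-language answer.
    steps = "; ".join(f"{task} -> {out}" for task, out in results)
    return f"Request: {request} | {steps}"


def hugginggpt_sketch(request: str, registry: Dict[str, List[Tuple[str, Expert]]]) -> str:
    return summarize(request, execute(plan_tasks(request), registry))


# Toy expert models (plain functions, not real Hugging Face models).
REGISTRY = {
    "image-to-text": [("toy captioner", lambda path: f"a caption for {path}")],
    "text-to-speech": [("toy TTS", lambda text: f"audio({text})")],
}

print(hugginggpt_sketch("describe and read aloud cat.png", REGISTRY))
```

In this sketch the plan is a simple chain, with each step optionally consuming the previous output. A fuller controller would likely need explicit dependencies between subtasks, which is one reason plan feasibility appears among the limitations listed later.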

Methods

This paper employs the following methods:

  • HuggingGPT

Models Used

  • ChatGPT
  • gpt-3.5-turbo
  • text-davinci-003
  • gpt-4

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • F1
  • Accuracy
  • GPT-4 Score
  • Normalized Edit Distance
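
For readers unfamiliar with these metrics, the sketch below shows one common way to compute three of them. It is an illustration under assumptions, not the paper's evaluation code: the function names are hypothetical, F1 is computed over sets of predicted versus reference task labels, and edit distance is normalized by the length of the longer sequence, which is only one of several conventions. GPT-4 Score relies on an LLM judge and is not reproduced here.

```python
from typing import Sequence, Set


def accuracy(predicted: Sequence, gold: Sequence) -> float:
    # Fraction of positions where the prediction matches the reference.
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    # Harmonic mean of precision and recall over two label sets.
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def edit_distance(a: Sequence, b: Sequence) -> int:
    # Levenshtein distance via dynamic programming, one row at a time.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: Sequence, b: Sequence) -> float:
    # Assumed normalization: divide by the longer sequence's length.
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Edit distance suits ordered plans, where a missing or misplaced step should be penalized, while set-based F1 ignores order and only asks whether the right task types were predicted.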

Results

  • HuggingGPT manages and executes complex AI tasks by pairing an LLM controller with expert models.
  • Experiments indicate that HuggingGPT can handle requests spanning language, vision, and speech tasks.

Limitations

The authors identified the following limitations:

  • Relies heavily on LLM capabilities for planning; feasibility and optimality of plans cannot always be ensured.
  • Efficiency suffers because each request requires multiple interactions with the LLM, which increases response time.
  • Maximum token lengths limit how many models can be connected.
  • LLMs may fail to conform to instructions, which makes the system unstable.
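
One way to picture the instruction-conformance limitation is a guard that checks the LLM's plan before anything is executed and asks again when the output is malformed. The sketch below is an assumption-laden illustration, not something described on this page: the plan schema (`task`, `id`, `dep`, `args`), the helper names `parse_plan` and `plan_with_retries`, and the `ask_llm` callable are all hypothetical.

```python
import json
from typing import Callable, List, Optional

REQUIRED_KEYS = {"task", "id", "dep", "args"}  # illustrative schema, assumed


def parse_plan(raw: str) -> Optional[List[dict]]:
    # Return the plan if it is well-formed JSON matching the schema, else None.
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, list):
        return None
    seen_ids = set()
    for step in plan:
        if not isinstance(step, dict) or not REQUIRED_KEYS <= step.keys():
            return None
        if not isinstance(step["id"], int) or not isinstance(step["dep"], list):
            return None
        if not all(isinstance(dep, int) for dep in step["dep"]):
            return None
        # A step may only depend on steps that appeared earlier in the plan.
        if any(dep not in seen_ids for dep in step["dep"]):
            return None
        seen_ids.add(step["id"])
    return plan


def plan_with_retries(ask_llm: Callable[[str], str], request: str,
                      max_attempts: int = 3) -> List[dict]:
    # Re-query the (hypothetical) LLM until it returns a valid plan.
    for _ in range(max_attempts):
        plan = parse_plan(ask_llm(request))
        if plan is not None:
            return plan
    raise ValueError(f"no valid plan after {max_attempts} attempts")
```

Each retry is another LLM call, so a guard like this trades stability against the efficiency limitation listed above.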

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large Language Models, Hugging Face, ChatGPT, AI Task Automation, Multimodal AI, Autonomous Agents
