← ML Research Wiki / 2307.16789

TOOLLLM: FACILITATING LARGE LANGUAGE MODELS TO MASTER 16000+ REAL-WORLD APIS

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun (Tsinghua University; Renmin University of China; Yale University; WeChat AI, Tencent Inc.; Zhihu Inc.; ModelBest Inc.)

Paper Information
arXiv ID
2307.16789
Venue
International Conference on Learning Representations
Domain
Artificial Intelligence, Natural Language Processing, Machine Learning

Abstract

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench. The codes, trained models, and demo are publicly available at
https://github.com/OpenBMB/ToolBench.
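The neural API retriever mentioned in the abstract recommends relevant APIs for each instruction by embedding the instruction and API documentation and ranking by similarity. The paper fine-tunes a Sentence-BERT-style dense encoder; the minimal sketch below substitutes a bag-of-words cosine similarity so it runs without external dependencies, and the API descriptions are hypothetical examples.

```python
import math
from collections import Counter

def embed(text):
    """Bag-of-words vector; a stand-in for the dense encoder
    (the paper fine-tunes a Sentence-BERT-style model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(instruction, api_docs, k=3):
    """Rank API descriptions by similarity to the instruction, top-k first."""
    q = embed(instruction)
    ranked = sorted(api_docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

apis = [
    "weather forecast by city name",     # hypothetical API descriptions
    "currency exchange rate lookup",
    "movie showtimes near a location",
]
print(retrieve("what is the weather forecast for Paris", apis, k=1))
# → ['weather forecast by city name']
```

In the actual system the retriever is trained on (instruction, relevant API) pairs from ToolBench, so it generalizes far beyond lexical overlap.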

Summary

The paper presents ToolLLM, a framework designed to enhance the tool-use capabilities of large language models (LLMs) like LLaMA by enabling them to effectively interact with APIs. Recognizing that open-source LLMs have limitations in tool use compared to closed-source models like ChatGPT, the authors developed a new instruction-tuning dataset called ToolBench, which includes 16,464 real-world RESTful APIs across 49 categories. The dataset construction involves API collection from RapidAPI Hub, instruction generation using ChatGPT, and solution path annotation using a depth-first search-based decision tree algorithm. Additionally, ToolEval, an automatic evaluator, is introduced to assess model performance based on pass and win rates. Experimentation demonstrates that the fine-tuned ToolLLaMA model exhibits strong performance in executing complex instructions and generalizing to unseen APIs, while outperforming various baseline models.
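The three-stage construction described above can be sketched as a simple pipeline. In this sketch, `chat` stands in for a ChatGPT call and `search` for the solution-path search; both are hypothetical stubs, not the paper's actual prompts or interfaces.

```python
def build_toolbench(api_subsets, chat, search):
    """Sketch of the three-stage ToolBench construction.

    api_subsets: groups of APIs sampled from the collected pool (stage i).
    chat(prompt): stand-in for ChatGPT generating an instruction (stage ii).
    search(instruction, apis): stand-in for finding a valid chain of
    API calls, e.g. via the DFSDT search (stage iii).
    """
    dataset = []
    for api_subset in api_subsets:
        prompt = f"Write an instruction that requires these APIs: {api_subset}"
        instruction = chat(prompt)
        solution_path = search(instruction, api_subset)
        if solution_path is not None:        # keep only solvable instructions
            dataset.append({"instruction": instruction, "path": solution_path})
    return dataset
```

Filtering out instructions without a valid solution path is what keeps the resulting training data grounded in executable API-call chains.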

Methods

This paper employs the following methods:

  • Depth-First Search-based Decision Tree (DFSDT)
  • ToolEval
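DFSDT lets the model explore a tree of partial API-call chains and backtrack from failed branches instead of committing to a single reasoning trace. A minimal sketch of the search skeleton follows; `propose_children` and `is_success` are hypothetical stand-ins for the LLM proposing next API calls and for checking whether a chain fulfills the instruction.

```python
def dfsdt(root, propose_children, is_success, max_depth=6):
    """Minimal depth-first search over a tree of reasoning/API-call nodes.

    On a failed branch the search backtracks and tries sibling
    expansions, bounding exploration by max_depth.
    """
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if is_success(node):
            return node                      # a valid solution path
        if depth < max_depth:
            # push children in reverse so the first proposal is tried first
            for child in reversed(propose_children(node)):
                stack.append((child, depth + 1))
    return None                              # no solution within the budget

# toy example: grow the string "abc" one character at a time
path = dfsdt("", lambda n: [n + c for c in "abc"], lambda n: n == "abc")
```

The real algorithm additionally uses the LLM to judge which branches are worth expanding, rather than exhausting every sibling.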

Models Used

  • ToolLLaMA
  • LLaMA

Datasets

The following datasets were used in this research:

  • ToolBench
  • APIBench
  • RapidAPI Hub (source of the collected APIs)

Evaluation Metrics

  • Pass rate
  • Win rate
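ToolEval reports pass rate (the fraction of instructions the model completes within a budget) and win rate (how often the model's solution is preferred over a reference, judged automatically by ChatGPT in the paper). A minimal sketch of both metrics, with hypothetical outcomes and verdicts; counting a tie as half a win is an assumption here, not a detail from the paper.

```python
def pass_rate(outcomes):
    """Fraction of instructions solved (True) out of all attempted."""
    return sum(outcomes) / len(outcomes)

def win_rate(verdicts):
    """Fraction of pairwise comparisons won against the reference;
    ties are counted as half a win (an assumption of this sketch)."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[v] for v in verdicts) / len(verdicts)

solved = [True, True, False, True]       # hypothetical per-instruction outcomes
prefs = ["win", "tie", "loss", "win"]    # hypothetical judge verdicts
print(pass_rate(solved))  # 0.75
print(win_rate(prefs))    # 0.625
```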

Results

  • ToolLLaMA outperforms Text-Davinci-003 and Claude-2, showing comparable performance to ChatGPT
  • Strong zero-shot generalization ability on APIBench dataset

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large Language Models, API, Tool use, Instruction tuning, Open-source models, GPT-3.5, GPT-4, LLaMA, ToolBench, DFSDT

External Resources

  • https://github.com/OpenBMB/ToolBench