Computer Science

ICLR 2024

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs

Y. Qin, S. Liang, et al.

ToolLLM bridges the tool-use gap in open-source LLMs by introducing ToolBench, an instruction-tuning dataset constructed automatically with ChatGPT that covers 16,464 real-world RESTful APIs, along with a depth-first search-based decision tree algorithm for richer reasoning, a neural API retriever, and the automatic evaluator ToolEval. Fine-tuning LLaMA on ToolBench produces ToolLLaMA, which shows tool-use performance comparable to ChatGPT and strong zero-shot generalization.

Abstract

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM, ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. ToolLLaMA also demonstrates strong zero-shot generalization ability on an out-of-distribution tool-use dataset: APIBench.
The code, trained models, and demo are publicly available at https://github.com/OpenBMB/ToolBench
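
The abstract describes searching for a valid solution path (a chain of API calls) with a depth-first search-based decision tree. The following is only a minimal sketch of generic depth-first search with backtracking over action chains, not the authors' implementation. The names dfs_solution_path, toy_expand, and the toy "add_3" and "double" actions are hypothetical stand-ins: in the paper's setting an LLM would propose the candidate calls, whereas here they are hard-coded so the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

State = int
Action = str


def dfs_solution_path(
    state: State,
    expand: Callable[[State], List[Tuple[Action, State]]],
    is_goal: Callable[[State], bool],
    max_depth: int,
    path: Optional[List[Action]] = None,
) -> Optional[List[Action]]:
    """Return the first chain of actions that reaches a goal state, or None."""
    path = [] if path is None else path
    if is_goal(state):
        return list(path)
    if len(path) >= max_depth:
        return None  # depth budget exhausted: abandon this branch
    for action, next_state in expand(state):
        path.append(action)
        found = dfs_solution_path(next_state, expand, is_goal, max_depth, path)
        if found is not None:
            return found
        path.pop()  # backtrack and try the next candidate action
    return None


def toy_expand(state: State) -> List[Tuple[Action, State]]:
    # Hypothetical candidate "API calls" available from any state.
    return [("add_3", state + 3), ("double", state * 2)]


result = dfs_solution_path(1, toy_expand, lambda s: s == 11, max_depth=4)
print(result)  # ['add_3', 'double', 'add_3']

unreachable = dfs_solution_path(1, toy_expand, lambda s: s == 0, max_depth=4)
print(unreachable)  # None
```

The point of the sketch is the backtracking step: when a branch fails or runs out of depth, the search returns to the previous node and tries a different action instead of committing to a single trace. That is the sense in which the abstract speaks of evaluating multiple reasoning traces and expanding the search space.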

Publisher

ICLR 2024

Authors

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun

DOI

https://doi.org/10.48550/arXiv.2307.16789

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Computer Science

Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search

N. Dainese, M. Alakuijala, et al.

Computer Science

Persuading large language models to comply with objectionable requests

L. Meincke, D. Shapiro, et al.

Medicine and Health

Evaluation of large language models on mental health: from knowledge test to illness diagnosis

Y. Xu, Z. Fang, et al.
