Computer Science
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs
Y. Qin, S. Liang, et al.
ToolLLM bridges the tool-use gap in open-source LLMs by introducing ToolBench, a ChatGPT-generated instruction-tuning dataset covering 16,464 real-world RESTful APIs, along with a depth-first search decision tree for richer reasoning, a neural API retriever, and the automatic evaluator ToolEval. Fine-tuning on ToolBench produces ToolLLaMA, which attains ChatGPT-comparable tool use and strong zero-shot generalization.
~3 min • Beginner • English
Introduction
Tool learning aims to unleash the power of large language models (LLMs) to interact effectively with various tools (APIs) to accomplish complex tasks. By integrating LLMs with APIs, we can greatly expand their utility and empower them to serve as efficient intermediaries between users and the vast ecosystem of applications. Although open-source LLMs, e.g., LLaMA, have achieved versatile capabilities through instruction tuning, they still lack the sophistication needed for higher-level tasks, such as appropriately interacting with tools (APIs) to fulfill complex human instructions. This deficiency arises because current instruction tuning largely focuses on basic language tasks and relatively neglects the tool-use domain. Current SOTA LLMs (e.g., ChatGPT and GPT-4), which have demonstrated impressive competence in using tools, are closed-source with opaque inner mechanisms, limiting democratization and community-driven innovation.
Prior work on tool-use instruction tuning has three limitations: (1) limited APIs: it often fails to involve real-world REST APIs or covers only small, low-diversity API sets; (2) constrained scenarios: it focuses only on single-tool instructions or relies on users to manually specify the ideal API set; and (3) inferior planning and reasoning: it commonly adopts CoT or ReACT prompting without executing APIs to obtain real responses.
To facilitate tool-use capabilities within open-source LLMs, the paper introduces ToolLLM, a general tool-use framework including data construction, model training, and evaluation. The authors collect ToolBench, an instruction-tuning dataset constructed automatically using ChatGPT with function calling, covering single-tool and multi-tool scenarios. They propose a depth-first search-based decision tree (DFSDT) to bolster planning and reasoning, enabling evaluation of multiple reasoning paths and strategic retraction or progression. They also develop ToolEval, an automatic evaluator backed by ChatGPT, with pass rate and win rate metrics.
By fine-tuning LLaMA on ToolBench, ToolLLaMA demonstrates compelling capability for single-tool and complex multi-tool instructions, comparable performance to ChatGPT, robust generalization to unseen APIs using documentation, and strong OOD generalization on APIBench. DFSDT serves as a general decision-making strategy improving reasoning over ReACT. A neural API retriever recommends appropriate APIs, showing high retrieval precision. ToolLLaMA performs on par with specialized pipelines in OOD settings without training on those APIs.
Literature Review
Related work spans tool learning, instruction tuning, and prompting LLMs for decision making. Tool learning studies how LLMs can master tools and make decisions in complex environments, enabling real-time factual knowledge, multimodal functionality, and specialized domain skills; open-source LLMs still lag behind SOTA closed models in tool use. Instruction tuning enhances LLMs' ability to understand instructions and generate proper responses; self-instruct methods curate high-quality data, but tool learning is harder due to API diversity and multi-tool complexity, and existing datasets often fail to address real human needs. Among prompting approaches, ReACT integrates reasoning and acting with environmental feedback but cannot retract a decision once made; Reflexion adds reflection on failures; DFSDT extends both by assessing different reasoning paths and selecting promising ones. DFSDT is similar in spirit to tree-of-thought (ToT) reasoning but targets general decision making over effectively infinite decision spaces, whereas ToT addresses simpler tasks.
Methodology
Dataset construction (ToolBench) comprises three stages: API collection, instruction generation, and solution path annotation, all automated using ChatGPT (gpt-3.5-turbo-16k) with minimal human supervision.
API collection: RapidAPI Hub provides thousands of real-world APIs organized into 49 categories and 500+ collections. The authors crawl tool-level and API-level metadata, including names, descriptions, hosts, HTTP methods, parameters, request bodies, code snippets, and example responses, which lets LLMs understand and use the APIs. The initial crawl collected 10,853 tools (53,190 APIs). A rigorous filtering process removes unreliable APIs (e.g., 404 errors, long latencies, low-quality responses), retaining 3,451 high-quality tools with 16,464 APIs.
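As an illustration of this filtering step, here is a minimal Python sketch. The record shape, the `is_reliable` helper, and the latency threshold are assumptions for illustration; the authors' exact criteria are not spelled out here, and real RapidAPI calls would also require authentication headers.

```python
import time
import requests

# Assumed latency cutoff; the paper does not publish its exact thresholds.
MAX_LATENCY_S = 10.0

def is_reliable(api: dict) -> bool:
    """Keep an API only if a probe call returns quickly with a usable body.

    Real RapidAPI requests also need auth headers, omitted in this sketch.
    """
    url = f"https://{api['host']}{api['path']}"
    try:
        start = time.time()
        resp = requests.request(api["method"], url, timeout=MAX_LATENCY_S)
        latency = time.time() - start
    except requests.RequestException:
        return False                      # unreachable or timed out
    if resp.status_code == 404:
        return False                      # dead endpoint
    if latency > MAX_LATENCY_S:
        return False                      # too slow for interactive tool use
    return bool(resp.text.strip())        # drop empty or trivial responses

# Hypothetical record shape for a crawled API.
apis = [{"host": "example-api.p.rapidapi.com", "path": "/search", "method": "GET"}]
reliable_apis = [a for a in apis if is_reliable(a)]
```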
Instruction generation: To ensure diversity and multi-tool usage, the process samples combinations of APIs and prompts ChatGPT to generate instructions (and corresponding relevant APIs) that involve the sampled APIs. The prompt includes a general task description, comprehensive documentation of each API, and three in-context seed examples (12 single-tool and 36 multi-tool seeds authored by experts). Sampling strategies generate single-tool instructions by iterating tools and multi-tool instructions by selecting 2–5 tools from the same category or collection and sampling up to 3 APIs per tool, yielding intra-category and intra-collection multi-tool instructions. Hallucinated relevant APIs are filtered by checking existence in the crawled hub. The result is nearly 200k qualified (instruction, relevant API) pairs: 87,413 (I1), 84,815 (I2), and 25,251 (I3).
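The sampling logic can be sketched as follows. The `tools_by_category` structure and both function names are hypothetical stand-ins for the authors' pipeline, but the 2-5 tools and up-to-3 APIs-per-tool bounds follow the description above.

```python
import random

def sample_multi_tool_apis(tools_by_category: dict) -> list:
    """Intra-category sampling: 2-5 tools from one category, at most 3 APIs
    per tool. `tools_by_category` maps a category name to tool records, each
    with an 'apis' list -- an assumed shape, not the paper's code. Assumes
    the chosen category contains at least two tools."""
    category = random.choice(list(tools_by_category))
    pool = tools_by_category[category]
    tools = random.sample(pool, k=random.randint(2, min(5, len(pool))))
    sampled = []
    for tool in tools:
        k = random.randint(1, min(3, len(tool["apis"])))
        sampled.extend(random.sample(tool["apis"], k))
    return sampled

def drop_hallucinated(relevant_apis: list, known_api_names: set) -> list:
    """Discard APIs that ChatGPT named but that do not exist in the crawled hub."""
    return [a for a in relevant_apis if a in known_api_names]
```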
Solution path annotation: Each instruction requires searching for a valid action sequence (chain of API calls) through multi-round ChatGPT interactions. Actions specify thought, API name, and parameters. APIs are provided as functions via ChatGPT’s function call feature, along with two special functions: Finish with Final Answer and Finish by Giving Up. To overcome limitations of CoT/ReACT (error propagation, limited exploration), the authors propose a Depth-First Search-based Decision Tree (DFSDT). DFSDT expands the search space by assessing multiple reasoning paths, allowing progression along promising paths or retraction by giving up and restarting at a new node. Node expansion encourages diversity by conditioning on previously generated nodes. DFS is preferred over BFS to finish upon finding any valid path with fewer API calls. DFSDT is applied to all instructions, retaining only passed solution paths, producing 126,486 (instruction, solution path) pairs for training.
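A compact sketch of the DFSDT search loop illustrates the backtracking behavior. It assumes two placeholder helpers, `propose_action` standing in for a ChatGPT function-call request (conditioned on previously tried siblings to encourage diverse expansions) and `execute_action` standing in for a real REST call; neither is the paper's implementation, and the depth and branching budgets are assumed values.

```python
MAX_DEPTH = 12   # per-instruction search budget (assumed value)
BRANCH = 2       # candidate expansions per node (assumed value)

def propose_action(instruction, history, tried_siblings):
    """Placeholder: query the model for the next {'name', 'args'} action."""
    return {"name": "Finish", "args": {"final_answer": "stub"}}

def execute_action(action):
    """Placeholder: issue the real API call and return its response."""
    return {"status": "ok"}

def dfsdt(instruction, history=None, depth=0):
    """Depth-first search over reasoning nodes; returns the first valid path."""
    history = history or []
    if depth >= MAX_DEPTH:
        return None                          # budget exhausted on this branch
    tried = []                               # siblings shown to the model
    for _ in range(BRANCH):
        action = propose_action(instruction, history, tried)
        tried.append(action)
        if action["name"] == "Finish":       # 'Finish with Final Answer'
            return history + [action]        # DFS stops at the first valid path
        if action["name"] == "GiveUp":       # 'Finish by Giving Up': prune here
            continue                         # and restart from a new node
        obs = execute_action(action)         # real API response, not simulated
        path = dfsdt(instruction, history + [(action, obs)], depth + 1)
        if path is not None:
            return path
    return None                              # all expansions failed; backtrack
```

Returning as soon as any branch reaches Finish with Final Answer is exactly why DFS is preferred over BFS here: the search ends at the first valid path, keeping the number of real API calls low.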
ToolEval: An automatic evaluator built on ChatGPT measures tool-use capability with two metrics: pass rate (the ability to successfully execute an instruction within a limited budget) and win rate (preference between two solution paths, judged on information richness, factuality, reasoning, milestones, exploration, and cost). ToolEval achieves high agreement with human evaluation: 87.1% for pass rate and 80.3% for win rate.
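In spirit, both metrics reduce to simple aggregates over evaluator judgments. The sketch below assumes those judgments have already been collected from the ChatGPT-based evaluator; the judge prompt is a paraphrase of the stated criteria, not the paper's exact wording.

```python
def pass_rate(finished: list) -> float:
    """`finished`: booleans, True if the evaluator judged the instruction
    successfully completed within the call budget."""
    return sum(finished) / len(finished)

# Paraphrase of the stated preference criteria; not the paper's exact prompt.
JUDGE_PROMPT = (
    "Given an instruction and two candidate solution paths, decide which is "
    "preferable in terms of information richness, factuality, reasoning, "
    "milestones reached, exploration, and cost. Answer 'A' or 'B'."
)

def win_rate(judgments: list) -> float:
    """`judgments`: 'A'/'B' labels from the ChatGPT judge; win rate of A."""
    return judgments.count("A") / len(judgments)
```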
API retriever: A dense retriever based on Sentence-BERT (BERT-BASE) encodes instructions and API documents to embeddings and ranks relevance by similarity. Training uses relevant APIs as positives and sampled negatives for contrastive learning. Baselines include BM25 and OpenAI’s text-embedding-ada-002. Evaluation uses NDCG@1 and NDCG@5 across I1, I2, I3.
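A minimal version of this retrieval step, using the sentence-transformers library with a generic pretrained checkpoint as a stand-in for the fine-tuned BERT-base retriever, might look like this:

```python
from sentence_transformers import SentenceTransformer, util

# Generic checkpoint as a stand-in; the actual retriever is a BERT-base
# Sentence-BERT model trained contrastively on ToolBench (instruction,
# relevant-API) pairs with sampled negatives.
model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_apis(instruction: str, api_docs: list, top_k: int = 5) -> list:
    """Rank API documents by embedding similarity to the instruction."""
    query_emb = model.encode(instruction, convert_to_tensor=True)
    doc_embs = model.encode(api_docs, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]        # one score per doc
    top = scores.argsort(descending=True)[:top_k].tolist()
    return [(api_docs[i], float(scores[i])) for i in top]
```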
Model training (ToolLLaMA): LLaMA-2 7B is fine-tuned on instruction–solution pairs. To accommodate long API responses, positional interpolation extends context length from 4096 to 8192. Training uses multi-round conversation format; function call metadata is concatenated into prompts. Hyperparameters: learning rate 5e-5, warmup ratio 4e-2, batch size 64, max sequence length 8192, position interpolation ratio 2, trained for two epochs with best checkpoint selection. Response compression reduces overly long API responses by removing unimportant keys based on compression schemas derived via ChatGPT with in-context examples, ensuring essential content fits within token limits.
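Response compression might look roughly like the sketch below. The schema representation (a nested dict of keys to keep) and the character cap are assumptions, standing in for the ChatGPT-derived compression schemas described above.

```python
import json

def compress_response(response, schema, max_chars: int = 2048) -> str:
    """Keep only keys present in the compression schema, then truncate as a
    last resort so the result fits the model's context window. `schema` is a
    nested dict mapping kept keys to sub-schemas (None for leaves) -- an
    assumed representation, not the paper's format."""
    def keep(node, sub):
        if isinstance(node, dict) and isinstance(sub, dict):
            return {k: keep(v, sub.get(k)) for k, v in node.items() if k in sub}
        if isinstance(node, list):
            return [keep(item, sub) for item in node]
        return node
    return json.dumps(keep(response, schema))[:max_chars]
```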
Evaluation settings: Generalization is assessed at three levels for I1 (unseen instructions, unseen tools within seen categories, unseen categories), two levels for I2 (unseen instructions, unseen categories), and one level for I3 (unseen instructions). Baselines include Vicuna, Alpaca, ChatGPT, Text-Davinci-003, GPT-4, and Claude-2, each tested with both ReACT and DFSDT. For practical integration, ToolLLaMA is also evaluated with the top-5 APIs returned by the trained API retriever.
Key Findings
ToolEval reliability: ToolEval shows 87.1% agreement with human pass rate and 80.3% with human win rate.
API retrieval: The dense retriever substantially outperforms baselines across instruction types. Average NDCG@1/@5: BM25 18.5/17.0; Ada 49.6/45.4; Ours 78.0/84.9. By type, Ours achieves I1 NDCG@1/@5 84.2/89.7, I2 68.2/77.9, I3 81.7/87.1.
Reasoning strategy: DFSDT improves pass rates over ReACT and ReACT@N. Average pass rate: ReACT 35.3%, ReACT@N 44.5%, DFSDT 63.8%. By type, DFSDT pass rate is 58.0% (I1), 70.6% (I2), 62.8% (I3).
Main experiments (ToolBench): Using DFSDT, ToolLLaMA achieves strong performance nearly on par with ChatGPT and second to GPT-4 among tested models. Average pass/win rates:
- ChatGPT-DFSDT: 64.8% pass, 64.3% win.
- GPT-4-DFSDT: 71.1% pass, 70.4% win.
- ToolLLaMA-DFSDT: 66.7% pass, 60.0% win.
- ToolLLaMA-DFSDT-Retriever (top-5 retrieved APIs): 67.3% pass, 63.1% win, outperforming ToolLLaMA with oracle API sets in win rate and slightly in pass rate.
- ReACT is consistently weaker; e.g., ToolLLaMA-ReACT average 29.0% pass, 47.0% win.
Baselines Vicuna and Alpaca fail to pass any instruction (0% pass/win), showing standard dialogue-tuned LLaMA variants lack tool-use ability.
OOD generalization (APIBench): ToolLLaMA generalizes strongly to unseen domains (HuggingFace, TorchHub, TensorHub). With the trained retriever: Hallucination/AST accuracy: HF 10.60/16.77, TorchHub 15.70/51.16, TensorHub 6.48/40.59, surpassing Gorilla+BM25 (ZS/RS) on HF and TorchHub AST. With oracle retriever: ToolLLaMA achieves HF 8.66/88.80, TorchHub 14.12/85.88, TensorHub 7.44/88.62, consistently superior to Gorilla-ZS+Oracle and close to Gorilla-RS+Oracle.
Discussion
The results address the core problem of equipping open-source LLMs with robust tool-use capabilities across diverse, real-world APIs. DFSDT substantially enhances planning and execution compared to ReACT, expanding the search space and mitigating error propagation, which translates into higher pass and win rates across single-tool and multi-tool scenarios. ToolLLaMA, trained on ToolBench, demonstrates competitive performance with ChatGPT, validating that instruction tuning on diverse, real API calls and multi-step solution paths can elicit tool-use skills in open models. The trained API retriever effectively selects relevant APIs from a vast pool, sometimes outperforming oracle sets by identifying functionally superior alternatives, thereby improving overall outcomes. Strong OOD performance on APIBench further shows that ToolLLaMA leverages API documentation to adapt to unseen domains without retraining, indicating practical utility for integrating novel APIs.
Conclusion
The paper presents ToolBench, a large-scale, realistic tool-use dataset covering 16k+ real-world APIs and diverse single-tool and multi-tool scenarios. It introduces DFSDT to reinforce LLM planning and reasoning by strategically exploring multiple reasoning paths, and ToolEval for efficient automatic evaluation of tool-use solutions. Fine-tuning LLaMA on ToolBench yields ToolLLaMA, which matches ChatGPT in many settings and generalizes to unseen APIs and out-of-distribution domains. A neural API retriever integrates with ToolLLaMA to automate API selection in real-world usage. The work paves the way for future research at the intersection of instruction tuning and tool use for LLMs.
Limitations
DFSDT consumes more OpenAI API calls than ReACT, though a pre-order traversal design reduces sorting costs while maintaining exploration. API temporal variability and the multiplicity of valid solution paths make it infeasible to annotate fixed ground-truth paths for evaluation, necessitating ToolEval. Despite high agreement rates, tool-use evaluation is intricate and even human annotators often disagree on preferences between solution paths. Some RapidAPI endpoints are unreliable or low-quality and require filtering; long API responses necessitate compression to fit context windows, potentially omitting non-critical details.