Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in biological systems: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
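At its core, STE runs exploration episodes in which the LLM "imagines" user queries for a tool, attempts them against the real tool, and keeps what it learns. Below is a minimal sketch of one episode with hypothetical helper names (the actual implementation lives in `STE/main.py`):

```python
# Sketch of one STE exploration episode (hypothetical helpers; see STE/main.py).
# Short-term memory holds the trials within an episode (depth of exploration);
# long-term memory accumulates across episodes (breadth of exploration).

def run_episode(llm, tool, long_term_memory, num_stm_slots=4, max_turn=4):
    short_term_memory = []                        # trials within this episode
    for _ in range(num_stm_slots):
        # "Imagination": synthesize a plausible user query for the tool,
        # conditioned on earlier explorations to avoid repeating them.
        query = llm.imagine_query(tool, long_term_memory + short_term_memory)
        trial = []
        for _ in range(max_turn):                 # trial and error
            action = llm.propose_tool_call(tool, query, trial)
            feedback = tool.execute(action)       # real execution feedback
            trial.append((action, feedback))
            if llm.judge_answered(query, trial):
                break
        short_term_memory.append((query, trial))
    long_term_memory.extend(short_term_memory)    # persist across episodes
```

The `num_stm_slots` and `max_turn` names mirror the flags of `main.py` shown below.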
```
STE/
├── tool_metadata/: tool-related metadata
├── prompts/: full prompts used
├── saved_results/: prediction results in json
│   ├── {*, *_FT, *_ICL}.json: results for the baseline model, tool-enhanced w/ fine-tuning, and tool-enhanced w/ ICL
│   └── CL_round_*.json: continual learning results (one file per round)
├── main.py: main script for STE
├── postprocessing.py: filtering & paraphrasing for tool enhancement
├── evaluation.ipynb: evaluation script and cached evaluation results
├── my_llm.py: helper functions for LLM API calls
└── utils.py: other helper functions
```
```
llama-recipes/ (adapted from https://github.com/facebookresearch/llama-recipes/)
├── configs/: configurations for model training
│   ├── training.py: model training-related arguments
│   └── ...
├── ft_datasets/: cached data files for fine-tuning and testing
│   ├── api2neighbors.json: nearest neighbors for each API (based on API description similarity)
│   ├── flan_v2_2k.json: 2k random examples from flan_v2
│   ├── tool_data_train_STE_*.json: distilled tool-specific training data
│   ├── tool_test*.json: test sets (w/ retrieved demonstration examples)
│   └── ...
├── inference/: helper functions for model inference
├── sysmsg_dir/: system messages for tool and non-tool mode
├── jobs/: example bash scripts for training/inference
├── llama_finetuning.py: script for model training
├── data_proc_format.py: data formatting/merging for model training
└── demo_retrieve.ipynb: nearest-neighbor demonstration retrieval
```
Put your OpenAI API key in `api_key.txt` in the parent directory (a minimal loading sketch follows the setup steps below).

- For `STE/`, install ToolBench and BMTools and acquire the associated API keys following their respective instructions, then run `pip install -r requirements.txt`.
- For `llama-recipes/`, set up the environment following https://github.com/facebookresearch/llama-recipes.
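A minimal sketch of how the key file can be consumed, assuming the pre-1.0 `openai` Python SDK that matches the `gpt-3.5-turbo-16k-0613`-era model names used below (the repo's actual helper functions are in `STE/my_llm.py`):

```python
# Minimal sketch: load the key from api_key.txt in the parent directory
# and issue one chat completion. Assumes openai<1.0; the repo's own
# loading logic lives in STE/my_llm.py.
import openai

with open("../api_key.txt") as f:
    openai.api_key = f.read().strip()

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k-0613",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["choices"][0]["message"]["content"])
```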
Run tool exploration with STE:

```bash
cd STE
python main.py \
    --model_ckpt gpt-3.5-turbo-16k-0613 \
    --num_episodes 15 \
    --num_stm_slots 4 \
    --max_turn 4 \
    --dir_write <your_directory_to_write> \
    --rapidapi_key <your_rapidapi_key> \
    --if_visualize
```
For STE with custom APIs, simply append the API names and descriptions to `API_list.json` and `API_descriptions.json` in `tool_metadata/`, and change the `run_tool` function in `main.py` to enable execution of the newly added tools.
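For illustration, a hedged sketch of such a `run_tool` extension (the `my_weather_api` name and endpoint are invented; only the dispatch pattern is the point):

```python
# Hypothetical example of extending run_tool in STE/main.py with a custom API.
# "my_weather_api" and its endpoint are placeholders, not part of the repo.
import requests

def run_tool(api_name, action_input):
    if api_name == "my_weather_api":              # newly added tool
        resp = requests.get(
            "https://example.com/weather",        # placeholder endpoint
            params={"q": action_input["city"]},
            timeout=10,
        )
        return resp.text
    # ... existing dispatch for the ToolBench/BMTools APIs ...
    raise ValueError(f"Unknown API: {api_name}")
```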
Then filter and paraphrase the exploration results into tool-specific training data:

```bash
python postprocessing.py \
    --directory <your_directory_to_write> \
    --filter_model_ckpts gpt-4-8k \
    --paraphrase_model_ckpts gpt-3.5-turbo-16k-0613 \
    --target_num_train_per_API 150 \
    --num_para_train_max 6 \
    --save_file_name <your_save_file_name> \
    --if_visualize
```
Format the distilled data for fine-tuning:

```bash
cd llama-recipes/
python data_proc_format.py \
    --tool_file <your_save_file_name> \
    --data_save_dir <your_data_directory> \
    --no_general \
    --add_tool_response
```
Fine-tune the model (example with 4 GPUs):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py \
    --enable_fsdp \
    --model_name <your_model_directory> \
    --num_epochs 2 \
    --batch_size_training 16 \
    --micro_batch_size 1 \
    --val_batch_size 8 \
    --lr 2e-5 \
    --num_workers_dataloader 1 \
    --seed 42 \
    --data_path <your_data_directory> \
    --max_words_dataset 2048 \
    --checkpoint_folder <your_directory_to_save> \
    --save_with_hf \
    --warmup_ratio 0.03 \
    --save_epoch_interval 1 \
    --add_token_list ft_datasets/toolken_list_50.json
```
Then run inference on the test set:

```bash
CUDA_VISIBLE_DEVICES=0 python inference/inference_chat.py \
    --model_name <your_model_directory> \
    --data_path ft_datasets/tool_test.json \
    --save_path <your_save_directory> \
    --item_type query \
    --sys_msg_dir sysmsg_dir/sysmsg_tool.json \
    --quantization
```
First run `demo_retrieve.ipynb` to prepare the retrieved demonstration examples.
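In sketch form, the retrieval pairs each test query with its most similar training examples by embedding similarity (the encoder below is an assumption for illustration, not necessarily the one the notebook uses):

```python
# Sketch of nearest-neighbor demonstration retrieval; demo_retrieve.ipynb
# does the real work. The encoder choice here is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder

def retrieve_demos(test_query, train_examples, k=3):
    q = model.encode([test_query], normalize_embeddings=True)
    X = model.encode([ex["query"] for ex in train_examples],
                     normalize_embeddings=True)
    scores = (X @ q.T).ravel()                    # cosine similarity
    top = np.argsort(-scores)[:k]
    return [train_examples[i] for i in top]
```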
- For GPT-3.5/4:

```bash
cd STE/
python test_gpt.py \
    --model_ckpt {gpt-35-turbo-16k-0613|gpt-4-0613} \
    --save_name <save_file_name> \
    --setting ICL \
    --if_visualize
```
- For models based on Llama/Mistral:

```bash
cd llama-recipes/
CUDA_VISIBLE_DEVICES=0 python inference/inference_chat.py \
    --model_name <your_model_directory> \
    --data_path ft_datasets/tool_test_OTR_DR.json \
    --save_path <your_save_directory> \
    --item_type dialog \
    --sys_msg_dir sysmsg_dir/sysmsg_tool.json \
    --quantization
```
For round {0|1|2|3}, prepare the training data (include the optional `--no_replay` flag to disable experience replay):

```bash
cd llama-recipes/
python data_proc_format.py \
    --tool_file ft_datasets/tool_data_batches.json \
    --batch_id {0|1|2|3} \
    --data_save_dir ft_datasets/CL_round_{0|1|2|3}.json \
    --general_data_file ft_datasets/flan_v2_2k.json \
    --add_tool_response \
    {--no_replay}
```
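Conceptually, experience replay here just mixes the new round's tool data with examples replayed from earlier rounds. A minimal sketch under an assumed data layout (the helper name and sampling scheme are illustrative; the real merging is in `data_proc_format.py`):

```python
# Sketch of experience-replay mixing for continual learning. Assumes
# tool_data_batches.json holds one list of examples per round; the helper
# and sampling scheme are assumptions, not the repo's exact logic.
import json
import random

def build_round_data(batches, round_id, replay=True, replay_per_round=50):
    data = list(batches[round_id])                # new tools for this round
    if replay:
        for past in range(round_id):              # replay earlier rounds
            pool = batches[past]
            data.extend(random.sample(pool, min(replay_per_round, len(pool))))
    random.shuffle(data)
    return data

with open("ft_datasets/tool_data_batches.json") as f:
    batches = json.load(f)
round_data = build_round_data(batches, round_id=1)
```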
Then fine-tune on the prepared round data:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py \
    --enable_fsdp \
    --model_name <your_model_directory> \
    --num_epochs 2 \
    --batch_size_training 16 \
    --micro_batch_size 1 \
    --val_batch_size 8 \
    --lr 2e-5 \
    --num_workers_dataloader 1 \
    --seed 42 \
    --data_path ft_datasets/CL_round_{0|1|2|3}.json \
    --max_words_dataset 2048 \
    --checkpoint_folder <your_directory_to_save> \
    --save_with_hf \
    --warmup_ratio 0.03 \
    --save_epoch_interval 1 \
    --add_token_list ft_datasets/toolken_list_50.json
```
`STE/evaluation.ipynb` includes the evaluation scripts and cached evaluation results for all prediction files in `STE/saved_results/`.
```bibtex
@misc{wang2024llms,
      title={LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error},
      author={Boshi Wang and Hao Fang and Jason Eisner and Benjamin Van Durme and Yu Su},
      year={2024},
      eprint={2403.04746},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/pdf/2403.04746.pdf}
}
```