AgentTuning: Enabling Generalized Agent Abilities For LLMs

🤗 Model (AgentLM-70B) • 🤗 Dataset (AgentInstruct) • 📃 Paper • 🌐 Project Page

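AgentTuning is the first attempt to instruction-tune LLMs using interaction trajectories from multiple agent tasks. Evaluation results show that AgentTuning gives LLMs strong generalization to unseen agent tasks while leaving their general language abilities largely unchanged.

Both the AgentInstruct dataset and the AgentLM models are open-sourced.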

Main Results

Figure 1: Overall scores on held-in and held-out tasks

AgentInstruct

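AgentInstruct is a curated agent dataset containing 1,866 high-quality interactions across 6 diverse real-world tasks, designed to enhance the agent capabilities of language models. It has the following features:

🔍 Chain-of-Thought - Uses the ReAct prompting strategy to provide a detailed chain of thought for each action, giving insight into the model's decision process.
🌍 Diversity - Covers 6 real-world scenarios, from daily household chores to database operations, with average turn counts ranging from 5 to 35.
🎯 Precision - Even GPT-4 cannot solve agent tasks perfectly, so the data is strictly filtered with a trajectory reward mechanism to ensure the quality of every sample.
✅ Generalization - Rigorously checked to avoid data leakage, ensuring that the data generalizes.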
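
The reward-based filtering mentioned in the feature list can be illustrated with a minimal sketch. The record layout (`task`, `reward`, `turns`), the task names, and the threshold of 1.0 are assumptions made for illustration; they are not the project's actual schema or filtering code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical record layout; the real AgentInstruct schema may differ.
    task: str
    reward: float  # final reward assigned to the whole trajectory
    turns: int     # number of interaction rounds


def filter_by_reward(trajectories: List[Trajectory],
                     min_reward: float = 1.0) -> List[Trajectory]:
    """Keep only trajectories whose final reward reaches the threshold."""
    return [t for t in trajectories if t.reward >= min_reward]


# Illustrative candidates; task names and rewards are made up.
candidates = [
    Trajectory(task="os", reward=1.0, turns=5),
    Trajectory(task="db", reward=0.0, turns=7),
    Trajectory(task="webshop", reward=0.6, turns=12),
]
kept = filter_by_reward(candidates)
print([t.task for t in kept])
```

Lowering `min_reward` trades data quality for quantity, which may be reasonable for tasks where a perfect score is rare.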

The AgentInstruct dataset is available at 🤗Huggingface Repo

AgentLM

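The AgentLM models are obtained by fine-tuning the open-source Llama2-chat model series on a mixture of the AgentInstruct and ShareGPT datasets. The models follow the Llama-2-chat conversation format, with the system prompt fixed to "You are a helpful, respectful and honest assistant."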
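
As a rough sketch of that conversation format, the helper below wraps one user message in the single-turn template that the request example in this README uses. The name `build_prompt` is hypothetical, and multi-turn conversations are not handled.

```python
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."


def build_prompt(user_message: str,
                 system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap a single user message in the Llama-2-chat single-turn template."""
    # build_prompt is an illustrative helper, not part of the AgentTuning code.
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"


prompt = build_prompt("Hello!")
print(repr(prompt))
```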

The 7B, 13B, and 70B models are available at the following links:

Model	Huggingface Repo
AgentLM-7B	🤗Huggingface Repo
AgentLM-13B	🤗Huggingface Repo
AgentLM-70B	🤗Huggingface Repo

Running AgentLM

Use Text-Generation-Inference to accelerate the evaluation process. To start an AgentLM-70b instance:

cd docker
docker compose -f agentlm-70b.yml up

Once deployed, the service listens on port 30070, and you can send requests to it:

curl 127.0.0.1:30070/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant.\n<</SYS>>\n\nHello! [/INST]", "parameters":{"temperature": 1.0}}'

# {"generated_text":"Hello! How can I help you today? "}
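
The same request can be sketched in Python with only the standard library. The helper names `make_request` and `generate` are hypothetical; the endpoint, port, and payload mirror the curl command above. Building the request is separated from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def make_request(prompt: str, host: str = "127.0.0.1", port: int = 30070,
                 temperature: float = 1.0) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"temperature": temperature}}
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(make_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]


request = make_request(
    "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant.\n"
    "<</SYS>>\n\nHello! [/INST]"
)
print(request.full_url)
```

With an instance running, `generate(prompt)` would return the `generated_text` field of the JSON response.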

You can append more ports to the docker compose file to launch multiple inference instances.
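
When several instances are running, a client might spread requests across them. The sketch below shows one simple way to do so, round-robin over ports. Port 30070 comes from this README; port 30071 and the helper `endpoint_cycle` are hypothetical.

```python
from itertools import cycle
from typing import Iterator, List


def endpoint_cycle(ports: List[int], host: str = "127.0.0.1") -> Iterator[str]:
    """Yield the /generate URL of each port in turn, forever."""
    return cycle([f"http://{host}:{port}/generate" for port in ports])


# 30070 is the port used in this README; 30071 is an assumed second instance.
endpoints = endpoint_cycle([30070, 30071])
first_three = [next(endpoints) for _ in range(3)]
print(first_three)
```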

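Evaluation

Model evaluation covers 6 held-in tasks, 6 held-out tasks, and general tasks.

Held-in Tasks

The 6 held-in tasks are drawn from AgentBench. However, because AgentBench is still under active development, the latest version may not fully reproduce the results reported in the paper.

The evaluation code used in this project is located in the ./AgentBench.old folder.

Held-out Tasks

The held-out tasks are drawn from the following open-source frameworks:

Task	AgentTuning Evaluation Script	Original Repository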
SciWorld	📂 eval_heldout/science-world	💻 allenai/ScienceWorld
MiniWoB++	📂 eval_heldout/miniwob++	💻 Farama-Foundation/miniwob-plusplus
HotpotQA	📂 eval_heldout/hotpotQA	💻 salesforce/BOLAA
ReWOO	📂 eval_heldout/rewoo	💻 billxbf/ReWOO
WebArena	📂 eval_heldout/webarena	💻 web-arena-x/webarena
Digital Card Game	💻 AgentBench.old ( Extend Split )	💻 THUDM/AgentBench

General Tasks

MMLU Setup

Download the 14k multiple-choice questions to the ./data folder:

cd data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar xf data.tar
cd ..

Run the following command to evaluate the MMLU score of a Hugging Face model:

python eval_general/evaluate_mmlu_hf.py -c THUDM/AgentLM-70b
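
For orientation, the MMLU CSV files have no header row; each row holds a question, four options, and the answer letter. The sketch below turns one such row into a prompt. The prompt layout and the helper `format_question` are assumptions for illustration and may differ from what `evaluate_mmlu_hf.py` actually does.

```python
import csv
import io
from typing import List, Tuple

CHOICE_LABELS = ["A", "B", "C", "D"]


def format_question(row: List[str]) -> Tuple[str, str]:
    """Turn one CSV row (question, 4 options, answer letter) into (prompt, answer)."""
    question, options, answer = row[0], row[1:5], row[5]
    lines = [question]
    lines += [f"{label}. {option}" for label, option in zip(CHOICE_LABELS, options)]
    lines.append("Answer:")
    return "\n".join(lines), answer


# A made-up row in the same column layout, used here instead of a real data file.
sample_csv = "What is 2 + 3?,4,5,6,7,B\n"
row = next(csv.reader(io.StringIO(sample_csv)))
prompt, answer = format_question(row)
print(prompt)
print("gold:", answer)
```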

GSM8k Setup

Deploy TGI
Run the following command to evaluate GSM8k:
```
python eval_general/evaluate_gsm8k_tgi.py --port 30070
```
Use --sample-input-file to load local data; otherwise, the script downloads GSM8K to the local machine.
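
GSM8K reference solutions end with a line of the form `#### <number>`, so scoring usually reduces to comparing that number with the one the model produced. The sketch below shows one simple way to do this; both helper names are hypothetical, and taking the last number in the output is a heuristic, not necessarily what `evaluate_gsm8k_tgi.py` implements.

```python
import re
from typing import Optional


def extract_reference(answer: str) -> str:
    """Return the number after the '####' marker in a GSM8K reference solution."""
    return answer.split("####")[-1].strip().replace(",", "")


def extract_prediction(text: str) -> Optional[str]:
    """Return the last number in the model output, or None if there is none."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None


# Made-up example in the GSM8K answer format.
reference = "Each box holds 6 eggs, so 3 boxes hold 3 * 6 = 18 eggs.\n#### 18"
prediction = "3 boxes of 6 eggs give 18 eggs in total. The answer is 18."
is_correct = extract_prediction(prediction) == extract_reference(reference)
print(is_correct)
```

The comparison is on strings, so this sketch would treat "18.0" and "18" as different; a fuller scorer would normalize numeric values first.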

MT-Bench Setup

Install FastChat locally

git clone https://github.com/lm-sys/FastChat.git
pip install -e FastChat

Deploy TGI

Run the evaluation script

python eval_general/eval_mt_bench_tgi.py --host 127.0.0.1 --port 30070 --model-id agentlm-70b

Evaluate the answers with GPT-4

cd FastChat/fastchat/llm_judge
OPENAI_API_KEY=<your-api-key> python gen_judgment.py --model-list agentlm-70b --parallel <number-of-concurrent-requests>

Citation

If you find our work helpful, please consider citing the following paper:

@misc{zeng2023agenttuning,
      title={AgentTuning: Enabling Generalized Agent Abilities for LLMs},
      author={Aohan Zeng and Mingdao Liu and Rui Lu and Bowen Wang and Xiao Liu and Yuxiao Dong and Jie Tang},
      year={2023},
      eprint={2310.12823},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
