diff --git a/README.md b/README.md index b373b0c333..786d8e81ad 100644 --- a/README.md +++ b/README.md @@ -358,7 +358,7 @@ This project is built upon many excellent open-source projects, including: + [verl](https://github.com/volcengine/verl) and [PyTorch's FSDP](https://pytorch.org/docs/stable/fsdp.html) for LLM training; + [vLLM](https://github.com/vllm-project/vllm) for LLM inference; + [Data-Juicer](https://github.com/modelscope/data-juicer?tab=readme-ov-file) for data processing pipelines; -+ [AgentScope](https://github.com/modelscope/agentscope) for agentic workflow; ++ [AgentScope](https://github.com/agentscope-ai/agentscope) for agentic workflow; + [Ray](https://github.com/ray-project/ray) for distributed systems; + we have also drawn inspirations from RL frameworks like [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [TRL](https://github.com/huggingface/trl) and [ChatLearn](https://github.com/alibaba/ChatLearn); + ...... diff --git a/README_zh.md b/README_zh.md index 3248bb1451..0354c80e34 100644 --- a/README_zh.md +++ b/README_zh.md @@ -358,7 +358,7 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml + [verl](https://github.com/volcengine/verl) 和 [PyTorch's FSDP](https://pytorch.org/docs/stable/fsdp.html) 用于大模型训练; + [vLLM](https://github.com/vllm-project/vllm) 用于大模型推理; + [Data-Juicer](https://github.com/modelscope/data-juicer?tab=readme-ov-file) 用于数据处理管道; -+ [AgentScope](https://github.com/modelscope/agentscope) 用于智能体工作流; ++ [AgentScope](https://github.com/agentscope-ai/agentscope) 用于智能体工作流; + [Ray](https://github.com/ray-project/ray) 用于分布式系统; + 我们也从 [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)、[TRL](https://github.com/huggingface/trl) 和 [ChatLearn](https://github.com/alibaba/ChatLearn) 等框架中汲取了灵感; + ...... diff --git a/docs/sphinx_doc/assets/agentscope_gsm8k_reward.png b/docs/sphinx_doc/assets/agentscope_gsm8k_reward.png index 44d6496b77..61ff756e8a 100644 Binary files a/docs/sphinx_doc/assets/agentscope_gsm8k_reward.png and b/docs/sphinx_doc/assets/agentscope_gsm8k_reward.png differ diff --git a/docs/sphinx_doc/assets/agentscope_gsm8k_turns.png b/docs/sphinx_doc/assets/agentscope_gsm8k_turns.png new file mode 100644 index 0000000000..e925d14be1 Binary files /dev/null and b/docs/sphinx_doc/assets/agentscope_gsm8k_turns.png differ diff --git a/docs/sphinx_doc/source/main.md b/docs/sphinx_doc/source/main.md index a79d3f4548..2ed44e5b54 100644 --- a/docs/sphinx_doc/source/main.md +++ b/docs/sphinx_doc/source/main.md @@ -45,7 +45,7 @@ This project is built upon many excellent open-source projects, including: + [verl](https://github.com/volcengine/verl) and [PyTorch's FSDP](https://pytorch.org/docs/stable/fsdp.html) for LLM training; + [vLLM](https://github.com/vllm-project/vllm) for LLM inference; + [Data-Juicer](https://github.com/modelscope/data-juicer?tab=readme-ov-file) for data processing pipelines; -+ [AgentScope](https://github.com/modelscope/agentscope) for agentic workflow; ++ [AgentScope](https://github.com/agentscope-ai/agentscope) for agentic workflow; + [Ray](https://github.com/ray-project/ray) for distributed systems; + we have also drawn inspirations from RL frameworks like [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [TRL](https://github.com/huggingface/trl) and [ChatLearn](https://github.com/alibaba/ChatLearn); + ...... 
diff --git a/docs/sphinx_doc/source/tutorial/example_react.md b/docs/sphinx_doc/source/tutorial/example_react.md index 1edd420280..d43ff65d6e 100644 --- a/docs/sphinx_doc/source/tutorial/example_react.md +++ b/docs/sphinx_doc/source/tutorial/example_react.md @@ -1,142 +1,192 @@ -# Multi-Step ReAct +# ReAct Agent Training -This example serves as a demonstration for adapting the Trinity-RFT training workflow to your own agentic project, through our OpenAI-compatible `ModelWrapper` class. +This section demonstrates how to train a ReAct Agent using Trinity-RFT. We use [AgentScope](https://github.com/agentscope-ai/agentscope) as an example and leverage its built-in ReAct agent to solve GSM8K math problems. Developers can refer to this example to adapt Trinity-RFT's training workflow to their own agent projects. -Here, we use the [AgentScope](https://github.com/modelscope/agentscope) framework as an example, but you can certainly use any other framework, as Trinity offers great flexibility. This example fine-tunes a model on the GSM8K math dataset by leveraging an agent that uses ReAct-style reasoning with native tool calls. +## Key Features -## Key Features Demonstrated +Before diving into the example, let's review several important features of Trinity-RFT for agentic RL training. -This example highlights several advanced capabilities of the Trinity-RFT framework: +### Compatible with Various Agent Frameworks -### Seamless Integration with External Agent Frameworks -Trinity-RFT is designed to be highly modular. You can easily embed complex, pre-existing agent logic from external frameworks like AgentScope directly into a Trinity `Workflow`. +There are many agent development frameworks, each with different model wrapping and invocation methods. To maximize compatibility, Trinity-RFT wraps the `openai.OpenAI` and `openai.AsyncOpenAI` interfaces. As long as your agent framework supports calling models via the OpenAI interface, you can train agents using Trinity-RFT's `OpenAI` or `AsyncOpenAI` instances. You can also implement your own agent directly using Trinity-RFT's OpenAI interface without any framework. -- **No Need for Rewrites**: You don't have to re-implement the intricate logic of your agent (e.g., the ReAct loop, memory management, or tool invocation) within Trinity. -- **Focus on High-Level Orchestration**: As shown in our `AgentScopeReactV2MathWorkflow`, the Trinity workflow simply initializes and calls the external agent's `reply` method. Trinity abstracts away the underlying complexity, allowing you to focus on the high-level task orchestration and reward design. +### No Need to Modify Agent Code -### General Multi-Step Training -Modern agentic tasks often involve multiple steps of reasoning, tool use, and observation. Trinity-RFT natively supports training across these Multi-Step interactions. +Training agents requires collecting dialogue history and other relevant information (such as `token_id`, `logprobs`) during agent execution, which often requires modifying the source code of the agent application. Trinity-RFT avoids this by wrapping the `openai.OpenAI` or `openai.AsyncOpenAI` instances, automatically collecting all necessary training information during model calls, so that you don't need to change your agent code. -- **Step-Wise Experience Generation**: Instead of only learning from the final answer, Trinity can treat each step within an agent's reasoning trajectory as a distinct learning opportunity.
-- **Credit Assignment**: The reward for solving a task is propagated back to all experiences within the successful trajectory, enabling the model to learn the entire reasoning chain, not just the final response. This is controlled by the `advantage_fn` in the config. +### Supports Multi-Turn Interaction -### Native Tool Calling Support -Trinity-RFT's inference engine and training pipeline are built to support the native OpenAI `tool_calls` format. +Agent tasks often involve multiple steps of reasoning and actioning. Trinity-RFT natively supports RL training for tasks with multi-turn interactions, without limiting the number of turns (just ensure each LLM call's sequence length does not exceed the model's maximum). This allows you to design dynamic-length interactions based on task complexity. Trinity-RFT's dynamic synchronization mechanism enables training to start as soon as enough samples are collected, improving efficiency. -- **Direct Training on Tool Use**: The framework allows the model to be trained on deciding *when* to call a tool, *which* tool to call, and *what* arguments to use, all formatted in the standard `tool_calls` convention. -- **Interoperability**: This native support ensures seamless integration with any service or environment that consumes the OpenAI API format, such as an `MCP_server` (Multi-Agent Collaboration Platform) or other tool-use evaluators. +## Implementation -## How It Works +We will walk through how to train a ReAct agent implemented with AgentScope using Trinity-RFT. -Below we show you how to perform this step-by-step. +### 1. Change the OpenAI client of your Agent -### The Workflow (`workflow.py`) +The {class}`AgentScopeReActAgent ` wraps AgentScope's ReAct agent and injects Trinity-RFT's `openai.AsyncOpenAI` instance during initialization. The subsequent execution is handled by the AgentScope agent itself, with no code modification required. -The core logic is encapsulated in the `AgentScopeReactMathWorkflow` class. +```python +# A simplified version of trinity.common.workflows.agentscope.react.react_agent.AgentScopeReActAgent +class AgentScopeReActAgent: + def __init__( + self, + openai_client: openai.AsyncOpenAI, # provided by Trinity-RFT + # some other params + ): + """Initialize the AgentScope ReAct agent with specified tools and model. -1. **Initialization (`__init__`)**: - - It first initializes the AgentScope environment and the desired agent (`ReActAgent`). - - The most critical integration step is injecting Trinity's model client into the AgentScope agent: - ```python - self.openai_client = model.get_openai_client() - # self.openai_client = get_openai_async_client() # or async client depend on whether you are using async openai client - # ... - self.agent.model.client = self.openai_client - ``` - This ensures that all API calls made by the AgentScope agent are routed through Trinity's `ModelWrapper`, which records the entire conversation history. + Args: + openai_client (openai.AsyncOpenAI): An instance of AsyncOpenAI client. 
+ """ + self.agent_model = OpenAIChatModel( + api_key="EMPTY", + model_name=model_name, + generate_kwargs=generate_kwargs, + stream=False, + ) + # patch the OpenAIChatModel to use the openai_client provided by Trinity-RFT + self.agent_model.client = openai_client + self.agent = ReActAgent( + name="react_agent", + model=self.agent_model, + ) + + async def reply(self, query): + """Generate a response based on the query.""" + # no need to modify your agent logic + return await self.agent.reply( + Msg("user", query, role="user") + ) +``` + +```{note} +We encapsulate AgentScope's ReAct agent in a new class here to clearly demonstrate the process of replacing the OpenAI client. +In practice, you can directly modify the OpenAI client of your existing agent without creating a new class. +``` -2. **Execution (`run`)**: - - The `run` method is remarkably simple. It just passes the task description to the agent. - ```python - content = self.agent.reply(msg).content # your agent logic - ``` - - After the agent completes its multi-step reasoning and produces a final answer, Trinity extracts all the intermediate turns from the model's history: - ```python - experiences = self.model.extract_experience_from_history(clear_history=True) - ``` - - A reward is calculated based on the final answer and is applied to all `Experience` objects generated from the trajectory. These experiences are then sent to the buffer for training. -### Configuration +### 2. Implement the Training Workflow + +The {class}`AgentScopeReActWorkflow ` demonstrates the agent training workflow. Its core `run_async` method includes three steps: + + 1. Call the agent to complete the task and return the result. + 2. Evaluate the result and calculate the reward. + 3. Collect trainable data generated during task execution and combine it with the reward to create training samples (`Experience`). + +```python +# A simplified version of trinity.common.workflows.agentscope.react.react_workflow.AgentScopeReActWorkflow +class AgentScopeReActWorkflow(Workflow): + def __init__( + self, + *, + task: Task, + model: ModelWrapper, + auxiliary_models: Optional[List[openai.OpenAI]] = None, + ): + # initialize the agent + self.agent = AgentScopeReActAgent( + openai_client=model.get_openai_async_client(), + # some other params + ) + # get query from the task + self.query = task.raw_task.get(task.format_args.prompt_key) # type: ignore [index] + + async def run_async(self): + """Run the workflow asynchronously.""" + # Step 1: call the ReAct agent to solve the task + response = await self.agent.reply(self.query) + # Step 2: calculate the reward based on the response + reward = await self.calculate_reward(response) + # Step 3: construct experiences from the interaction history and return them + return self.construct_experiences(reward) + + async def calculate_reward(self, response) -> float: + """Calculate the reward based on the response.""" + # your reward logic + + def construct_experiences(self, reward: float) -> List[Experience]: + """Construct experiences from the agent's interaction history. + + Returns: + List: A list of Experience objects. + """ + # Extract all interaction history generated by this task + exps = self.model.extract_experience_from_history() + # update the reward for each experience + for exp in exps: + exp.reward = reward + return exps -The configuration file fine-tunes the behavior of the entire system. Here are the key parameters for this example: +``` + +### 3. 
Training Configuration -#### Native Tool Calling Settings +Trinity-RFT uses configuration files to control the training workflow. Below are key configuration parameters for this example. -These settings in the `explorer.rollout_model` section configure the vLLM-based engine to generate and parse OpenAI-compatible tool calls. -We use the `Qwen3` model and host model with vLLM. The configuration for different model can be found in [vLLM Toolcalls](https://docs.vllm.ai/en/stable/features/tool_calling.html#qwen-models) +#### Inference Model Configuration +The `explorer.rollout_model` section configures the model used by the agent application. Key parameters include: ```yaml explorer: rollout_model: # ... - enable_auto_tool_choice: true # Enables the model to generate `tool_calls` - tool_call_parser: hermes # Specifies the parser for formatting tool call outputs - reasoning_parser: deepseek_r1 # Helps in parsing the model's thought process - enable_thinking: true # Enables the model to generate intermediate "thoughts" + enable_openai_api: true # Enable OpenAI Client + enable_history: true # Enable automatic call history recording + enable_auto_tool_choice: true # Allow model to generate `tool_calls` + tool_call_parser: hermes # Specify parser for tool call outputs + reasoning_parser: deepseek_r1 # Helps parse model reasoning process + enable_thinking: true # Enable thinking (mainly for Qwen3 series models) ``` -#### Multi-Step Training Strategy +#### Multi-Step Training Algorithm -This setting in the `algorithm` section defines how experiences from a Multi-Step rollout are processed. +The `algorithm` section configures the training algorithm for the agent application. Key parameters include: ```yaml algorithm: algorithm_type: grpo - advantage_fn: step_wise_grpo # Key for Multi-Step training + advantage_fn: step_wise_grpo # The key for multi-step training. This strategy tells Trinity to create independent training samples for each step in the agent's execution path. The `grpo` algorithm then uses these samples to update the model. ``` -- `step_wise_grpo`: This strategy tells Trinity to create a distinct training sample for each step in the agent's execution path. The `grpo` algorithm then uses these samples to update the model. -#### Asynchronous Synchronization for Efficiency +#### Dynamic Synchronization Configuration -Because Multi-Step rollouts produce a variable number of experiences, waiting for a fixed number of *rollouts* is inefficient. We use a dynamic synchronization strategy. +Since agent applications may have variable interaction rounds and sample counts, we enable Trinity-RFT's dynamic synchronization to improve efficiency. Relevant configuration: ```yaml synchronizer: - sync_style: dynamic_by_explorer # Start training when enough experiences are ready - sync_interval: 2 + sync_style: dynamic_by_explorer # Trainer starts training immediately when enough data is generated, rather than padding to a fixed size, improving efficiency + sync_interval: 2 # Check for model parameter updates after every two batches ``` -- `sync_style: dynamic_by_explorer`: The trainer starts a training job as soon as the buffer has collected enough *experiences* (i.e., individual turns), rather than waiting for a fixed number of full agent trajectories. This significantly improves GPU utilization and training throughput. -## How to Run the Example +## Running the Example -1. **Prerequisites**: Ensure you have Trinity installed, along with the dependencies for this example (e.g., `AgentScope`).
Please refer to [Agentscope Github link](https://github.com/agentscope-ai/agentscope/tree/v0). +1. Install dependencies: Follow the [Installation Guide](./trinity_installation.md) to install Trinity-RFT and AgentScope v1.0 or above. -> **NOTE**: This example requires AgentScope from either: > - Commit: `ad13ed5dacecb79d20abf626769f8c7d7a7d2afb` > - Branch: [`v0`](https://github.com/agentscope-ai/agentscope/tree/v0) +```bash +pip install "agentscope>=1.0.4" +``` -2. Download the model you want to use, and fill in the configuration files in `examples/agentscope_tool_react/agentscopev0_tool_react_gsm8k.yaml` or `examples/agentscope_tool_react/agentscopev0_tool_react_dapo.yaml` +2. Download model and dataset: -3. **Launch the training job**: Run the following command from the root directory of the repository. +```bash +huggingface-cli download Qwen/Qwen3-8B +huggingface-cli download openai/gsm8k --repo-type dataset +``` - ```bash - trinity run --config examples/agentscope_tool_react/agentscopev0_tool_react_gsm8k.yaml - ``` +3. Start the training task: - or + ```bash + # Navigate to the Trinity-RFT root directory + cd /path/to/Trinity-RFT - ```bash - trinity run --config examples/agentscope_tool_react/agentscopev0_tool_react_dapo.yaml - ``` + # Run the training for GSM8k dataset: + trinity run --config examples/agentscope_react/gsm8k.yaml + ``` +## Results -The example here for gsm8k dataset is really simple and it can converge in a few minutes on 8 H20 GPUs. +Reward curve: ![](../../assets/agentscope_gsm8k_reward.png) - -The example here for dapo dataset take a little bit longer, but it also converges. - -![](../../assets/agentscope_dapo_reward.png) - -We can also see that the model generally start to use more tool calls to solve the problems. - -![](../../assets/agentscope_dapo_turns.png) - -We can also update the agentscope version to v1, and training on the qwen3-4b-instrcut-2507 - -![](../../assets/agentscope_dapo_qwen3-4B_reward.png) - -## Summary - -This example is simple but demonstrates the power and flexibility of Trinity for training complex, Multi-Step agents that use tools. By seamlessly integrating external agentic logic and providing native support for Multi-Step training and tool calls, Trinity-RFT empowers you to fine-tune models on sophisticated, realistic tasks with high efficiency. diff --git a/docs/sphinx_doc/source_zh/main.md b/docs/sphinx_doc/source_zh/main.md index 5acc3baf0e..9ea1558f03 100644 --- a/docs/sphinx_doc/source_zh/main.md +++ b/docs/sphinx_doc/source_zh/main.md @@ -44,7 +44,7 @@ Trinity-RFT 是一个灵活、通用的大语言模型(LLM)强化微调(RF + [verl](https://github.com/volcengine/verl) 和 [PyTorch's FSDP](https://pytorch.org/docs/stable/fsdp.html) 用于大模型训练; + [vLLM](https://github.com/vllm-project/vllm) 用于大模型推理; + [Data-Juicer](https://github.com/modelscope/data-juicer?tab=readme-ov-file) 用于数据处理管道; -+ [AgentScope](https://github.com/modelscope/agentscope) 用于智能体工作流; ++ [AgentScope](https://github.com/agentscope-ai/agentscope) 用于智能体工作流; + [Ray](https://github.com/ray-project/ray) 用于分布式系统; + 我们也从 [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)、[TRL](https://github.com/huggingface/trl) 和 [ChatLearn](https://github.com/alibaba/ChatLearn) 等框架中汲取了灵感; + ......
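Both the English and Chinese tutorials in this diff leave `calculate_reward` as a placeholder ("your reward logic"). For readers who want something concrete before reaching `templates.py` further down, here is a minimal, self-contained sketch of a GSM8K-style reward. It is only an illustration: the helper name `gsm8k_reward` and the bare regex check are assumptions, whereas the shipped code delegates to `MathBoxedRewardFn` via `GSM8KRewardFn`.

```python
import re


def gsm8k_reward(response: dict, truth: str) -> float:
    """Return 1.0 if the agent's boxed answer matches the GSM8K ground truth, else 0.0.

    Simplified stand-in for GSM8KRewardFn (templates.py in this PR), which delegates
    to MathBoxedRewardFn instead of a bare regex comparison.
    """
    # GSM8K stores the final answer after '####' in the reference solution
    if "####" in truth:
        truth = truth.split("####")[1].strip()
    # the ReAct agent returns a structured dict; its solution sits under the 'result' key
    match = re.search(r"\\boxed\{([^}]*)\}", response.get("result", ""))
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == truth.strip() else 0.0
```

Plugged into `AgentScopeReActWorkflow.calculate_reward`, a scalar like this is what `construct_experiences` copies onto every `Experience` extracted from the interaction history.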
diff --git a/docs/sphinx_doc/source_zh/tutorial/example_react.md b/docs/sphinx_doc/source_zh/tutorial/example_react.md index cf1fa22472..69d76d4784 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_react.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_react.md @@ -1,143 +1,202 @@ -# ReAct 例子 +# ReAct Agent 训练 -本示例用于演示如何通过我们兼容 OpenAI 接口的 `ModelWrapper` 类,将 Trinity-RFT 训练工作流适配到你自己的智能体项目中。 +本节将会展示如何借助 Trinity-RFT 训练一个基于智能体框架实现的 ReAct Agent。这里我们以 [AgentScope](https://github.com/agentscope-ai/agentscope) 框架为例,并使用其内置的 ReAct 智能体来解决 GSM8K 数学问题。开发者可以参考此示例,将 Trinity-RFT 的训练工作流适配到自己的智能体项目中。 -这里我们以 [AgentScope](https://github.com/modelscope/agentscope) 框架为例,但你完全可以使用其他任何框架,因为 Trinity 提供了极大的灵活性。该示例利用一个采用 ReAct 风格推理并支持原生工具调用的智能体(Agent),在 GSM8K 数学数据集上对模型进行微调。 ## 关键特性 -此示例突出了 Trinity-RFT 框架的几项高级特性: +在介绍案例之前,我们先来看看 Trinity-RFT 在训练智能体应用方面的几个重要特性。 -### 与外部智能体框架的无缝集成 -Trinity-RFT 被设计为高度模块化,因此你可以轻松地将来自外部框架(如 AgentScope)的复杂、现成的智能体逻辑直接嵌入到 Trinity 的 `Workflow` 中。 +### 兼容各种智能体框架 -- **无需重写智能体**:你不必在 Trinity 内重新实现智能体的复杂逻辑(例如 ReAct 循环、内存管理或工具调用)。 -- **关注高层编排**:正如我们在 `AgentScopeReactV2MathWorkflow` 中所展示的那样,Trinity 工作流只需初始化并调用外部智能体的 `reply` 方法。Trinity 对底层复杂性负责,使你能专注于高层任务编排和奖励设计。 +当前智能体开发框架众多,对模型的封装和调用方式也各不相同。为了最大限度地兼容各种框架,Trinity-RFT 对 `openai.OpenAI` 以及 `openai.AsyncOpenAI` 接口进行了封装,只要你的智能体框架支持使用 openai 接口调用模型,就可以通过 Trinity-RFT 提供的 `OpenAI` 或是 `AsyncOpenAI` 实例对智能体进行训练。当然,你也可以不使用任何智能体框架,直接借助 Trinity-RFT 提供的 openai 接口实现自己的智能体。 -### 通用多步训练 -现代智能体任务通常涉及多步推理、工具使用和观察。Trinity-RFT 原生支持跨这些多步交互的训练。 -- **逐步步经验生成**:Trinity 不仅从最终结果进行学习,还能将智能体推理轨迹中的每一步视为独立的学习经验(experience)。 -- **奖励分配**:解决任务的奖励(reward)会传播至成功轨迹内的所有 experience,使模型能够学习整个推理链,而不仅仅是最终响应。这由配置中的 `advantage_fn` 控制。 +### 无需修改智能体代码 -### 原生工具调用支持 -Trinity-RFT 的推理引擎和训练流水线专为支持原生 OpenAI `tool_calls` 格式而构建。 +智能体的训练需要收集智能体运行中产生的对话历史以及其他相关信息(例如 `token_id`,`logprobs`),这往往需要对智能体应用代码进行一定的修改。Trinity-RFT 通过封装 `openai.OpenAI` 或 `openai.AsyncOpenAI` 实例的方式,在模型调用时自动收集训练所需的各种信息,从而避免了对智能体自身代码的修改。 -- **学习使用工具**:该框架允许模型学习*何时*调用工具、*调用哪个*工具以及*使用什么*参数,全部采用标准 `tool_calls` 格式。 -- **易操作性**:这种原生支持确保了与任何消费 OpenAI API 格式的服务或环境无缝集成,例如 `MCP_server`(多智能体协作平台)或其他工具使用评估器。 -## 工作原理 +### 支持多轮次交互 -下面我们逐步介绍如何执行此流程。 +智能体任务通常涉及多步推理、工具使用和观察。为了支持训练智能体应用,Trinity-RFT 原生支持包含多轮交互的训练任务,且不限制交互轮次(只需确保每次模型调用的序列长度不超过模型所支持的上限),这意味着你可以根据任务的复杂度,设计动态长度的交互过程。Trinity-RFT 通过动态同步机制,能够在收集到足够的训练样本后立即启动训练任务,从而提升训练效率。 -### 工作流 (`workflow.py`) -核心逻辑封装在 `AgentScopeReactV2MathWorkflow` 类中。 +## 实现流程 -1. **初始化 (`__init__`)** - - 首先初始化 AgentScope 环境和所需的 Agent(`ReActAgentV2`)。 - - 最关键的集成步骤是将 Trinity 的模型客户端注入到 Agent 中: - ```python - self.openai_client = model.get_openai_client() - # self.openai_client = get_openai_async_client() # or async client depend on whether you are using async openai client - # ... - self.agent.model.client = self.openai_client - ``` - 这确保了 Agent 发出的所有 API 请求都通过 Trinity 的 `ModelWrapper` 进行路由,后者会记录完整的对话历史。 +我们将逐步介绍如何使用 Trinity-RFT 训练一个基于 AgentScope 实现的 ReAct 智能体。 -2. **执行 (`run`)** - - `run` 方法非常简洁,它只是将任务描述传递给 Agent。 - ```python - content = self.agent.reply(msg).content # your agent logic - ``` - - 在 Agent 完成其多步推理并产生最终答案后,Trinity 从模型历史中提取所有中间轮次: - ```python - experiences = self.model.extract_experience_from_history(clear_history=True) - ``` - - 基于最终答案计算奖励,并将其应用于从该轨迹生成的所有 `Experience` 对象。然后这些 experiences 被发送到 Buffer 中用于训练。 -### 配置说明 +### 1. 
更换智能体的 OpenAI 客户端 -配置文件用于微调整个系统的行为。以下是本示例的关键参数: +{class}`AgentScopeReActAgent ` 封装了 AgentScope 的 ReAct 智能体,并在初始化时注入 Trinity-RFT 提供的 `openai.AsyncOpenAI` 实例,而后续的执行过程均由 AgentScope 智能体自行处理,无需任何修改。 -#### 原生工具调用设置 -`explorer.rollout_model` 部分的这些设置用于配置基于 vLLM 的引擎,以生成和解析兼容 OpenAI 的工具调用。 -我们使用 `Qwen3` 模型并通过 vLLM 托管模型。不同模型的配置可参考 [vLLM Toolcalls](https://docs.vllm.ai/en/stable/features/tool_calling.html#qwen-models) +```python +# A simplified version of trinity.common.workflows.agentscope.react.react_agent.AgentScopeReActAgent +class AgentScopeReActAgent: + def __init__( + self, + openai_client: openai.AsyncOpenAI, # provided by Trinity-RFT + # some other params + ): + """Initialize the AgentScope ReAct agent with specified tools and model. + + Args: + openai_client (openai.AsyncOpenAI): An instance of AsyncOpenAI client. + """ + self.agent_model = OpenAIChatModel( + api_key="EMPTY", + model_name=model_name, + generate_kwargs=generate_kwargs, + stream=False, + ) + # patch the OpenAIChatModel to use the openai_client provided by Trinity-RFT + self.agent_model.client = openai_client + self.agent = ReActAgent( + name="react_agent", + model=self.agent_model, + ) + + async def reply(self, query): + """Generate a response based on the query.""" + # no need to modify your agent logic + return await self.agent.reply( + Msg("user", query, role="user") + ) +``` + +```{note} +这里用一个新类封装 AgentScope 的 ReAct 智能体主要是为了清晰地展示更换 OpenAI 客户端的过程。 +在实践中,你可以直接修改现有智能体的 OpenAI 客户端,而无需创建一个新的类。 +``` + + +### 2. 实现训练工作流 + +{class}`AgentScopeReActWorkflow ` 展示了智能体的训练流程,其核心 `run_async` 方法包含三个步骤: + + 1. 调用智能体完成指定任务并获取任务结果。 + 2. 对任务结果进行评估,计算奖励。 + 3. 收集任务执行中产生的可训练数据并集合奖励生成训练样本(`Experience`)。 + +```python +# A simplified version of trinity.common.workflows.agentscope.react.react_workflow.AgentScopeReActWorkflow +class AgentScopeReActWorkflow(Workflow): + def __init__( + self, + *, + task: Task, + model: ModelWrapper, + auxiliary_models: Optional[List[openai.OpenAI]] = None, + ): + # initialize the agent + self.agent = AgentScopeReActAgent( + openai_client=model.get_openai_async_client(), + # some other params + ) + # get query from the task + self.query = task.raw_task.get(task.format_args.prompt_key) # type: ignore [index] + + async def run_async(self): + """Run the workflow asynchronously.""" + # Step 1: call the ReAct agent to solve the task + response = await self.agent.reply(self.query) + # Step 2: calculate the reward based on the response + reward = await self.calculate_reward(response) + # Step 3: construct experiences from the interaction history and return them + return self.construct_experiences(reward) + + async def calculate_reward(self, response) -> float: + """Calculate the reward based on the response.""" + # your reward logic + + def construct_experiences(self, reward: float) -> List[Experience]: + """Construct experiences from the agent's interaction history. + + Returns: + List: A list of Experience objects. + """ + # Extract all interaction history generated by this task + exps = self.model.extract_experience_from_history() + # update the reward for each experience + for exp in exps: + exp.reward = reward + return exps + +``` + +### 3.训练配置 + +Trinity-RFT 借助配置文件来控制整个训练流程,下面是本示例的关键配置参数说明。 + +#### 推理模型配置 + +`explorer.rollout_model` 部分负责配置智能体应用所使用的模型,其中的关键参数如下: ```yaml explorer: rollout_model: # ... 
- enable_auto_tool_choice: true # 允许模型生成 `tool_calls` + enable_openai_api: true # 启用 OpenAI Client + enable_history: true # 启用调用历史自动记录 + enable_auto_tool_choice: true # 允许模型生成 `tool_calls` tool_call_parser: hermes # 指定格式化解析工具调用输出的解析器 reasoning_parser: deepseek_r1 # 有助于解析模型的思维过程 - enable_thinking: true # 允许模型生成中间“思考”内容 + enable_thinking: true # 是否启用模型深度思考能力(主要针对 Qwen3 系列模型) ``` -#### 多步训练策略 +#### 多步训练算法 -`algorithm` 部分的此设置定义了如何处理多步 rollout 产生的 experience。 +`algorithm` 部分负责配置智能体应用所使用的训练算法,其中的关键参数如下: ```yaml algorithm: algorithm_type: grpo - advantage_fn: step_wise_grpo # 多步训练的关键 + advantage_fn: step_wise_grpo # 多步训练的关键,该策略告诉 Trinity 为智能体执行路径中的每一步创建独立的训练样本。`grpo` 算法随后使用这些样本来更新模型。 ``` -- `step_wise_grpo`:该策略告诉 Trinity 为智能体执行路径中的每一步创建独立的训练样本。`grpo` 算法随后使用这些样本来更新模型。 -#### 异步同步提升效率 +#### 动态同步配置 -由于多步 rollout 会产生数量不固定的 experience,等待固定数量的 *rollout* 是低效的。我们采用动态同步策略。 +由于智能体应用在完成不同任务时,交互轮次往往不固定,导致生成的训练样本数量也不固定;为此需要开启 Trinity-RFT 的动态同步功能,以便在收集到足够的训练样本后立即启动训练任务,从而提升训练效率。相关配置如下: ```yaml synchronizer: - sync_style: dynamic_by_explorer # 当积累足够 experience 时即开始训练 - sync_interval: 2 + sync_style: dynamic_by_explorer # 当产生足够训练数据时,trainer 立即启动训练任务,而不是将生成的数据补齐到一个固定规模,能够有效提升训练效率 + sync_interval: 2 # 每执行两个批次的任务后检查是否需要同步更新模型参数 ``` -- `sync_style: dynamic_by_explorer`:当缓冲区收集到足够的 *experience*(即单个对话轮次)时,trainer 即启动一次训练任务,而不是等待固定数量的完整智能体轨迹。这显著提高了 GPU 利用率和训练吞吐量。 - -## 如何运行示例 - -1. **前置条件**:确保已安装 Trinity 及本示例所需依赖(如 `AgentScope`)。请参考 [Agentscope Github link](https://github.com/agentscope-ai/agentscope/tree/v0) - -> **注意**:本示例需要以下来源之一的 AgentScope: -> - Commit: `ad13ed5dacecb79d20abf626769f8c7d7a7d2afb` -> - 分支: [`v0`](https://github.com/agentscope-ai/agentscope/tree/v0) - -2. 下载你想使用的模型,并填写 `examples/agentscope_tool_react/agentscopev0_tool_react_gsm8k.yaml` 或 `examples/agentscope_tool_react/agentscopev0_tool_react_dapo.yaml` 中的配置文件 - -3. **启动训练任务**:从仓库根目录运行以下命令。 - ```bash - trinity run --config examples/agentscope_tool_react/agentscopev0_tool_react_gsm8k.yaml - ``` 或 - ```bash - trinity run --config examples/agentscope_tool_react/agentscopev0_tool_react_dapo.yaml - ``` +## 运行示例 - +1. 安装依赖库:按照 [安装指南](/tutorial/installation.md) 安装 Trinity-RFT,并安装 AgentScope v1.0 及以上版本。 +```bash +pip install "agentscope>=1.0.4" +``` -GSM8K 数据集的示例非常简单,在 8 块 H20 GPU 上几分钟内即可收敛。 - -![](../../assets/agentscope_gsm8k_reward.png) +2. 下载模型和数据集: -DAPO 数据集的示例耗时稍长,但也能够收敛。 +```bash +huggingface-cli download Qwen/Qwen3-8B +huggingface-cli download openai/gsm8k --repo-type dataset +``` -![](../../assets/agentscope_dapo_reward.png) +3. 启动训练任务: -我们还可以看到,模型总体上开始更多地使用工具调用来解决问题。 + ```bash + # Navigate to the Trinity-RFT root directory + cd /path/to/Trinity-RFT -![](../../assets/agentscope_dapo_turns.png) + # Run the training for GSM8k dataset: + trinity run --config examples/agentscope_react/gsm8k.yaml + ``` -我们也可以把使用 v1 版本的 AgentScope 仓库,然后对 Qwen3-4b-instrcut-2507 进行训练: -![](../../assets/agentscope_dapo_qwen3-4B_reward.png) +## 结果展示 -## 总结 +reward 变化曲线: -这个示例虽然简单,但展示了 Trinity 在训练使用工具的复杂多步智能体方面的强大功能和灵活性。通过无缝集成外部智能体逻辑,并提供对多步训练和工具调用的原生支持,Trinity-RFT 使你能够高效地在复杂且真实的任务上微调模型。 +![](../../assets/agentscope_gsm8k_reward.png) diff --git a/examples/agentscope_react/README.md b/examples/agentscope_react/README.md new file mode 100644 index 0000000000..8b3a75f651 --- /dev/null +++ b/examples/agentscope_react/README.md @@ -0,0 +1,5 @@ +# AgentScope ReAct Agent Training Example + +This example demonstrates how to train the [AgentScope](https://github.com/agentscope-ai/agentscope) built-in ReAct Agent using Trinity-RFT. We use the GSM8K dataset as an example.
Developers can refer to this example to adapt Trinity-RFT's training to their own agent projects. + +Full documentation is available at: https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html diff --git a/examples/agentscope_react/gsm8k.yaml b/examples/agentscope_react/gsm8k.yaml new file mode 100644 index 0000000000..b4f9a79952 --- /dev/null +++ b/examples/agentscope_react/gsm8k.yaml @@ -0,0 +1,74 @@ +project: AgentScope-ReAct +name: GSM8K-Qwen3-8B +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} +algorithm: + algorithm_type: grpo + repeat_times: 8 + advantage_fn: step_wise_grpo +model: + model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-8B} + max_response_tokens: 16384 + max_model_len: 24576 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + total_epochs: 1 + batch_size: 32 + train_batch_size: 256 + explorer_input: + taskset: + name: gsm8k + storage_type: file + path: 'openai/gsm8k' + subset_name: 'main' + split: 'train' + format: + prompt_key: 'question' + response_key: 'answer' + rollout_args: + temperature: 1.0 + eval_tasksets: [] + default_workflow_type: 'as_react_workflow' + trainer_input: + experience_buffer: + name: agentscope_gsm8k_buffer + storage_type: queue +explorer: + eval_interval: 50 + runner_per_model: 8 + max_timeout: 360 + rollout_model: + engine_num: 4 + tensor_parallel_size: 1 + enable_prefix_caching: false + enforce_eager: true + enable_openai_api: true + enable_history: true + enable_auto_tool_choice: true + tool_call_parser: hermes + reasoning_parser: deepseek_r1 + enable_thinking: true + dtype: bfloat16 + seed: 42 +synchronizer: + sync_style: dynamic_by_explorer + sync_method: 'nccl' + sync_interval: 2 + sync_timeout: 1200 +trainer: + save_interval: 100 + trainer_config: + actor_rollout_ref: + model: + use_remove_padding: true + actor: + use_dynamic_bsz: true + ppo_max_token_len_per_gpu: 24576 + ulysses_sequence_parallel_size: 2 # sp size + ref: + log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz} + log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu} + ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size +monitor: + monitor_type: tensorboard diff --git a/examples/agentscope_tool_react/README.md b/examples/agentscope_tool_react/README.md index 6b0896838c..e7803a80b8 100644 --- a/examples/agentscope_tool_react/README.md +++ b/examples/agentscope_tool_react/README.md @@ -1,3 +1,6 @@ +> This example is deprecated and will be removed in future versions. +> Please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_react.md) for the latest updates. 
+ # Example for Training ReAct Agent for Tool-Integrated Reasoning on GSM8k/DAPO Dataset This example shows how to train a ReAct agent for tool integrated reasoning on GSM8k/DAPO dataset diff --git a/tests/common/vllm_test.py b/tests/common/vllm_test.py index 985d22722f..2376ee3128 100644 --- a/tests/common/vllm_test.py +++ b/tests/common/vllm_test.py @@ -127,6 +127,8 @@ async def test_generate( self, ): await self.model_wrapper.prepare() + self.assertEqual(self.model_wrapper.model_path, self.config.model.model_path) + self.assertEqual(await self.model_wrapper.model_path_async, self.config.model.model_path) prompts = ["Hello, world!", "Hello, my name is"] n = self.config.algorithm.repeat_times if self.use_async: diff --git a/trinity/common/models/model.py b/trinity/common/models/model.py index 928510eb69..b003065dd4 100644 --- a/trinity/common/models/model.py +++ b/trinity/common/models/model.py @@ -58,6 +58,10 @@ def get_api_server_url(self) -> Optional[str]: """Get the API server URL if available.""" return None + def get_model_path(self) -> Optional[str]: + """Get the model path""" + return None + def _history_recorder(func): """Decorator to record history of the model calls.""" @@ -232,6 +236,16 @@ async def model_version_async(self) -> int: """Get the version of the model.""" return await self.model.get_model_version.remote() + @property + def model_path(self) -> str: + """Get the model path.""" + return ray.get(self.model.get_model_path.remote()) + + @property + async def model_path_async(self) -> str: + """Get the model path.""" + return await self.model.get_model_path.remote() + def get_lora_request(self) -> Optional[LoRARequest]: if self.enable_lora: return ray.get(self.model.get_lora_request.remote()) diff --git a/trinity/common/models/vllm_model.py b/trinity/common/models/vllm_model.py index 3a89ccfad5..f118f6d5a4 100644 --- a/trinity/common/models/vllm_model.py +++ b/trinity/common/models/vllm_model.py @@ -491,6 +491,9 @@ async def reset_prefix_cache(self) -> None: def get_model_version(self) -> int: return self.model_version + def get_model_path(self) -> str: + return self.config.model_path + def get_lora_request(self, lora_path: Optional[str] = None) -> LoRARequest: assert self.config.lora_modules is not None lora_request = LoRARequest(**self.config.lora_modules[0]) diff --git a/trinity/common/workflows/__init__.py b/trinity/common/workflows/__init__.py index 230497d1b1..ac119a3b3e 100644 --- a/trinity/common/workflows/__init__.py +++ b/trinity/common/workflows/__init__.py @@ -1,5 +1,8 @@ # -*- coding: utf-8 -*- """Workflow module""" +from trinity.common.workflows.agentscope.react.react_workflow import ( + AgentScopeReActWorkflow, +) from trinity.common.workflows.customized_math_workflows import ( AsyncMathBoxedWorkflow, MathBoxedWorkflow, @@ -80,6 +83,7 @@ "AgentScopeV0ReactMathWorkflow", # will be deprecated soon "AgentScopeReactMathWorkflow", "AgentScopeV1ReactSearchWorkflow", + "AgentScopeReActWorkflow", "EmailSearchWorkflow", "AsyncMathRULERWorkflow", "MathRULERWorkflow", diff --git a/trinity/common/workflows/agentscope/__init__.py b/trinity/common/workflows/agentscope/__init__.py new file mode 100644 index 0000000000..6ce3eac38d --- /dev/null +++ b/trinity/common/workflows/agentscope/__init__.py @@ -0,0 +1 @@ +# This directory contains various agent workflow implementations that utilize the AgentScope framework. 
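The `get_model_path` plumbing added above (in `trinity/common/models/model.py` and `vllm_model.py`, exercised in `vllm_test.py`) lets workflow code discover which model the rollout engine is serving, for example to fill in the `model_name` expected by an OpenAI-style client. Below is a minimal sketch of how a workflow might consume it; the `MyWorkflow` class is hypothetical, and only the `ModelWrapper.model_path` / `model_path_async` accessors come from this PR.

```python
from trinity.common.models.model import ModelWrapper


class MyWorkflow:
    """Hypothetical workflow fragment showing the new model_path accessors."""

    def __init__(self, model: ModelWrapper):
        self.model = model
        # synchronous property: resolves the path from the rollout actor via ray.get
        self.model_name = model.model_path

    async def prepare(self):
        # awaitable variant for async workflows (mirrors model_version_async)
        self.model_name = await self.model.model_path_async
```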
diff --git a/trinity/common/workflows/agentscope/react/__init__.py b/trinity/common/workflows/agentscope/react/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/trinity/common/workflows/agentscope/react/react_agent.py b/trinity/common/workflows/agentscope/react/react_agent.py new file mode 100644 index 0000000000..7c3cb2c8b2 --- /dev/null +++ b/trinity/common/workflows/agentscope/react/react_agent.py @@ -0,0 +1,60 @@ +from typing import Dict, Type + +import openai +from agentscope.agent import ReActAgent +from agentscope.formatter import OpenAIChatFormatter +from agentscope.message import Msg +from agentscope.model import OpenAIChatModel +from pydantic import BaseModel + + +class AgentScopeReActAgent: + def __init__( + self, + openai_client: openai.AsyncOpenAI, + model_name: str, + system_prompt: str, + generate_kwargs: dict, + response_structure: Type[BaseModel], + ): + """Initialize the AgentScope ReAct agent with specified tools and model. + + Args: + openai_client (openai.AsyncOpenAI): An instance of AsyncOpenAI client. + model_name (str): The name of the model to use. + system_prompt (str): The system prompt for the agent. + generate_kwargs (dict): Generation parameters for the model. + response_structure (Type[BaseModel]): A Pydantic model defining the expected response structure. + """ + # patch the OpenAIChatModel to use the openai_client provided by Trinity-RFT + self.agent_model = OpenAIChatModel( + api_key="EMPTY", + model_name=model_name, + generate_kwargs=generate_kwargs, + stream=False, + ) + self.agent_model.client = openai_client + self.agent = ReActAgent( + name="react_agent", + sys_prompt=system_prompt, + model=self.agent_model, + formatter=OpenAIChatFormatter(), + # we enable agentscope's meta tool to allow agent to call tools dynamically without pre-registration + enable_meta_tool=True, + ) + self.response_structure = response_structure + + async def reply(self, query: str) -> Dict: + """Generate a response from the agent given a query. + + Args: + query (str): The input query for the agent. + + Returns: + Dict: The structured response. + """ + + response = await self.agent.reply( + Msg("user", query, role="user"), structured_model=self.response_structure + ) + return response.metadata diff --git a/trinity/common/workflows/agentscope/react/react_workflow.py b/trinity/common/workflows/agentscope/react/react_workflow.py new file mode 100644 index 0000000000..247ba085d9 --- /dev/null +++ b/trinity/common/workflows/agentscope/react/react_workflow.py @@ -0,0 +1,107 @@ +"""An example workflow using AgentScope's ReAct agent to solve tasks. + +This workflow is a demonstration of how to integrate the AgentScope framework within the Trinity-RFT workflow system with minimal modifications. 
+""" + +from typing import Dict, List, Optional, Union + +import openai + +from trinity.common.experience import Experience +from trinity.common.models.model import ModelWrapper +from trinity.common.workflows.workflow import WORKFLOWS, Task, Workflow + +from .templates import TEMPLATE_MAP + + +@WORKFLOWS.register_module("as_react_workflow") +class AgentScopeReActWorkflow(Workflow): + def __init__( + self, + *, + task: Task, + model: ModelWrapper, + auxiliary_models: Optional[List[openai.OpenAI]] = None, + ): + super().__init__( + task=task, + model=model, + auxiliary_models=auxiliary_models, + ) + self.model_client = model.get_openai_async_client() + + task_type = task.workflow_args.get("type", "gsm8k") + template = TEMPLATE_MAP.get(task_type, None) + if template is None: + raise ValueError( + f"Unsupported task type {task_type} for AgentScope ReAct Agent, please add a template first." + ) + # extract the query and the answer from the task + self.query = task.raw_task.get(task.format_args.prompt_key) # type: ignore [index] + self.answer = task.raw_task.get(task.format_args.response_key) # type: ignore [index] + self.reward_fn = template.reward_fn_cls(**task.reward_fn_args) + + # import here to avoid the import error if agentscope is not installed and this workflow is not used + try: + from trinity.common.workflows.agentscope.react.react_agent import ( + AgentScopeReActAgent, + ) + except ImportError as e: + error_message = f"AgentScope is not installed. Please install the agentscope framework first before running the workflow. Error: {str(e)}" + self.logger.error(error_message) + raise ImportError(error_message) + self.agent = AgentScopeReActAgent( + model_name=self.model_client.model_path, + openai_client=self.model_client, + system_prompt=template.system_prompt, + generate_kwargs={ + "temperature": self.rollout_args.get("temperature", 1.0), + "max_tokens": self.rollout_args.get("max_tokens", 4096), + }, + response_structure=template.response_structure, + ) + + async def run_async(self): + """Run the workflow asynchronously.""" + # Step 1: call the react agent to solve the task + response = await self.agent.reply(self.query) + # Step 2: calculate the reward based on the response + reward = await self.calculate_reward(response) + # Step 3: construct experiences from the interaction history and return them + return self.construct_experiences(reward) + + async def calculate_reward(self, response) -> Union[float, Dict[str, float]]: + """Calculate the reward for the workflow. + + Returns: + Union[float, Dict[str, float]]: The reward value or a dictionary of reward value. + """ + return self.reward_fn(response=response, truth=self.answer) + + def construct_experiences(self, reward: Union[float, Dict[str, float]]) -> List[Experience]: + """Construct experiences from the agent's interaction history. + + Args: + reward (Union[float, Dict[str, float]]): The reward value to assign to each experience. + + Returns: + List: A list of Experience objects. 
+ """ + exps = self.model.extract_experience_from_history() + for exp in exps: + exp.reward = reward if isinstance(reward, float) else sum(reward.values()) + exp.metrics = {"react_memory_length": len(self.agent.agent.memory.content)} + # record detailed reward if available + if isinstance(reward, dict): + exp.metrics.update(reward) + return exps + + @property + def asynchronous(self): + """AgentScope's ReAct agent only supports asynchronous calls, so we set this to True.""" + return True + + @property + def repeatable(self): + """This workflow is not repeatable.""" + return False diff --git a/trinity/common/workflows/agentscope/react/templates.py b/trinity/common/workflows/agentscope/react/templates.py new file mode 100644 index 0000000000..ab7e20ae13 --- /dev/null +++ b/trinity/common/workflows/agentscope/react/templates.py @@ -0,0 +1,59 @@ +from dataclasses import dataclass +from typing import Dict, Optional, Type + +from pydantic import BaseModel, Field + +from trinity.common.rewards import MathBoxedRewardFn, RewardFn + +# For GSM8K task +GSM8KSystemPrompt = """You are an agent specialized in solving math problems with tools. Please solve the math problem given to you. You can write and execute Python code to perform calculation or verify your answer. You should return your final answer within \\boxed{{}}.""" + + +class GSM8KResponseStructure(BaseModel): + result: str = Field( + description="Your solution of the given math problem. Put your final answer in boxed format, e.g., \\boxed{42}" + ) + + +class GSM8KRewardFn(MathBoxedRewardFn): + def __call__( # type: ignore [override] + self, + response: dict, + truth: str, + format_score_coef: float = 0.1, + **kwargs, + ) -> dict[str, float]: + # parse GSM8K truth + if isinstance(truth, str) and "####" in truth: + truth = truth.split("####")[1].strip() + else: + truth = str(truth) + return super().__call__( + response=response["result"], + truth=truth, + with_think=False, + format_score_coef=format_score_coef, + **kwargs, + ) + + +# Registry for different templates + + +@dataclass +class Template: + """A template for different task types, including system prompt and response structure.""" + + system_prompt: str + response_structure: Type[BaseModel] + reward_fn_cls: Type[RewardFn] + + +TEMPLATE_MAP: Dict[str, Optional[Template]] = { + "gsm8k": Template( + system_prompt=GSM8KSystemPrompt, + response_structure=GSM8KResponseStructure, + reward_fn_cls=GSM8KRewardFn, + ), + # Add more templates for different task types as needed +}