diff --git a/README.md b/README.md index eee0b6ca45..bb90d042db 100644 --- a/README.md +++ b/README.md @@ -22,6 +22,7 @@ ## 🚀 News +* [2025-08] We now support training on general multi-step workflows! Please check out examples for [ALFWorld](./docs/sphinx_doc/source/tutorial/example_step_wise.md) and [ReAct](./docs/sphinx_doc/source/tutorial/example_react.md). * [2025-07] Trinity-RFT v0.2.0 is released. * [2025-07] We update the [technical report](https://arxiv.org/abs/2505.17826) (arXiv v2) with new features, examples, and experiments. * [2025-06] Trinity-RFT v0.1.1 is released. @@ -230,7 +231,7 @@ huggingface-cli download {model_name} --local-dir $MODEL_PATH/{model_name} modelscope download {model_name} --local_dir $MODEL_PATH/{model_name} ``` -For more details about model downloading, see [Huggingface](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or [ModelScope](https://modelscope.cn/docs/models/download). +For more details about model downloading, see [Huggingface](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or [ModelScope](https://modelscope.cn/docs/models/download). @@ -331,7 +332,12 @@ Tutorials for running different RFT modes: Tutorials for adapting Trinity-RFT to a new multi-turn agentic scenario: -+ [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md) ++ [Concatenated Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md) + +Tutorials for adapting Trinity-RFT to a general multi-step agentic scenario: + ++ [General Multi-Step tasks](./docs/sphinx_doc/source/tutorial/example_step_wise.md) ++ [ReAct agent tasks](./docs/sphinx_doc/source/tutorial/example_react.md) Tutorials for data-related functionalities: diff --git a/README_zh.md b/README_zh.md index 6d0f8df8f2..d2851f2d39 100644 --- a/README_zh.md +++ b/README_zh.md @@ -22,6 +22,7 @@ ## 🚀 最新动态 +* [2025-08] Trinity-RFT 现在已经支持通用多轮工作流的训练了,请参考 [ALFWorld](./docs/sphinx_doc/source/tutorial/example_step_wise.md) 和 [ReAct](./docs/sphinx_doc/source/tutorial/example_react.md) 的例子! * [2025-07] 发布 Trinity-RFT v0.2.0 版本,新增了多项功能优化。 * [2025-07] 更新了[技术报告](https://arxiv.org/abs/2505.17826) (arXiv v2),增加了新功能、示例和实验。 * [2025-06] 发布 Trinity-RFT v0.1.1 版本,修复了已知问题并提升系统稳定性。 @@ -334,6 +335,12 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml + [多轮任务](./docs/sphinx_doc/source/tutorial/example_multi_turn.md) +将 Trinity-RFT 适配到通用多轮智能体场景的教程: + ++ [通用多轮任务](./docs/sphinx_doc/source/tutorial/example_step_wise.md) ++ [ReAct智能体任务](./docs/sphinx_doc/source/tutorial/example_react.md) + + 数据相关功能的教程: + [高级数据处理及Human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md) diff --git a/docs/sphinx_doc/assets/alfworldv2_reward.png b/docs/sphinx_doc/assets/alfworldv2_reward.png new file mode 100644 index 0000000000..9eca788f70 Binary files /dev/null and b/docs/sphinx_doc/assets/alfworldv2_reward.png differ diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst index 062e9e9e7f..34f67a32c3 100644 --- a/docs/sphinx_doc/source/index.rst +++ b/docs/sphinx_doc/source/index.rst @@ -20,6 +20,8 @@ Welcome to Trinity-RFT's documentation! 
   tutorial/example_reasoning_advanced.md
   tutorial/example_async_mode.md
   tutorial/example_multi_turn.md
+   tutorial/example_step_wise.md
+   tutorial/example_react.md
   tutorial/example_dpo.md
   tutorial/example_data_functionalities.md
diff --git a/docs/sphinx_doc/source/main.md b/docs/sphinx_doc/source/main.md
index dcfb3fde34..4424642322 100644
--- a/docs/sphinx_doc/source/main.md
+++ b/docs/sphinx_doc/source/main.md
@@ -8,6 +8,7 @@ ## 🚀 News
+* [2025-08] We now support training on general multi-step workflows! Please check out examples for [ALFWorld](/tutorial/example_step_wise.md) and [ReAct](/tutorial/example_react.md).
 * [2025-07] Trinity-RFT v0.2.0 is released.
 * [2025-07] We update the [technical report](https://arxiv.org/abs/2505.17826) (arXiv v2) with new features, examples, and experiments.
 * [2025-06] Trinity-RFT v0.1.1 is released.
@@ -309,7 +310,12 @@ Tutorials for running different RFT modes:

 Tutorials for adapting Trinity-RFT to a new multi-turn agentic scenario:

-+ [Multi-turn tasks](/tutorial/example_multi_turn.md)
++ [Concatenated Multi-turn tasks](/tutorial/example_multi_turn.md)
+
+Tutorials for adapting Trinity-RFT to a general multi-step agentic scenario:
+
++ [General Multi-Step tasks](/tutorial/example_step_wise.md)
++ [ReAct agent tasks](/tutorial/example_react.md)

 Tutorials for data-related functionalities:
diff --git a/docs/sphinx_doc/source/tutorial/example_multi_turn.md b/docs/sphinx_doc/source/tutorial/example_multi_turn.md
index 1212b9dcf4..6cc28690e6 100644
--- a/docs/sphinx_doc/source/tutorial/example_multi_turn.md
+++ b/docs/sphinx_doc/source/tutorial/example_multi_turn.md
@@ -1,4 +1,4 @@
-# Multi-Turn RFT
+# Concatenated Multi-Turn RFT

 In Trinity-RFT, we support Agentic RL with multiple rounds of interaction with environments.
diff --git a/docs/sphinx_doc/source/tutorial/example_react.md b/docs/sphinx_doc/source/tutorial/example_react.md
new file mode 100644
index 0000000000..2cec955c72
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/example_react.md
@@ -0,0 +1,136 @@

# Multi-Step ReAct

This example demonstrates how to adapt the Trinity-RFT training workflow to your own agentic project through our OpenAI-compatible `ModelWrapper` class.

Here, we use the [AgentScope](https://github.com/modelscope/agentscope) framework as an example, but you can use any other framework, as Trinity offers great flexibility. This example fine-tunes a model on the GSM8K math dataset by leveraging an agent that uses ReAct-style reasoning with native tool calls.

## Key Features Demonstrated

This example highlights several advanced capabilities of the Trinity-RFT framework:

### Seamless Integration with External Agent Frameworks
Trinity-RFT is designed to be highly modular. You can easily embed complex, pre-existing agent logic from external frameworks like AgentScope directly into a Trinity `Workflow`.

- **No Need for Rewrites**: You don't have to re-implement the intricate logic of your agent (e.g., the ReAct loop, memory management, or tool invocation) within Trinity.
- **Focus on High-Level Orchestration**: As shown in our `AgentScopeReactV2MathWorkflow`, the Trinity workflow simply initializes the external agent and calls its `reply` method. Trinity abstracts away the underlying complexity, allowing you to focus on high-level task orchestration and reward design. A minimal sketch of this pattern is shown below.
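
The snippet below is a simplified sketch of this integration pattern, not the actual `AgentScopeReactV2MathWorkflow` source: `build_agent` and `compute_reward` are placeholders for whatever your agent framework and reward design provide, while the Trinity APIs it relies on (`get_openai_client`, `extract_experience_from_history`, the `WORKFLOWS` registry) are the ones described in the rest of this tutorial.

```python
# Simplified sketch: `build_agent` and `compute_reward` are hypothetical placeholders.
from typing import List, Optional

from trinity.common.experience import Experience
from trinity.common.models.model import ModelWrapper
from trinity.common.workflows.workflow import WORKFLOWS, Task, Workflow


@WORKFLOWS.register_module("my_agent_workflow")
class MyAgentWorkflow(Workflow):
    def __init__(
        self, *, task: Task, model: ModelWrapper, auxiliary_models: Optional[List] = None
    ):
        super().__init__(task=task, model=model, auxiliary_models=auxiliary_models)
        # Build the agent exactly as you would outside Trinity.
        self.agent = build_agent()  # placeholder for your framework's agent constructor
        # Route the agent's API calls through Trinity's recording client
        # (requires `explorer.rollout_model.enable_history: true`).
        # For AgentScope this is the `agent.model.client` attribute; other
        # frameworks expose their OpenAI client in their own way.
        self.agent.model.client = model.get_openai_client()
        self.task_desc = task.task_desc

    def run(self) -> List[Experience]:
        # Let the external agent run its full multi-step ReAct loop untouched.
        final_answer = self.agent.reply(self.task_desc)
        # Every intermediate turn recorded by the ModelWrapper becomes an Experience.
        experiences = self.model.extract_experience_from_history(clear_history=True)
        # Score the final answer and apply the reward to every step of the trajectory.
        reward = compute_reward(final_answer, self.task_desc)  # placeholder reward function
        for exp in experiences:
            exp.reward = reward
        return experiences
```

Nothing about the agent's internal loop has to change; only the client injection and the final experience extraction are Trinity-specific.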

### General Multi-Step Training
Modern agentic tasks often involve multiple steps of reasoning, tool use, and observation. Trinity-RFT natively supports training across these multi-step interactions.

- **Step-Wise Experience Generation**: Instead of only learning from the final answer, Trinity can treat each step within an agent's reasoning trajectory as a distinct learning opportunity.
- **Credit Assignment**: The reward for solving a task is propagated back to all experiences within the successful trajectory, enabling the model to learn the entire reasoning chain, not just the final response. This is controlled by the `add_strategy` field in the config.

### Native Tool Calling Support
Trinity-RFT's inference engine and training pipeline are built to support the native OpenAI `tool_calls` format.

- **Direct Training on Tool Use**: The framework allows the model to be trained on deciding *when* to call a tool, *which* tool to call, and *what* arguments to use, all formatted in the standard `tool_calls` convention.
- **Interoperability**: This native support ensures seamless integration with any service or environment that consumes the OpenAI API format, such as an MCP (Model Context Protocol) server or other tool-use evaluators.

## How It Works

Below, we walk through how this works step by step.

### The Workflow (`workflow.py`)

The core logic is encapsulated in the `AgentScopeReactV2MathWorkflow` class.

1. **Initialization (`__init__`)**:
   - It first initializes the AgentScope environment and the desired agent (`ReActAgentV2`).
   - The most critical integration step is injecting Trinity's model client into the AgentScope agent:
   ```python
   self.openai_client = model.get_openai_client()
   # ...
   self.agent.model.client = self.openai_client
   ```
   This ensures that all API calls made by the AgentScope agent are routed through Trinity's `ModelWrapper`, which records the entire conversation history.

2. **Execution (`run`)**:
   - The `run` method is remarkably simple: it just passes the task description to the agent.
   ```python
   content = self.agent.reply(msg).content
   ```
   - After the agent completes its multi-step reasoning and produces a final answer, Trinity extracts all the intermediate turns from the model's history:
   ```python
   experiences = self.model.extract_experience_from_history(clear_history=True)
   ```
   - A reward is calculated based on the final answer and is applied to all `Experience` objects generated from the trajectory. These experiences are then sent to the buffer for training.

### Configuration

The configuration file controls the behavior of the entire system. Here are the key parameters for this example:

#### Native Tool Calling Settings

These settings in the `explorer.rollout_model` section configure the vLLM-based engine to generate and parse OpenAI-compatible tool calls.
We use a `Qwen3` model hosted with vLLM. The corresponding settings for other models can be found in the [vLLM tool calling documentation](https://docs.vllm.ai/en/stable/features/tool_calling.html#qwen-models).


```yaml
explorer:
  rollout_model:
    # ...
    enable_auto_tool_choice: true  # Enables the model to generate `tool_calls`
    tool_call_parser: hermes       # Specifies the parser for formatting tool call outputs
    reasoning_parser: deepseek_r1  # Helps in parsing the model's thought process
    enable_thinking: true          # Enables the model to generate intermediate "thoughts"
```

#### Multi-Step Training Strategy

This setting in the `algorithm` section defines how experiences from a multi-step rollout are processed.

```yaml
algorithm:
  algorithm_type: grpo
  add_strategy: step_wise_grpo  # Key for multi-step training
```
- `step_wise_grpo`: This strategy tells Trinity to create a distinct training sample for each step in the agent's execution path. The `grpo` algorithm then uses these samples to update the model.

#### Asynchronous Synchronization for Efficiency

Because multi-step rollouts produce a variable number of experiences, waiting for a fixed number of *rollouts* is inefficient. We use a dynamic synchronization strategy instead.

```yaml
synchronizer:
  sync_style: dynamic_by_explorer  # Start training when enough experiences are ready
  sync_interval: 2
```
- `sync_style: dynamic_by_explorer`: The trainer starts a training job as soon as the buffer has collected enough *experiences* (i.e., individual turns), rather than waiting for a fixed number of full agent trajectories. This significantly improves GPU utilization and training throughput.

## How to Run the Example

1. **Prerequisites**: Ensure you have Trinity installed, along with the dependencies for this example (e.g., `agentscope`). Please refer to the [AgentScope GitHub repository](https://github.com/modelscope/agentscope).

2. **Prepare the model and config**: Download the model you want to use, and fill in the configuration file `examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml` or `examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml`.

3. **Launch the training job**: Run the following command from the root directory of the repository.

    ```bash
    trinity run --config examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml
    ```

    or

    ```bash
    trinity run --config examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml
    ```


The GSM8K example is quite simple and can converge within a few minutes on 8 H20 GPUs.

![](../../assets/agentscope_gsm8k_reward.png)

The DAPO example takes a bit longer, but it also converges.

![](../../assets/agentscope_dapo_reward.png)

We can also see that the model gradually starts to use more tool calls to solve the problems.

![](../../assets/agentscope_dapo_turns.png)



## Summary

This example is simple but demonstrates the power and flexibility of Trinity for training complex, multi-step agents that use tools. By seamlessly integrating external agentic logic and providing native support for multi-step training and tool calls, Trinity-RFT empowers you to fine-tune models on sophisticated, realistic tasks with high efficiency.
diff --git a/docs/sphinx_doc/source/tutorial/example_step_wise.md b/docs/sphinx_doc/source/tutorial/example_step_wise.md
new file mode 100644
index 0000000000..f2b772fde7
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/example_step_wise.md
@@ -0,0 +1,201 @@
# General Multi-Step RFT

In Trinity-RFT, we support general multi-step RFT, which can be used to train agents that interact with environments over multiple rounds.

Different from the [multi-turn RFT](./example_multi_turn.md) that concatenates the interaction results into one single `Experience`, this approach treats each step as an individual `Experience`, enabling RL agents to handle longer contexts.

We will now illustrate the general multi-step workflow using ALFWorld. For a hands-on look, you can skip directly to the [code implementation](#example-multi-step-alfworld).

## Build a general step-wise workflow

### Basic concept

In Trinity, we provide two types of general step-wise workflows: `StepWiseRewardWorkflow` and `RewardPropagationWorkflow`. These workflows set up the basic structure of a step-wise workflow and return a list of `Experience` objects in each run. The difference between them is that `StepWiseRewardWorkflow` computes a reward for each step, while `RewardPropagationWorkflow` computes the reward after all steps finish and propagates it back to the previous steps. See `trinity/common/workflows/step_wise_workflow.py` for more details.

To build a new workflow, you mainly need to implement each interaction step in `step()` and the reward function in `reward()`. For example, the core code of the ALFWorld workflow is shown below:


```python
class StepWiseAlfworldWorkflow(RewardPropagationWorkflow):
    ...

    def step(self, step_num: int) -> bool:
        if self.done:
            return False

        # Format observation for the model
        format_obs = format_observation(self.observation)  # type: ignore
        self.memory.append({"role": "user", "content": format_obs})

        # Get action from the model
        responses = self.model.chat(self.memory)
        response_text = responses[0].response_text
        self.memory.append({"role": "assistant", "content": response_text})
        action = parse_action(response_text)

        # Execute action in the environment
        observation, reward, done, info = self.env.step(action)

        # Update internal state
        self.observation = observation
        self.done = done
        if self.done:
            self.final_reward = reward

        # Return False to stop the run if the episode is done
        return not self.done

    def reward(self, exps: list[Experience]) -> float:
        return self.final_reward
```

Also, remember to register your workflow:
```python
@WORKFLOWS.register_module("step_wise_alfworld_workflow")
class StepWiseAlfworldWorkflow(RewardPropagationWorkflow):
    """A step-wise workflow for alfworld task."""
    ...
```

and include it in the init file `trinity/common/workflows/__init__.py`:

```diff
 # -*- coding: utf-8 -*-
 """Workflow module"""
 from .workflow import WORKFLOWS, MathWorkflow, SimpleWorkflow
+from .envs.alfworld.alfworld_workflow import StepWiseAlfworldWorkflow

 __all__ = [
     "WORKFLOWS",
     "SimpleWorkflow",
     "MathWorkflow",
+    "StepWiseAlfworldWorkflow",
 ]
```

### Other Configuration

In general multi-step scenarios, each run may generate a varying number of experiences. To accommodate this, we provide several flexible designs.

- `algorithm.add_strategy = step_wise_grpo`: This strategy computes the advantages for the collected experiences before adding them to the buffer. For this example, we use `step_wise_grpo`, which broadcasts advantages from the last step to the previous steps.

- `buffer.train_batch_size`: The number of experiences to be sampled from the buffer for training, which can be different from the number of experiences generated in each explore step.

- `buffer.trainer_input.use_priority_queue = true`: Using a `PriorityQueue` allows the trainer to preferentially consume experiences with higher priority.
+ +- `synchronizer.sync_style = dynamic_by_explorer`: The explorer determines when to synchronize the model weights with the trainer. + + +The example configuration is shown as: + +```yaml +project: "ALFWORLD" +name: "Step_Wise_Alfworld" +checkpoint_root_dir: /PATH/TO/CHECKPOINT/ALFWORLD_RFT/ +algorithm: + algorithm_type: grpo + repeat_times: 16 + add_strategy: step_wise_grpo +model: + model_path: /PATH/TO/MODEL/ + max_response_tokens: 16384 + max_model_len: 20480 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + total_epochs: 20 + batch_size: 16 + train_batch_size: 7680 # here: batch_size * repeat_times * max_env_steps + max_retry_times: 3 + max_retry_interval: 1 + explorer_input: + taskset: + name: alfworld + storage_type: file + path: 'examples/grpo_alfworld/alfworld_data' # PATH TO ALFWORLD DATA + format: + prompt_key: 'game_file' + rollout_args: + temperature: 1.0 + logprobs: 0 + workflow_args: + max_env_steps: 30 + enable_progress_bar: false + default_workflow_type: 'step_wise_alfworld_workflow' + trainer_input: + experience_buffer: + name: alfworld_buffer + storage_type: queue + use_priority_queue: true +explorer: + max_repeat_times_per_runner: 1 + runner_num: 32 + max_timeout: 3600 + rollout_model: + enable_history: true + engine_num: 2 + tensor_parallel_size: 2 + enable_prefix_caching: false + enforce_eager: true + dtype: bfloat16 + seed: 42 + gpu_memory_utilization: 0.7 + enable_chunked_prefill: true + env_vars: + TMPDIR: /PATH/TO/ALFWORLD_TMP_DIR +synchronizer: + sync_style: dynamic_by_explorer + sync_method: 'nccl' + sync_interval: 2 + sync_timeout: 3600 +trainer: + trainer_type: 'verl' + trainer_config_path: 'examples/grpo_alfworld_general_multi_step/train_alfworld.yaml' + save_interval: 50 +``` + + + +Below, we provide the commands for running the ALFWorld task. + +## Example: Multi-Step ALFWorld +### Environment Preparation +To install the ALFworld environment, you can follow the instructions below. + +1. Pip install: `pip install alfworld[full]` + +2. Export the path: `export ALFWORLD_DATA=/path/to/alfworld/data` + +3. Download the environment: `alfworld-download` + +Now you can find the environment in `$ALFWORLD_DATA` and continue with the following steps. + +You may refer to the original [repository](https://github.com/alfworld/alfworld) for more details. + +### Data Preparation +Our dataset follows the format in Huggingface datasets library, so we should correspondingly convert our env dataset. + +Just check the data preparation scripts and run the following command. +```bash +python examples/grpo_alfworld/get_alfworld_data.py +``` + +The task is described as an environment instead of a single prompt. The task description is the `game_file` file path. + + +### Config preparation and run the experiment + +The default config file is [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step/alfworld.yaml). +You may revise the configurations properly and run the experiment! + +```bash +trinity run --config examples/grpo_alfworld_general_multi_step/alfworld.yaml +``` + +The results are shown in the following figure. + +![](../../assets/alfworldv2_reward.png) + + +Note that we use a Qwen2.5-3B model fine-tuned with SFT as our starting point, ensuring that the model has some basic understanding of the environment. 
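
As a side note, you can quickly sanity-check the ALFWorld installation and the converted data with a small standalone probe before launching a full run. The sketch below mirrors the environment-construction code used by `StepWiseAlfworldWorkflow`; the `game_file` path is a placeholder that you should point at one of the game files under your data directory.

```python
# Standalone sanity check for the ALFWorld setup (assumes `alfworld[full]` is
# installed and `alfworld-download` has been run). The game file path is a placeholder.
import textworld
import textworld.gym
from alfworld.agents.environment.alfred_tw_env import (
    AlfredDemangler,
    AlfredExpert,
    AlfredExpertType,
)

game_file = "/path/to/alfworld/game_file"  # placeholder: a single ALFWorld game file

# Same environment construction as in the step-wise ALFWorld workflow.
expert = AlfredExpert(expert_type=AlfredExpertType.HANDCODED)
request_infos = textworld.EnvInfos(description=True, inventory=True, admissible_commands=True)
env_id = textworld.gym.register_game(game_file, request_infos, wrappers=[AlfredDemangler(), expert])
env = textworld.gym.make(env_id)

observation, info = env.reset()
print(observation)                      # first textual observation of the episode
print(info["admissible_commands"][:5])  # a few of the currently admissible actions
env.close()
```

If the observation and admissible commands print correctly, the environment and data are ready for training.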
diff --git a/examples/agentscope_tool_react/README.md b/examples/agentscope_tool_react/README.md index df08bf1ec7..9d1c77d441 100644 --- a/examples/agentscope_tool_react/README.md +++ b/examples/agentscope_tool_react/README.md @@ -1,137 +1,7 @@ +# ReAct on GSM8K and MATH Dataset -# Training Using Complex Agent Workflows. +This example shows how to train ReAct agent on GSM8K and MATH Dataset. -This example serves as a demonstration for adapting the Trinity-RFT training workflow to your own agentic project, through our OpenAI-compatible `ModelWrapper` class. +For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_react.md). -Here, we use the [AgentScope](https://github.com/modelscope/agentscope) framework as an example, but you can certainly use any other framework, as Trinity offers great flexibility. This example fine-tunes a model on the GSM8K math dataset by leveraging an agent that uses ReAct-style reasoning with native tool calls. - -## Key Features Demonstrated - -This example highlights several advanced capabilities of the Trinity-RFT framework: - -### Seamless Integration with External Agent Frameworks -Trinity-RFT is designed to be highly modular. You can easily embed complex, pre-existing agent logic from external frameworks like AgentScope directly into a Trinity `Workflow`. - -- **No Need for Rewrites**: You don't have to re-implement the intricate logic of your agent (e.g., the ReAct loop, memory management, or tool invocation) within Trinity. -- **Focus on High-Level Orchestration**: As shown in our `AgentScopeReactV2MathWorkflow`, the Trinity workflow simply initializes and calls the external agent's `reply` method. Trinity abstracts away the underlying complexity, allowing you to focus on the high-level task orchestration and reward design. - -### General Multi-Turn Training -Modern agentic tasks often involve multiple steps of reasoning, tool use, and observation. Trinity-RFT natively supports training across these multi-turn interactions. - -- **Step-Wise Experience Generation**: Instead of only learning from the final answer, Trinity can treat each step within an agent's reasoning trajectory as a distinct learning opportunity. -- **Credit Assignment**: The reward for solving a task is propagated back to all experiences within the successful trajectory, enabling the model to learn the entire reasoning chain, not just the final response. This is controlled by the `add_strategy` in the config. - -### Native Tool Calling Support -Trinity-RFT's inference engine and training pipeline are built to support the native OpenAI `tool_calls` format. - -- **Direct Training on Tool Use**: The framework allows the model to be trained on deciding *when* to call a tool, *which* tool to call, and *what* arguments to use, all formatted in the standard `tool_calls` convention. -- **Interoperability**: This native support ensures seamless integration with any service or environment that consumes the OpenAI API format, such as an `MCP_server` (Multi-Agent Collaboration Platform) or other tool-use evaluators. - -## How It Works - -Below we show you how to perform this step-by-step. - -### The Workflow (`workflow.py`) - -The core logic is encapsulated in the `AgentScopeReactV2MathWorkflow` class. - -1. **Initialization (`__init__`)**: - - It first initializes the AgentScope environment and the desired agent (`ReActAgentV2`). 
- - The most critical integration step is injecting Trinity's model client into the AgentScope agent: - ```python - self.openai_client = model.get_openai_client() - # ... - self.agent.model.client = self.openai_client - ``` - This ensures that all API calls made by the AgentScope agent are routed through Trinity's `ModelWrapper`, which records the entire conversation history. - -2. **Execution (`run`)**: - - The `run` method is remarkably simple. It just passes the task description to the agent. - ```python - content = self.agent.reply(msg).content - ``` - - After the agent completes its multi-step reasoning and produces a final answer, Trinity extracts all the intermediate turns from the model's history: - ```python - experiences = self.model.extract_experience_from_history(clear_history=True) - ``` - - A reward is calculated based on the final answer and is applied to all `Experience` objects generated from the trajectory. These experiences are then sent to the buffer for training. - -### The Configuration (`config.yaml`) - -The configuration file fine-tunes the behavior of the entire system. Here are the key parameters for this example: - -#### Native Tool Calling Settings - -These settings in the `explorer.rollout_model` section configure the VLLM-based engine to generate and parse OpenAI-compatible tool calls. -We use the `Qwen3` model and host model with vllm. The configuration for different model can be found in [VLLM Toolcalls](https://docs.vllm.ai/en/stable/features/tool_calling.html#qwen-models) - - -```yaml -explorer: - rollout_model: - engine_type: vllm_async - # ... - enable_auto_tool_choice: true # Enables the model to generate `tool_calls` - tool_call_parser: hermes # Specifies the parser for formatting tool call outputs - reasoning_parser: deepseek_r1 # Helps in parsing the model's thought process - enable_thinking: true # Enables the model to generate intermediate "thoughts" -``` - -#### Multi-Turn Training Strategy - -This setting in the `algorithm` section defines how experiences from a multi-turn rollout are processed. - -```yaml -algorithm: - algorithm_type: grpo - add_strategy: step_wise_grpo # Key for multi-turn training -``` -- `step_wise_grpo`: This strategy tells Trinity to create a distinct training sample for each step in the agent's execution path. The `grpo` algorithm then uses these samples to update the model. - -#### Asynchronous Synchronization for Efficiency - -Because multi-turn rollouts produce a variable number of experiences, waiting for a fixed number of *rollouts* is inefficient. We use a dynamic synchronization strategy. - -```yaml -synchronizer: - sync_style: dynamic_by_explorer # Start training when enough experiences are ready - sync_interval: 2 -``` -- `sync_style: dynamic_by_explorer`: The trainer starts a training job as soon as the buffer has collected enough *experiences* (i.e., individual turns), rather than waiting for a fixed number of full agent trajectories. This significantly improves GPU utilization and training throughput. - -## How to Run the Example - -1. **Prerequisites**: Ensure you have Trinity installed, along with the dependencies for this example (e.g., `agentscope`). Please refer to [Agentscope Github link](https://github.com/modelscope/agentscope). - -2. Download the model you want to use, and fill in the configuration files in `examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml` or `examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml` - -3. 
**Launch the training job**: Run the following command from the root directory of the repository.
-
-    ```bash
-    trinity run --config examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml
-    ```
-
-    or
-
-    ```bash
-    trinity run --config examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml
-    ```
-
-
-The example here for gsm8k dataset is really simple and it can converge in a few minutes on 8 H20 GPUs.
-
-![](../../docs/sphinx_doc/assets/agentscope_gsm8k_reward.png)
-
-The example here for dapo dataset take a little bit longer, but it also converges.
-
-![](../../docs/sphinx_doc/assets/agentscope_dapo_reward.png)
-
-We can also see that the model generally start to use more tool calls to solve the problems.
-
-![](../../docs/sphinx_doc/assets/agentscope_dapo_turns.png)
-
-
-
-## Summary
-
-This example is simple but demonstrates the power and flexibility of Trinity for training complex, multi-turn agents that use tools. By seamlessly integrating external agentic logic and providing native support for multi-turn training and tool calls, Trinity-RFT empowers you to fine-tune models on sophisticated, realistic tasks with high efficiency.
+The config files are located in [`agentscope_tool_react_gsm8k.yaml`](agentscope_tool_react_gsm8k.yaml) and [`agentscope_tool_react_dapo.yaml`](agentscope_tool_react_dapo.yaml).
diff --git a/examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml b/examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml
index d8ab2b3694..8e96958cde 100644
--- a/examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml
+++ b/examples/agentscope_tool_react/agentscope_tool_react_dapo.yaml
@@ -5,7 +5,6 @@ algorithm:
   algorithm_type: grpo
   repeat_times: 8
   add_strategy: step_wise_grpo
-
 model:
   model_path: /PATH/TO/MODEL/Qwen3-8B
   max_response_tokens: 16384
@@ -16,6 +15,7 @@ cluster:
 buffer:
   total_epochs: 1
   batch_size: 32
+  train_batch_size: 512
   max_retry_times: 3
   max_retry_interval: 1
   explorer_input:
@@ -42,7 +42,6 @@ explorer:
   runner_num: 4
   max_timeout: 360
   rollout_model:
-    engine_type: vllm_async
     engine_num: 4
     tensor_parallel_size: 1
     enable_prefix_caching: false
diff --git a/examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml b/examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml
index 951de8c578..9a31f2953e 100644
--- a/examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml
+++ b/examples/agentscope_tool_react/agentscope_tool_react_gsm8k.yaml
@@ -5,7 +5,6 @@ algorithm:
   algorithm_type: grpo
   repeat_times: 8
   add_strategy: step_wise_grpo
-
 model:
   model_path: /PATH/TO/MODEL/Qwen3-4B
   max_response_tokens: 16384
@@ -16,6 +15,7 @@ cluster:
 buffer:
   total_epochs: 1
   batch_size: 32
+  train_batch_size: 256
   max_retry_times: 3
   max_retry_interval: 1
   explorer_input:
@@ -42,7 +42,6 @@ explorer:
   runner_num: 4
   max_timeout: 360
   rollout_model:
-    engine_type: vllm_async
     engine_num: 4
     tensor_parallel_size: 1
     enable_prefix_caching: false
diff --git a/examples/grpo_alfworld_general_multi_step/README.md b/examples/grpo_alfworld_general_multi_step/README.md
new file mode 100644
index 0000000000..d7f18e6238
--- /dev/null
+++ b/examples/grpo_alfworld_general_multi_step/README.md
@@ -0,0 +1,13 @@
# ALFWorld with general multi-step workflow

This example shows an updated implementation for training on ALFWorld, now built with a general multi-step workflow.
Please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_step_wise.md) for more details.

The config files are located in [`alfworld.yaml`](alfworld.yaml) and [`train_alfworld.yaml`](train_alfworld.yaml).


The training performance of this example is shown as follows:

![Reward Curve](../../docs/sphinx_doc/assets/alfworldv2_reward.png)
diff --git a/examples/grpo_alfworld_general_multi_step/alfworld.yaml b/examples/grpo_alfworld_general_multi_step/alfworld.yaml new file mode 100644 index 0000000000..5685c574ba --- /dev/null +++ b/examples/grpo_alfworld_general_multi_step/alfworld.yaml @@ -0,0 +1,66 @@ +project: "ALFWORLD" +name: "Step_Wise_Alfworld" +checkpoint_root_dir: /PATH/TO/CHECKPOINT/ALFWORLD_RFT/ +algorithm: + algorithm_type: grpo + repeat_times: 16 + add_strategy: step_wise_grpo +model: + model_path: /PATH/TO/MODEL/ + max_response_tokens: 16384 + max_model_len: 20480 +cluster: + node_num: 1 + gpu_per_node: 8 +buffer: + total_epochs: 20 + batch_size: 16 + train_batch_size: 7680 # 16 * 16 * 30 + max_retry_times: 3 + max_retry_interval: 1 + explorer_input: + taskset: + name: alfworld + storage_type: file + path: 'examples/grpo_alfworld/alfworld_data' # PATH TO ALFWORLD DATA + format: + prompt_key: 'game_file' + rollout_args: + temperature: 1.0 + logprobs: 0 + workflow_args: + max_env_steps: 30 + enable_progress_bar: false + default_workflow_type: 'step_wise_alfworld_workflow' + trainer_input: + experience_buffer: + name: alfworld_buffer + storage_type: queue + use_priority_queue: true +explorer: + max_repeat_times_per_runner: 1 + runner_num: 32 + max_timeout: 3600 + rollout_model: + enable_history: true + engine_num: 2 + tensor_parallel_size: 2 + enable_prefix_caching: false + enforce_eager: true + dtype: bfloat16 + seed: 42 + gpu_memory_utilization: 0.7 + enable_chunked_prefill: true + env_vars: + TMPDIR: /PATH/TO/ALFWORLD_TMP_DIR +synchronizer: + sync_style: dynamic_by_explorer + sync_method: 'nccl' + sync_interval: 2 + sync_timeout: 3600 +trainer: + trainer_type: 'verl' + trainer_config_path: 'examples/grpo_alfworld_general_multi_step/train_alfworld.yaml' + save_interval: 50 +monitor: + monitor_type: 'wandb' diff --git a/examples/grpo_alfworld_general_multi_step/train_alfworld.yaml b/examples/grpo_alfworld_general_multi_step/train_alfworld.yaml new file mode 100644 index 0000000000..a59982f49f --- /dev/null +++ b/examples/grpo_alfworld_general_multi_step/train_alfworld.yaml @@ -0,0 +1,49 @@ +actor_rollout_ref: + hybrid_engine: True + model: + external_lib: null + override_config: { } + enable_gradient_checkpointing: True + use_remove_padding: False + actor: + strategy: fsdp # This is for backward-compatibility + ppo_micro_batch_size_per_gpu: 1 + use_dynamic_bsz: False + ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} + grad_clip: 1.0 + ppo_epochs: 1 + shuffle: False + ulysses_sequence_parallel_size: 1 # sp size + optim: + lr: 5e-6 + lr_warmup_steps_ratio: 0. # the total steps will be injected during runtime + # min_lr_ratio: null # only useful for warmup with cosine + warmup_style: constant # select from constant/cosine + total_training_steps: -1 # must be override by program + fsdp_config: + wrap_policy: + # transformer_layer_cls_to_wrap: None + min_num_params: 0 + param_offload: False + optimizer_offload: False + fsdp_size: -1 + ref: + fsdp_config: + param_offload: False + wrap_policy: + # transformer_layer_cls_to_wrap: None + min_num_params: 0 + log_prob_micro_batch_size_per_gpu: 1 + log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz} + log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu} + ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size + +trainer: + balance_batch: True + # total_training_steps: null + # auto: find the last ckpt to resume. 
If can't find, start from scratch + resume_mode: auto # or auto or resume_path if + default_hdfs_dir: null + remove_previous_ckpt_in_save: False + del_local_ckpt_after_load: False + val_before_train: False diff --git a/trinity/common/verl_config.py b/trinity/common/verl_config.py index e203378987..4041ec1a67 100644 --- a/trinity/common/verl_config.py +++ b/trinity/common/verl_config.py @@ -13,7 +13,7 @@ @dataclass class Data: - train_batch_size: int = 1024 # kept for RayPPOTrainer._validate_config + train_batch_size: int = 1024 # kept to pass RayPPOTrainer._validate_config @dataclass @@ -315,6 +315,9 @@ def synchronize_config(self, config: Config) -> None: # noqa: C901 self.trainer.resume_mode = "auto" self.buffer = config.buffer + self.data.train_batch_size = ( + config.buffer.train_batch_size + ) # kept to pass RayPPOTrainer._validate_config self.synchronizer = config.synchronizer self.actor_rollout_ref.synchronizer = config.synchronizer diff --git a/trinity/common/workflows/__init__.py b/trinity/common/workflows/__init__.py index d5321976c9..ebafdb066c 100644 --- a/trinity/common/workflows/__init__.py +++ b/trinity/common/workflows/__init__.py @@ -3,7 +3,7 @@ from .customized_math_workflows import MathBoxedWorkflow from .customized_toolcall_workflows import ToolCallWorkflow from .envs.agentscope.agentscope_react_workflow import AgentScopeReactV2MathWorkflow -from .envs.alfworld.alfworld_workflow import AlfworldWorkflow +from .envs.alfworld.alfworld_workflow import AlfworldWorkflow, StepWiseAlfworldWorkflow from .envs.sciworld.sciworld_workflow import SciWorldWorkflow from .envs.webshop.webshop_workflow import WebShopWorkflow from .eval_workflow import MathEvalWorkflow @@ -18,6 +18,7 @@ "MathWorkflow", "WebShopWorkflow", "AlfworldWorkflow", + "StepWiseAlfworldWorkflow", "SciWorldWorkflow", "MathBoxedWorkflow", "MathRMWorkflow", diff --git a/trinity/common/workflows/envs/alfworld/alfworld_workflow.py b/trinity/common/workflows/envs/alfworld/alfworld_workflow.py index be258484e1..093d3f27cf 100644 --- a/trinity/common/workflows/envs/alfworld/alfworld_workflow.py +++ b/trinity/common/workflows/envs/alfworld/alfworld_workflow.py @@ -3,6 +3,7 @@ from trinity.common.experience import Experience from trinity.common.models.model import ModelWrapper +from trinity.common.workflows.step_wise_workflow import RewardPropagationWorkflow from trinity.common.workflows.workflow import WORKFLOWS, MultiTurnWorkflow, Task EXAMPLE_PROMPT = """ @@ -108,7 +109,7 @@ def __init__( ) self.task_desc = task.task_desc or "0" self.repeat_times = task.repeat_times - self.max_env_steps = 30 + self.max_env_steps = task.workflow_args.get("max_env_steps", 30) def get_model_response(self, messages): responses = self.model.chat(messages, n=1) @@ -177,3 +178,118 @@ def create_environment(game_file): raise ImportError(error_message) env = create_environment(game_file_path) return self.generate_env_inference_samples(env, rollout_n) + + +@WORKFLOWS.register_module("step_wise_alfworld_workflow") +class StepWiseAlfworldWorkflow(RewardPropagationWorkflow): + """ + An Alfworld workflow refactored to use the RewardPropagationWorkflow base class. + + This workflow manages an Alfworld environment, interacts with it step-by-step + using a model, and calculates a final reward based on the episode's outcome. 
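+
+    The maximum number of environment steps comes from `task.workflow_args["max_env_steps"]`
+    (default 30). If the episode terminates, the environment's final reward is used; otherwise
+    a default reward of -0.1 is kept. `RewardPropagationWorkflow` then propagates the returned
+    reward to every step's experience.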
+ """ + + def __init__( + self, + model: ModelWrapper, + task: Task, + auxiliary_models: Optional[List] = None, + use_openai_client: bool = False, + ): + super().__init__( + model=model, + task=task, + auxiliary_models=auxiliary_models, + use_openai_client=use_openai_client, + ) + self.game_file_path = task.task_desc or "0" + self.max_env_steps = task.workflow_args.get("max_env_steps", 30) + + self._setup_environment() + + self.observation: Optional[str] = None + self.done: bool = False + self.final_reward: float = 0.0 + self.memory: List[dict] = [] + + def _setup_environment(self): + """Initializes the Alfworld text-based environment.""" + try: + import textworld + import textworld.gym + from alfworld.agents.environment.alfred_tw_env import ( + AlfredDemangler, + AlfredExpert, + AlfredExpertType, + ) + + def create_environment(game_file): + expert = AlfredExpert(expert_type=AlfredExpertType.HANDCODED) + request_infos = textworld.EnvInfos( + description=True, inventory=True, admissible_commands=True + ) + env_id = textworld.gym.register_game( + game_file, request_infos, wrappers=[AlfredDemangler(), expert] + ) + env = textworld.gym.make(env_id) + return env + + self.env = create_environment(self.game_file_path) + + except ImportError as e: + error_message = ( + f"Error importing Alfworld dependencies: {e}. Please ensure " + "Alfworld is installed correctly by following the instructions at " + "https://github.com/alfworld/alfworld" + ) + raise ImportError(error_message) + + def run(self) -> List[Experience]: + # Reset environment and state for a new episode + self.observation, info = self.env.reset() + self.done = False + self.final_reward = -0.1 + + self.memory.clear() + self.memory.append({"role": "system", "content": AlfWORLD_SYSTEM_PROMPT}) + + return super().run() + + def step(self, step_num: int) -> bool: + if self.done: + return False + + # Format observation for the model + format_obs = format_observation(self.observation) # type: ignore + self.memory.append({"role": "user", "content": format_obs}) + + # Get action from the model + responses = self.model.chat(self.memory) + response_text = responses[0].response_text + self.memory.append({"role": "assistant", "content": response_text}) + action = parse_action(response_text) + + # Execute action in the environment + observation, reward, done, info = self.env.step(action) + + # Update internal state + self.observation = observation + self.done = done + if self.done: + self.final_reward = reward + + # Return False to stop the run if the episode is done + return not self.done + + def reward(self, exps: list[Experience]) -> float: + return self.final_reward + + @property + def max_step_num(self) -> int: + """Return the maximum number of steps allowed in an episode.""" + return self.max_env_steps + + def __del__(self): + """Ensures the environment is closed when the workflow object is destroyed.""" + if hasattr(self, "env"): + self.env.close() diff --git a/trinity/common/workflows/step_wise_workflow.py b/trinity/common/workflows/step_wise_workflow.py index 2e3317efc8..20dd294a21 100644 --- a/trinity/common/workflows/step_wise_workflow.py +++ b/trinity/common/workflows/step_wise_workflow.py @@ -10,14 +10,19 @@ class StepWiseRewardWorkflow(Workflow): """A workflow that implements step-wise rewards for tasks.""" - def __init__(self, *, task: Task, model: ModelWrapper, auxiliary_models=None): + def __init__( + self, *, task: Task, model: ModelWrapper, auxiliary_models=None, use_openai_client=True + ): super().__init__(task=task, model=model, 
auxiliary_models=auxiliary_models) assert model.enable_history, ( "Rollout Model must have history enabled for step-wise rewards, please " "set `explorer.rollout_model.enable_history` to `True` in your config." ) # use the rollout model's OpenAI client to write your agent application - self.client: openai.OpenAI = model.get_openai_client() + if use_openai_client: + self.client: openai.OpenAI = model.get_openai_client() + else: + self.client = None def run(self) -> list[Experience]: """Run the workflow and return a list of experiences with step-wise rewards.""" @@ -74,14 +79,19 @@ def repeatable(self): class RewardPropagationWorkflow(Workflow): """A workflow that propagates rewards across multiple turns.""" - def __init__(self, *, task: Task, model: ModelWrapper, auxiliary_models=None): + def __init__( + self, *, task: Task, model: ModelWrapper, auxiliary_models=None, use_openai_client=True + ): super().__init__(task=task, model=model, auxiliary_models=auxiliary_models) assert model.enable_history, ( "Rollout Model must have history enabled for step-wise rewards, please " "set `explorer.rollout_model.enable_history` to `True` in your config." ) # use the rollout model's OpenAI client to write your agent application - self.client: openai.OpenAI = model.get_openai_client() + if use_openai_client: + self.client: openai.OpenAI = model.get_openai_client() + else: + self.client = None def run(self) -> list[Experience]: """Run the workflow and return a list of experiences with step-wise rewards.""" @@ -101,6 +111,9 @@ def run(self) -> list[Experience]: reward = self.reward(experiences) for exp in experiences: exp.reward = reward + if exp.metrics is None: + exp.metrics = {} + exp.metrics["actual_env_steps"] = step + 1 # +1 because step starts from 0 return experiences @abstractmethod diff --git a/trinity/common/workflows/workflow.py b/trinity/common/workflows/workflow.py index f1fcdf080f..20e03c9271 100644 --- a/trinity/common/workflows/workflow.py +++ b/trinity/common/workflows/workflow.py @@ -132,7 +132,7 @@ def run(self) -> List[Experience]: class MultiTurnWorkflow(Workflow): """ - The base workflow class for multi-turn tasks. + The base workflow class for concatenated multi-turn tasks. """ def __init__(