6 changes: 2 additions & 4 deletions benchmark/config/countdown-template.yaml
@@ -35,11 +35,9 @@ buffer:
rollout_args:
temperature: 1.0
logprobs: 0
default_workflow_type: math_workflow
default_reward_fn_type: countdown_reward
eval_tasksets: []
default_workflow_type: math_workflow
default_reward_fn_type: countdown_reward
system_prompt: null
reply_prefix: null
trainer_input:
experience_buffer:
name: experience_buffer
8 changes: 3 additions & 5 deletions benchmark/config/gsm8k-template.yaml
@@ -40,11 +40,9 @@ buffer:
rollout_args:
temperature: 1.0
logprobs: 0
default_workflow_type: math_workflow
default_reward_fn_type: math_reward
eval_tasksets: []
default_workflow_type: math_workflow
default_reward_fn_type: math_reward
system_prompt: null
reply_prefix: null
trainer_input:
experience_buffer:
name: experience_buffer
@@ -79,7 +77,7 @@ trainer:
enable_preview: true
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 10240
max_token_len_per_gpu: 10240
ulysses_sequence_parallel_size: 1
monitor:
monitor_type: wandb
12 changes: 6 additions & 6 deletions docs/sphinx_doc/source/tutorial/example_async_mode.md
@@ -39,14 +39,14 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
storage_type: queue
path: 'sqlite:///gsm8k.db'
explorer:
runner_num: 32
runner_per_model: 8
rollout_model:
engine_num: 4
synchronizer:
@@ -86,7 +86,7 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
@@ -98,7 +98,7 @@ synchronizer:
trainer:
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 16384
max_token_len_per_gpu: 16384
ulysses_sequence_parallel_size: 1
```

@@ -133,7 +133,7 @@ cluster: # important
gpu_per_node: 8
explorer:
name: 'explorer_new' # important
runner_num: 64
runner_per_model: 8
rollout_model:
engine_num: 8
buffer:
@@ -150,7 +150,7 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
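A note on the `runner_num` → `runner_per_model` rename above: the numbers in this diff are consistent with the total runner count being `engine_num × runner_per_model` (4 × 8 = 32 replacing `runner_num: 32`, and 8 × 8 = 64 replacing `runner_num: 64`). That relationship is inferred from the examples, not stated by the PR; a minimal sketch under that assumption:

```yaml
explorer:
  runner_per_model: 8   # per-engine runner count; replaces the old global `runner_num`
  rollout_model:
    engine_num: 4       # assumed total concurrency: 4 engines x 8 runners = 32
```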
7 changes: 4 additions & 3 deletions docs/sphinx_doc/source/tutorial/example_reasoning_basic.md
@@ -77,6 +77,7 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
eval_tasksets:
- name: gsm8k-eval
storage_type: file
@@ -86,15 +87,15 @@ buffer:
format:
prompt_key: 'question'
response_key: 'answer'
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
storage_type: queue
path: 'sqlite:///gsm8k.db'
explorer:
eval_interval: 50
runner_num: 16
runner_per_model: 16
rollout_model:
engine_num: 1
synchronizer:
@@ -117,7 +118,7 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml

## Optional: RFT with SFT Warmup

Before RFT, we may use SFT as a warmup step. Trinity-RFT supports adding an SFT warmup stage by setting `stages` in the config file. The `sft_warmup_dataset` field specifies the dataset used for SFT warmup, and `sft_warmup_steps` specifies the number of SFT warmup training steps.
Before RFT, we may use SFT as a warmup step. Trinity-RFT supports adding an SFT warmup stage by setting `stages` in the config file. The `experience_buffer` field specifies the dataset used for SFT warmup, and `total_steps` specifies the number of SFT warmup training steps.

```yaml
# Properly add the following configs in gsm8k.yaml
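For reference, a minimal sketch of what the collapsed `stages` block might look like, using the `experience_buffer` and `total_steps` fields described above; the stage-level field names (`stage_name`, `mode`, `algorithm_type`) are hypothetical and not taken from this PR:

```yaml
stages:
  - stage_name: sft_warmup          # hypothetical name for the warmup stage
    mode: train
    algorithm:
      algorithm_type: sft           # assumed switch to SFT for the warmup stage
    buffer:
      total_steps: 100              # number of SFT warmup training steps
      trainer_input:
        experience_buffer:          # dataset used for SFT warmup
          name: sft_warmup_data
          storage_type: file
          path: 'path/to/sft/dataset'
```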
6 changes: 3 additions & 3 deletions docs/sphinx_doc/source/tutorial/example_step_wise.md
@@ -121,15 +121,15 @@ buffer:
workflow_args:
max_env_steps: 30
enable_progress_bar: false
default_workflow_type: 'step_wise_alfworld_workflow'
default_workflow_type: 'step_wise_alfworld_workflow'
trainer_input:
experience_buffer:
name: alfworld_buffer
storage_type: queue
use_priority_queue: true
explorer:
max_repeat_times_per_runner: 1
runner_num: 32
runner_per_model: 32
max_timeout: 3600
rollout_model:
enable_history: true
@@ -152,7 +152,7 @@ trainer:
save_interval: 50
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 16384
max_token_len_per_gpu: 16384
ulysses_sequence_parallel_size: 1
```

16 changes: 10 additions & 6 deletions docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -200,6 +200,7 @@ buffer:
batch_size: 32
train_batch_size: 256
total_epochs: 100
total_steps: null

explorer_input:
taskset:
@@ -214,9 +215,6 @@ buffer:
...
buffer_2:
...

default_workflow_type: 'math_workflow'
default_reward_fn_type: 'countdown_reward'
```

- `batch_size`: Number of tasks used per training step. *Please do not multiply this value by `algorithm.repeat_times` manually*.
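To make the warning concrete, a sketch of how these values appear to relate, assuming `algorithm.repeat_times: 8` (the `algorithm` section is not shown in this snippet) and assuming the framework derives the experience count internally:

```yaml
buffer:
  batch_size: 32          # tasks per step; leave as the plain task count
  train_batch_size: 256   # experiences per step: 32 tasks x 8 rollouts = 256
algorithm:
  repeat_times: 8         # assumed value, for illustration only
```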
@@ -231,6 +229,9 @@ Defines the dataset(s) used by the explorer for training and evaluation.
```yaml
buffer:
explorer_input:
default_workflow_type: 'math_workflow'
default_eval_workflow_type: 'math_workflow'
default_reward_fn_type: 'countdown_reward'
taskset:
name: countdown_train
storage_type: file
@@ -262,7 +263,7 @@ buffer:
```

- `buffer.explorer_input.taskset`: Task dataset used for training exploration policies.
- `buffer.explorer_input.eval_taskset`: List of task datasets used for evaluation.
- `buffer.explorer_input.eval_tasksets`: List of task datasets used for evaluation.
- `buffer.explorer_input.default_workflow_type`: Default workflow type for all task datasets under `explorer_input` if not specified at the dataset level.
- `buffer.explorer_input.default_eval_workflow_type`: Default evaluation workflow type for all eval task datasets under `explorer_input` if not specified at the dataset level.
- `buffer.explorer_input.default_reward_fn_type`: Default reward function type for all task datasets under `explorer_input` if not specified at the dataset level.

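A sketch of how these defaults interact with per-dataset settings; that each dataset entry may carry its own `default_workflow_type` key, overriding the `explorer_input`-level value, is inferred from the example configs in this PR rather than stated outright:

```yaml
buffer:
  explorer_input:
    default_workflow_type: 'math_workflow'      # fallback for every dataset below
    default_reward_fn_type: 'countdown_reward'
    taskset:
      name: countdown_train                     # no workflow fields: uses the defaults above
    eval_tasksets:
      - name: gsm8k-eval
        default_workflow_type: 'math_workflow'  # dataset-level value wins over the default
```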
The configuration for each task dataset is defined as follows:

@@ -413,7 +417,7 @@ trainer:
save_strategy: "unrestricted"
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 16384
max_token_len_per_gpu: 16384
ulysses_sequence_parallel_size: 1
trainer_config: null
```
@@ -429,7 +433,7 @@ trainer:
- `unrestricted`: No restrictions on saving operations; multiple nodes, processes, or threads are allowed to save the model simultaneously.
- `grad_clip`: Gradient clipping for updates.
- `use_dynamic_bsz`: Whether to use dynamic batch size.
- `ppo_max_token_len_per_gpu`: The maximum number of tokens to be processed in forward and backward when updating the policy. Effective when `use_dynamic_bsz=true`.
- `max_token_len_per_gpu`: The maximum number of tokens to be processed in forward and backward when updating the policy. Effective when `use_dynamic_bsz=true`.
- `ulysses_sequence_parallel_size`: Sequence parallel size.
- `trainer_config`: The trainer configuration provided inline.
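Since this PR renames the token-cap key across many configs, older configs need a one-line migration; a minimal before/after sketch:

```yaml
trainer:
  use_dynamic_bsz: true
  # before: ppo_max_token_len_per_gpu: 16384
  max_token_len_per_gpu: 16384   # renamed key, same meaning
```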
---
10 changes: 5 additions & 5 deletions docs/sphinx_doc/source_zh/tutorial/example_async_mode.md
@@ -39,14 +39,14 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
storage_type: queue
path: 'sqlite:///gsm8k.db'
explorer:
runner_num: 32
runner_per_model: 16
rollout_model:
engine_num: 4
synchronizer:
@@ -86,7 +86,7 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
@@ -133,7 +133,7 @@ cluster: # important
gpu_per_node: 8
explorer:
name: 'explorer_new' # important
runner_num: 64
runner_per_model: 8
rollout_model:
engine_num: 8
buffer:
@@ -150,7 +150,7 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
7 changes: 4 additions & 3 deletions docs/sphinx_doc/source_zh/tutorial/example_reasoning_basic.md
@@ -77,6 +77,7 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 1.0
default_workflow_type: 'math_workflow'
eval_tasksets:
- name: gsm8k-eval
storage_type: file
@@ -86,15 +87,15 @@ buffer:
format:
prompt_key: 'question'
response_key: 'answer'
default_workflow_type: 'math_workflow'
default_workflow_type: 'math_workflow'
trainer_input:
experience_buffer:
name: gsm8k_buffer
storage_type: queue
path: 'sqlite:///gsm8k.db'
explorer:
eval_interval: 50
runner_num: 16
runner_per_model: 16
rollout_model:
engine_num: 1
synchronizer:
@@ -117,7 +118,7 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml

## Advanced Option: RFT with SFT Warmup

Before RFT, we can first use SFT as a warmup step. Trinity-RFT supports adding an SFT warmup stage by setting `stages` in the config file. `sft_warmup_dataset` specifies the dataset used for SFT warmup, and `sft_warmup_steps` specifies the number of SFT warmup training steps.
Before RFT, we can first use SFT as a warmup step. Trinity-RFT supports adding an SFT warmup stage by setting `stages` in the config file. `experience_buffer` specifies the dataset used for SFT warmup, and `total_steps` specifies the number of SFT warmup training steps.

```yaml
# Properly add the following configs in gsm8k.yaml
6 changes: 3 additions & 3 deletions docs/sphinx_doc/source_zh/tutorial/example_step_wise.md
@@ -119,15 +119,15 @@ buffer:
workflow_args:
max_env_steps: 30
enable_progress_bar: false
default_workflow_type: 'step_wise_alfworld_workflow'
default_workflow_type: 'step_wise_alfworld_workflow'
trainer_input:
experience_buffer:
name: alfworld_buffer
storage_type: queue
use_priority_queue: true
explorer:
max_repeat_times_per_runner: 1
runner_num: 32
runner_per_model: 16
max_timeout: 3600
rollout_model:
enable_history: true
@@ -150,7 +150,7 @@ trainer:
save_interval: 50
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 16384
max_token_len_per_gpu: 16384
ulysses_sequence_parallel_size: 1
```

17 changes: 9 additions & 8 deletions docs/sphinx_doc/source_zh/tutorial/trinity_configs.md
@@ -214,9 +214,6 @@ buffer:
...
buffer_2:
...

default_workflow_type: 'math_workflow'
default_reward_fn_type: 'countdown_reward'
```

- `batch_size`: Number of tasks used per training step. *Please do not multiply this value by `algorithm.repeat_times` manually*.
@@ -231,6 +228,9 @@ buffer:
```yaml
buffer:
explorer_input:
default_workflow_type: 'math_workflow'
default_eval_workflow_type: 'math_workflow'
default_reward_fn_type: 'countdown_reward'
taskset:
name: countdown_train
storage_type: file
@@ -256,13 +256,14 @@ buffer:
response_key: 'answer'
rollout_args:
temperature: 0.1
default_workflow_type: 'math_workflow'
default_reward_fn_type: 'countdown_reward'
...
```

- `buffer.explorer_input.taskset`: Task dataset used for training exploration policies.
- `buffer.explorer_input.eval_taskset`: List of task datasets used for evaluation.
- `buffer.explorer_input.eval_tasksets`: List of task datasets used for evaluation.
- `buffer.explorer_input.default_workflow_type`: Default workflow type for all task datasets under `explorer_input` if not specified at the dataset level.
- `buffer.explorer_input.default_eval_workflow_type`: Default workflow type for all evaluation task datasets under `explorer_input` if not specified at the dataset level.
- `buffer.explorer_input.default_reward_fn_type`: Default reward function type for all task datasets under `explorer_input` if not specified at the dataset level.

每个任务数据集的配置定义如下:

@@ -413,7 +414,7 @@ trainer:
save_strategy: "unrestricted"
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 16384
max_token_len_per_gpu: 16384
ulysses_sequence_parallel_size: 1
trainer_config: null
```
@@ -429,7 +430,7 @@ trainer:
- `unrestricted`: No restrictions on saving operations; multiple nodes, processes, or threads are allowed to save the model simultaneously.
- `grad_clip`: Gradient clipping threshold.
- `use_dynamic_bsz`: Whether to use dynamic batch size.
- `ppo_max_token_len_per_gpu`: The maximum token length per GPU during training; effective when `use_dynamic_bsz=true`.
- `max_token_len_per_gpu`: The maximum token length per GPU during training; effective when `use_dynamic_bsz=true`.
- `ulysses_sequence_parallel_size`: Degree of sequence parallelism, i.e., the number of GPUs used to split a single sequence.
- `trainer_config`: The trainer configuration provided inline.

2 changes: 1 addition & 1 deletion examples/RAFT_alfworld/RAFT_alfworld_7B.yaml
@@ -70,7 +70,7 @@ trainer:
save_interval: 100000
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 20000 # Adjusted for alfworld longer sequences
max_token_len_per_gpu: 20000 # Adjusted for alfworld longer sequences
ulysses_sequence_parallel_size: 1
monitor:
monitor_type: wandb
2 changes: 1 addition & 1 deletion examples/RAFT_alfworld/RAFT_reflect_alfworld_7B.yaml
@@ -70,7 +70,7 @@ trainer:
save_interval: 100000
grad_clip: 1.0
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 20000 # Adjusted for alfworld longer sequences
max_token_len_per_gpu: 20000 # Adjusted for alfworld longer sequences
ulysses_sequence_parallel_size: 1
monitor:
monitor_type: wandb