Binary file removed docs/sphinx_doc/assets/DYN-NCCL.png
Binary file removed docs/sphinx_doc/assets/DYN-STATEDICT.png
Binary file removed docs/sphinx_doc/assets/FIXED-NCCL.png
Binary file removed docs/sphinx_doc/assets/FIXED-STATEDICT.png
Binary file added docs/sphinx_doc/assets/NCCL-en.png
Binary file added docs/sphinx_doc/assets/NCCL-zh.png
Binary file added docs/sphinx_doc/assets/STATEDICT-en.png
Binary file added docs/sphinx_doc/assets/STATEDICT-zh.png
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source/tutorial/develop_operator.md
@@ -71,7 +71,7 @@ data_processor:
threshold: 0.1
synchronizer:
sync_method: nccl
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_interval: 2
# some other configs
```
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source/tutorial/example_react.md
@@ -155,7 +155,7 @@ Since agent applications may have variable interaction rounds and sample counts,

```yaml
synchronizer:
sync_style: dynamic_by_explorer # Trainer starts training immediately when enough data is generated, rather than padding to a fixed size, improving efficiency
sync_style: explorer_driven # Trainer starts training immediately when enough data is generated, rather than padding to a fixed size, improving efficiency
sync_interval: 2 # Check for model parameter updates after every two batches
```

4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/example_step_wise.md
@@ -69,7 +69,7 @@ In general multi-step scenarios, each run may generate various number of experie

- `buffer.trainer_input.experience_buffer.replay_buffer`: Using `PriorityQueue` allows the model to use the experiences with higher priority, which prefers newly-generated experiences by default.

- `synchronizer.sync_style = dynamic_by_explorer`: The explorer determines when to synchronize the model weights with the trainer.
- `synchronizer.sync_style = explorer_driven`: The explorer determines when to synchronize the model weights with the trainer.


The example configuration is shown as:
@@ -134,7 +134,7 @@ explorer:
env_vars:
TMPDIR: ${oc.env:TMPDIR,/tmp}
synchronizer:
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_method: 'nccl'
sync_interval: 2
sync_timeout: 3600
93 changes: 43 additions & 50 deletions docs/sphinx_doc/source/tutorial/synchronizer.md
@@ -29,21 +29,27 @@ To achieve this, the Synchronizer:
async def train(self) -> str:
while self.train_step_num < self.total_steps:
try:
metrics = {}
# sampling may block if the explorer has not yet generated enough data
self.logger.info(f"Sample data for step {self.train_step_num + 1} started.")
sample_task = asyncio.create_task(self._sample_data())
while not sample_task.done():
# sync weight to make sure the explorer can continue to explore and generate enough data
if await self.need_sync():
# Currently, we do not record the metrics of sync_weight here
await self.sync_weight()
metrics.update(await self.sync_weight())
await asyncio.sleep(1)
exps, metrics, repr_samples = await sample_task
exps, sample_metrics, repr_samples = await sample_task
metrics.update(sample_metrics)
self.logger.info(f"Sample data for step {self.train_step_num + 1} finished.")
metrics.update(await self.train_step(exps))
if await self.need_sync():
metrics.update(await self.sync_weight())
# ...
if self.need_save():
metrics.update(
await self.save_checkpoint(save_as_hf=self.save_hf_checkpoint == "always")
)
if self.config.trainer.enable_preview:
self._log_experiences(repr_samples)
self.monitor.log(metrics, self.train_step_num)
except StopAsyncIteration:
self.logger.info("No more samples to train. Stopping training.")
@@ -145,77 +151,66 @@ There are **two synchronization styles** that define *when* the Explorer request
| `interval=10, offset=0` | Sync every 10 steps (both start together) |
| `interval=10, offset=5` | Explorer runs 5 steps first, then sync every 10 steps |

✅ **Best for**: Simple, predictable environments where exploration steps are short and rewards are frequent (e.g., mathematical reasoning tasks).

> 🔁 Think of it as a metronome — steady and regular.
🎯 **Best for**: Simple, predictable environments with short exploration episodes and frequent rewards (e.g., mathematical reasoning tasks).
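
A minimal configuration sketch for this style (`sync_offset` is an assumed key name for the offset shown in the table above; verify it against the configuration reference):

```yaml
# Sketch: fixed-style synchronization with the explorer leading by 5 steps.
synchronizer:
  sync_style: fixed
  sync_method: nccl    # direct GPU-to-GPU weight transfer
  sync_interval: 10    # synchronize every 10 steps
  sync_offset: 5       # assumed key name: explorer runs 5 steps before the first sync
```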

---

### 2. `SyncStyle.DYNAMIC_BY_EXPLORER` – Demand-Driven Sync
### 2. `SyncStyle.EXPLORER_DRIVEN` – Explorer-Driven Synchronization

- The Explorer itself decides when it needs a new model.
- Workflow:
1. After completing `sync_interval` steps, the Explorer sends a request to the Synchronizer to update its parameters.
2. The Trainer detects this request in its next loop iteration and performs the synchronization.
3. Once synchronization completes, both the Explorer and Trainer continue running.
4. If a timeout occurs, the Explorer retries in the next cycle.

- Explorer decides to request a sync after generating a certain amount of data.
- It tells Synchronizer: _"I’m ready for a new model!"_
- Trainer checks this request during its normal loop and responds accordingly.
🎯 **Best for**: Scenarios where the Explorer’s pace is irregular or when on-demand model updates are preferred.
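
As a sketch, with illustrative values (the keys mirror the step-wise example configuration elsewhere in these docs):

```yaml
# Sketch: the explorer asks for new weights after every 2 of its own steps.
synchronizer:
  sync_style: explorer_driven
  sync_method: nccl
  sync_interval: 2     # measured in explorer steps
  sync_timeout: 3600   # seconds a pending sync request waits before the explorer retries
```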

📌 **Process Flow**:
1. Explorer finishes `N` steps → sets state to `REQUIRE_SYNC`.
2. Waits for Trainer to acknowledge and perform sync.
3. Once synced, returns to `RUNNING`.
4. If timeout occurs, retries on next step.
---

✅ **Best for**: Complex, long-horizon tasks where data generation is expensive or variable (e.g., multi-turn dialogue, game playing).
### 3. `SyncStyle.TRAINER_DRIVEN` – Trainer-Driven Synchronization

> 🔄 More flexible — adapts to actual data throughput.
- The Trainer determines when to release a new model.
- Workflow:
1. Every `sync_interval` steps, the Trainer requests synchronization.
2. It notifies the Synchronizer to prepare to push the new model.
3. The Explorer detects this request during its normal loop and responds by performing synchronization.

🎯 **Best for**: Cases where the Trainer has a clear, consistent training rhythm, and the Explorer passively receives updates.
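
A sketch of this style, under the assumption that `checkpoint` is the `sync_method` value selecting checkpoint-based transfer:

```yaml
# Sketch: the trainer triggers a sync after every 2 of its own steps.
synchronizer:
  sync_style: trainer_driven
  sync_method: checkpoint  # assumed value name for checkpoint-based transfer; nccl also works
  sync_interval: 2         # measured in trainer steps
```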

---

## State Management: What’s Going On Behind the Scenes?

The Synchronizer tracks the **state** of both Trainer and Explorer to manage synchronization safely.

### Four Key States
### Three Key States

| State | Meaning |
|------|--------|
| `STOPPED` | Component has stopped working |
| `RUNNING` | Actively training or exploring |
| `REQUIRE_SYNC` | Explorer wants new weights |
| `WAITING_SYNC` | Explorer or Trainer is waiting for synchronization (used in NCCL mode) |
| `REQUIRE_SYNC` | Explorer / Trainer requests new weights |

These states help prevent race conditions and ensure smooth coordination.

---

### State Transitions by Style & Method

#### 🔹 Fixed Style + NCCL Sync
- Synchronizer schedules sync every `N` steps.
- Both sides pause briefly for direct GPU sync.
- The state of the trainer toggles predictably between `RUNNING` ↔ `WAITING_SYNC`, and the state of the explorer toggles among `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.
### State Transitions Across Different Modes and Methods

![FIXED_STYLE_NCCL_SYNC](../../assets/FIXED-NCCL.png)
#### 🔹 NCCL Synchronization
- Both Trainer and Explorer toggle states (`RUNNING` ↔ `REQUIRE_SYNC`).
- Synchronization uses a "two-way handshake": data transfer only begins once both sides are ready.
- After synchronization completes, both return to `RUNNING`.

#### 🔹 Fixed Style + CHECKPOINT/MEMORY
- Trainer saves or sends weights periodically.
- Explorer checks at each interval and pulls updates.
- The state of the trainer remains at `RUNNING`, and the state of the explorer toggles between `RUNNING` ↔ `REQUIRE_SYNC`.
![NCCL Synchronization](../../assets/NCCL-en.png)

![FIXED_STYLE_STATEDICT_SYNC](../../assets/FIXED-STATEDICT.png)
#### 🔹 CHECKPOINT/MEMORY Synchronization
- The Trainer typically remains in the `RUNNING` state (it only saves weights).
- The Explorer initiates the sync request (switches to `REQUIRE_SYNC`), pulls the weights, then returns to `RUNNING`.
- The Synchronizer acts as an intermediary, delivering model weights to the Explorer.


#### 🔹 Dynamic Style + NCCL
- Explorer signals `REQUIRE_SYNC` after enough data.
- Trainer sees the signal and initiates NCCL sync.
- The state of the trainer toggles predictably between `RUNNING` ↔ `WAITING_SYNC`, and the state of the explorer toggles between `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.

![DYN_STYLE_NCCL_SYNC](../../assets/DYN-NCCL.png)

#### 🔹 Dynamic Style + CHECKPOINT/MEMORY
- Explorer signals `REQUIRE_SYNC` after enough data.
- Trainer sees the signal and pushes weights to synchronizer.
- The state of the trainer remains at `RUNNING`, and the state of the explorer toggles between `RUNNING` ↔ `REQUIRE_SYNC`.

![DYN_STYLE_STATEDICT_SYNC](../../assets/DYN-STATEDICT.png)
![CHECKPOINT/MEMORY Synchronization](../../assets/STATEDICT-en.png)

---

@@ -240,9 +235,7 @@
| Use Case | Recommended Style |
|--------|------------------|
| Short episodes, quick feedback (e.g., math QA) | `FIXED` |
| Long interactions, delayed rewards (e.g., games, conversations) | `DYNAMIC_BY_EXPLORER` |

> 💡 `DYNAMIC_BY_EXPLORER` gives more control to the data-generating side, making it better for unbalanced or variable workloads.
| Multi-turn interactive tasks, such as multi-round dialogues, tool usage, or multi-step games | `EXPLORER_DRIVEN` or `TRAINER_DRIVEN` |

---

3 changes: 2 additions & 1 deletion docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -465,7 +465,8 @@ synchronizer:
- `sync_timeout`: Timeout duration for synchronization.
- `sync_style`: Style of synchronization. Options:
- `fixed`: The explorer and trainer synchronize weights every `sync_interval` steps.
- `dynamic_by_explorer`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `explorer_driven`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `trainer_driven`: The trainer notifies the explorer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the explorer has completed at this point.
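
Putting these options together, a minimal illustrative block (example values, not defaults):

```yaml
synchronizer:
  sync_method: nccl            # see the Synchronizer tutorial for available methods
  sync_style: explorer_driven
  sync_interval: 2
  sync_timeout: 1200
```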

---

4 changes: 2 additions & 2 deletions docs/sphinx_doc/source_zh/tutorial/develop_operator.md
@@ -80,12 +80,12 @@ data_processor:
threshold: 0.1
synchronizer:
sync_method: nccl
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_interval: 2
# some other configs
```

```{tip}
`RewardFilter` reduces the number of experiences, which may leave the Trainer without enough experiences to start training. To avoid this, you can use Trinity-RFT's {ref}`dynamic synchronization <Synchronizer>` feature (`dynamic_by_explorer`).
`RewardFilter` reduces the number of experiences, which may leave the Trainer without enough experiences to start training. To avoid this, you can use Trinity-RFT's {ref}`dynamic synchronization <Synchronizer>` feature (`explorer_driven`).
With the settings above, the `Explorer` synchronizes with the `Trainer` once every 2 steps and keeps running regardless of how many steps the `Trainer` has completed. This ensures that as long as the `Explorer` is running, the `Trainer` always has enough experiences to start a training step.
```
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source_zh/tutorial/example_react.md
@@ -162,7 +162,7 @@ algorithm:

```yaml
synchronizer:
sync_style: dynamic_by_explorer # The trainer starts training as soon as enough data has been generated, instead of padding to a fixed batch size, which improves training efficiency
sync_style: explorer_driven # The trainer starts training as soon as enough data has been generated, instead of padding to a fixed batch size, which improves training efficiency
sync_interval: 2 # Check whether the model parameters need to be synchronized after every two batches of tasks
```

4 changes: 2 additions & 2 deletions docs/sphinx_doc/source_zh/tutorial/example_step_wise.md
@@ -68,7 +68,7 @@ WORKFLOWS = Registry(

- `buffer.trainer_input.experience_buffer.replay_buffer`: Using `PriorityQueue` lets the model consume higher-priority experiences first (by default, newly generated experiences).

- `synchronizer.sync_style = dynamic_by_explorer`: The explorer decides when to synchronize model weights with the trainer.
- `synchronizer.sync_style = explorer_driven`: The explorer decides when to synchronize model weights with the trainer.

The example configuration is shown below:

@@ -129,7 +129,7 @@ explorer:
env_vars:
TMPDIR: ${oc.env:TMPDIR,/tmp}
synchronizer:
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_method: 'nccl'
sync_interval: 2
sync_timeout: 3600
88 changes: 40 additions & 48 deletions docs/sphinx_doc/source_zh/tutorial/synchronizer.md
@@ -28,21 +28,27 @@
async def train(self) -> str:
while self.train_step_num < self.total_steps:
try:
metrics = {}
# sampling may block if the explorer has not yet generated enough data
self.logger.info(f"Sample data for step {self.train_step_num + 1} started.")
sample_task = asyncio.create_task(self._sample_data())
while not sample_task.done():
# sync weight to make sure the explorer can continue to explore and generate enough data
if await self.need_sync():
# Currently, we do not record the metrics of sync_weight here
await self.sync_weight()
metrics.update(await self.sync_weight())
await asyncio.sleep(1)
exps, metrics, repr_samples = await sample_task
exps, sample_metrics, repr_samples = await sample_task
metrics.update(sample_metrics)
self.logger.info(f"Sample data for step {self.train_step_num + 1} finished.")
metrics.update(await self.train_step(exps))
if await self.need_sync():
metrics.update(await self.sync_weight())
# ...
if self.need_save():
metrics.update(
await self.save_checkpoint(save_as_hf=self.save_hf_checkpoint == "always")
)
if self.config.trainer.enable_preview:
self._log_experiences(repr_samples)
self.monitor.log(metrics, self.train_step_num)
except StopAsyncIteration:
self.logger.info("No more samples to train. Stopping training.")
@@ -145,76 +151,64 @@ The Explorer checks whether it needs to synchronize at the following moments:
| `interval=10, offset=0` | Sync every 10 steps (both start together) |
| `interval=10, offset=5` | Explorer runs 5 steps first, then sync every 10 steps |

✅ **Best for**: Simple, predictable environments with short exploration steps and frequent rewards (e.g., mathematical reasoning tasks).

> 🔁 Think of it as a metronome: steady and regular.
🎯 **Best for**: Simple, predictable environments with short exploration steps and frequent rewards (e.g., mathematical reasoning tasks).
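
A minimal sketch of this style (`sync_offset` is an assumed key name for the offset in the table above):

```yaml
synchronizer:
  sync_style: fixed
  sync_method: nccl
  sync_interval: 10   # synchronize every 10 steps
  sync_offset: 5      # assumed key name: explorer leads by 5 steps
```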

---

### 2. `SyncStyle.DYNAMIC_BY_EXPLORER` – Demand-Driven Sync
### 2. `SyncStyle.EXPLORER_DRIVEN` – Explorer-Driven Synchronization
- The Explorer itself decides when it needs a new model.
- Workflow:
1. After completing `sync_interval` steps, the Explorer sends a request to the Synchronizer to update its parameters.
2. The Trainer detects this request in its next loop iteration and completes the synchronization.
3. Once synchronization finishes, the Explorer and Trainer continue running.
4. On timeout, the Explorer retries in the next cycle.

- The Explorer decides to request a sync after generating a certain amount of data.
- It tells the Synchronizer: "I'm ready for a new model!"
- The Trainer detects this request during its normal loop and responds.
🎯 **Best for**: Scenarios where the Explorer's pace is irregular, or where on-demand model updates are preferred.
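
An illustrative sketch of this style (values are examples only):

```yaml
synchronizer:
  sync_style: explorer_driven
  sync_method: nccl
  sync_interval: 2    # the explorer requests a sync every 2 steps
  sync_timeout: 3600  # seconds a pending request waits before retrying
```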

📌 **Process Flow**:
1. After the Explorer completes `N` steps → it sets its state to `REQUIRE_SYNC`.
2. It waits for the Trainer to acknowledge and perform the sync.
3. Once synced, it returns to `RUNNING`.
4. On timeout, it retries at the next step.
---

✅ **Best for**: Complex, long-horizon tasks where data generation is expensive or irregular (e.g., multi-turn dialogue, game playing).
### 3. `SyncStyle.TRAINER_DRIVEN` – Trainer-Driven Synchronization
- The Trainer decides when to release a new model.
- Workflow:
1. Every `sync_interval` steps, the Trainer requests synchronization.
2. It notifies the Synchronizer to prepare to push the new model.
3. The Explorer detects this request during its normal loop and performs the synchronization.

> 🔄 More flexible: adapts to actual data throughput.
🎯 **Best for**: Cases where the Trainer has a clear, consistent training rhythm and the Explorer passively receives updates.
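
A sketch with assumed values (`checkpoint` as the `sync_method` value is an assumption):

```yaml
synchronizer:
  sync_style: trainer_driven
  sync_method: checkpoint  # assumed value name; nccl also works with this style
  sync_interval: 2         # the trainer triggers a sync every 2 steps
```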

---

## State Management: What's Going On Behind the Scenes?

The Synchronizer tracks the **state** of both Trainer and Explorer to keep synchronization safe and under control.

### Four Key States
### Three Key States

| State | Meaning |
|------|--------|
| `STOPPED` | Component has stopped working |
| `RUNNING` | Actively training or exploring |
| `REQUIRE_SYNC` | Explorer requests new weights |
| `WAITING_SYNC` | Explorer or Trainer is waiting for synchronization (used in NCCL mode) |
| `REQUIRE_SYNC` | Explorer / Trainer requests new weights |

These states help prevent race conditions and keep coordination smooth.

---

### State Transitions Across Different Modes and Methods

#### 🔹 Fixed Style + NCCL Sync
- The Synchronizer schedules a sync every `N` steps.
- Both sides pause briefly for a direct GPU-to-GPU sync.
- The Trainer's state toggles between `RUNNING` ↔ `WAITING_SYNC`, and the Explorer's state moves through `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.

![FIXED_STYLE_NCCL_SYNC](../../assets/FIXED-NCCL.png)

#### 🔹 Fixed Style + CHECKPOINT/MEMORY
- The Trainer saves or sends weights periodically.
- The Explorer checks at each interval and pulls updates.
- The Trainer's state stays `RUNNING`, and the Explorer's state toggles between `RUNNING` ↔ `REQUIRE_SYNC`.
#### 🔹 NCCL Synchronization
- Both Trainer and Explorer toggle states (`RUNNING` ↔ `REQUIRE_SYNC`).
- Synchronization is a "two-way handshake": data transfer begins only when both sides are ready.
- After synchronization completes, both return to `RUNNING`.

![FIXED_STYLE_STATEDICT_SYNC](../../assets/FIXED-STATEDICT.png)
![NCCL Synchronization](../../assets/NCCL-zh.png)

#### 🔹 Dynamic Style + NCCL
- The Explorer signals `REQUIRE_SYNC` after accumulating enough data.
- The Trainer detects the signal and starts an NCCL sync.
- The Trainer's state toggles between `RUNNING` ↔ `WAITING_SYNC`, and the Explorer's state moves through `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.
#### 🔹 CHECKPOINT/MEMORY Synchronization
- The Trainer usually stays in `RUNNING` (it only saves weights).
- The Explorer initiates the sync request (switching to `REQUIRE_SYNC`), pulls the weights, then returns to `RUNNING`.
- The Synchronizer acts as an intermediary, delivering model weights to the Explorer.

![DYN_STYLE_NCCL_SYNC](../../assets/DYN-NCCL.png)

#### 🔹 Dynamic Style + CHECKPOINT/MEMORY
- The Explorer signals `REQUIRE_SYNC` after accumulating enough data.
- The Trainer detects the signal and pushes weights to the Synchronizer.
- The Trainer's state stays `RUNNING`, and the Explorer's state toggles between `RUNNING` ↔ `REQUIRE_SYNC`.

![DYN_STYLE_STATEDICT_SYNC](../../assets/DYN-STATEDICT.png)
![CHECKPOINT/MEMORY Synchronization](../../assets/STATEDICT-zh.png)

---

@@ -239,9 +233,7 @@ The Synchronizer tracks the **state** of both Trainer and Explorer to keep synchronization
| Use Case | Recommended Style |
|--------|------------------|
| Short episodes, quick feedback (e.g., math QA) | `FIXED` |
| Long interactions, delayed rewards (e.g., games, conversations) | `DYNAMIC_BY_EXPLORER` |

> 💡 `DYNAMIC_BY_EXPLORER` gives more control to the data-generating side, making it better for unbalanced or variable workloads.
| Multi-turn interactive tasks, such as multi-round dialogue, tool use, or multi-step games | `EXPLORER_DRIVEN` or `TRAINER_DRIVEN` |

---

3 changes: 2 additions & 1 deletion docs/sphinx_doc/source_zh/tutorial/trinity_configs.md
@@ -462,7 +462,8 @@ synchronizer:
- `sync_timeout`: Timeout duration for synchronization.
- `sync_style`: Style of synchronization. Options:
- `fixed`: The explorer and trainer synchronize weights every `sync_interval` steps.
- `dynamic_by_explorer`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `explorer_driven`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `trainer_driven`: The trainer notifies the explorer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the explorer has completed at this point.
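
Combined, a minimal illustrative block (example values, not defaults):

```yaml
synchronizer:
  sync_method: nccl
  sync_style: trainer_driven
  sync_interval: 2
  sync_timeout: 1200
```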

---
