Binary file removed docs/sphinx_doc/assets/DYN-NCCL.png
Binary file removed docs/sphinx_doc/assets/DYN-STATEDICT.png
Binary file removed docs/sphinx_doc/assets/FIXED-NCCL.png
Binary file removed docs/sphinx_doc/assets/FIXED-STATEDICT.png
Binary file added docs/sphinx_doc/assets/NCCL-en.png
Binary file added docs/sphinx_doc/assets/NCCL-zh.png
Binary file added docs/sphinx_doc/assets/STATEDICT-en.png
Binary file added docs/sphinx_doc/assets/STATEDICT-zh.png
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source/tutorial/develop_operator.md
@@ -71,7 +71,7 @@ data_processor:
threshold: 0.1
synchronizer:
sync_method: nccl
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_interval: 2
# some other configs
```
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source/tutorial/example_react.md
@@ -155,7 +155,7 @@ Since agent applications may have variable interaction rounds and sample counts,

```yaml
synchronizer:
sync_style: dynamic_by_explorer # Trainer starts training immediately when enough data is generated, rather than padding to a fixed size, improving efficiency
sync_style: explorer_driven # Trainer starts training immediately when enough data is generated, rather than padding to a fixed size, improving efficiency
sync_interval: 2 # Check for model parameter updates after every two batches
```

4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/example_step_wise.md
@@ -69,7 +69,7 @@ In general multi-step scenarios, each run may generate various number of experie

- `buffer.trainer_input.experience_buffer.replay_buffer`: Using `PriorityQueue` allows the model to use the experiences with higher priority, which prefers newly-generated experiences by default.

- `synchronizer.sync_style = dynamic_by_explorer`: The explorer determines when to synchronize the model weights with the trainer.
- `synchronizer.sync_style = explorer_driven`: The explorer determines when to synchronize the model weights with the trainer.


The example configuration is shown as:
@@ -134,7 +134,7 @@ explorer:
env_vars:
TMPDIR: ${oc.env:TMPDIR,/tmp}
synchronizer:
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_method: 'nccl'
sync_interval: 2
sync_timeout: 3600
93 changes: 43 additions & 50 deletions docs/sphinx_doc/source/tutorial/synchronizer.md
@@ -29,21 +29,27 @@ To achieve this, the Synchronizer:
async def train(self) -> str:
while self.train_step_num < self.total_steps:
try:
metrics = {}
# sampling may block if the explorer has not yet generated enough data
self.logger.info(f"Sample data for step {self.train_step_num + 1} started.")
sample_task = asyncio.create_task(self._sample_data())
while not sample_task.done():
# sync weight to make sure the explorer can continue to explore and generate enough data
if await self.need_sync():
# Currently, we do not record the metrics of sync_weight here
await self.sync_weight()
metrics.update(await self.sync_weight())
await asyncio.sleep(1)
exps, metrics, repr_samples = await sample_task
exps, sample_metrics, repr_samples = await sample_task
metrics.update(sample_metrics)
self.logger.info(f"Sample data for step {self.train_step_num + 1} finished.")
metrics.update(await self.train_step(exps))
if await self.need_sync():
metrics.update(await self.sync_weight())
# ...
if self.need_save():
metrics.update(
await self.save_checkpoint(save_as_hf=self.save_hf_checkpoint == "always")
)
if self.config.trainer.enable_preview:
self._log_experiences(repr_samples)
self.monitor.log(metrics, self.train_step_num)
except StopAsyncIteration:
self.logger.info("No more samples to train. Stopping training.")
@@ -145,77 +151,66 @@ There are **two synchronization styles** that define *when* the Explorer request
| `interval=10, offset=0` | Sync every 10 steps (both start together) |
| `interval=10, offset=5` | Explorer runs 5 steps first, then sync every 10 steps |

✅ **Best for**: Simple, predictable environments where exploration steps are short and rewards are frequent (e.g., mathematical reasoning tasks).

> 🔁 Think of it as a metronome — steady and regular.
🎯 **Best for**: Simple, predictable environments with short exploration episodes and frequent rewards (e.g., mathematical reasoning tasks).
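
A minimal configuration sketch for this style (`sync_offset` is an assumed key name for the offset shown in the table above; verify it against the configuration reference):

```yaml
# Sketch: fixed-style synchronization with the explorer leading by 5 steps.
synchronizer:
  sync_style: fixed
  sync_method: nccl    # direct GPU-to-GPU weight transfer
  sync_interval: 10    # synchronize every 10 steps
  sync_offset: 5       # assumed key name: explorer runs 5 steps before the first sync
```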

---

### 2. `SyncStyle.DYNAMIC_BY_EXPLORER` – Demand-Driven Sync
### 2. `SyncStyle.EXPLORER_DRIVEN` – Explorer-Driven Synchronization

- The Explorer itself decides when it needs a new model.
- Workflow:
1. After completing `sync_interval` steps, the Explorer sends a request to the Synchronizer to update its parameters.
2. The Trainer detects this request in its next loop iteration and performs the synchronization.
3. Once synchronization completes, both the Explorer and Trainer continue running.
4. If a timeout occurs, the Explorer retries in the next cycle.

- Explorer decides to request a sync after generating a certain amount of data.
- It tells Synchronizer: _"I’m ready for a new model!"_
- Trainer checks this request during its normal loop and responds accordingly.
🎯 **Best for**: Scenarios where the Explorer’s pace is irregular or when on-demand model updates are preferred.
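
As a sketch, with illustrative values (the keys mirror the step-wise example configuration elsewhere in these docs):

```yaml
# Sketch: the explorer asks for new weights after every 2 of its own steps.
synchronizer:
  sync_style: explorer_driven
  sync_method: nccl
  sync_interval: 2     # measured in explorer steps
  sync_timeout: 3600   # seconds a pending sync request waits before the explorer retries
```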

📌 **Process Flow**:
1. Explorer finishes `N` steps → sets state to `REQUIRE_SYNC`.
2. Waits for Trainer to acknowledge and perform sync.
3. Once synced, returns to `RUNNING`.
4. If timeout occurs, retries on next step.
---

✅ **Best for**: Complex, long-horizon tasks where data generation is expensive or variable (e.g., multi-turn dialogue, game playing).
### 3. `SyncStyle.TRAINER_DRIVEN` – Trainer-Driven Synchronization

> 🔄 More flexible — adapts to actual data throughput.
- The Trainer determines when to release a new model.
- Workflow:
1. Every `sync_interval` steps, the Trainer requests synchronization.
2. It notifies the Synchronizer to prepare to push the new model.
3. The Explorer detects this request during its normal loop and responds by performing synchronization.

🎯 **Best for**: Cases where the Trainer has a clear, consistent training rhythm, and the Explorer passively receives updates.
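
A sketch of this style, under the assumption that `checkpoint` is the `sync_method` value selecting checkpoint-based transfer:

```yaml
# Sketch: the trainer triggers a sync after every 2 of its own steps.
synchronizer:
  sync_style: trainer_driven
  sync_method: checkpoint  # assumed value name for checkpoint-based transfer; nccl also works
  sync_interval: 2         # measured in trainer steps
```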

---

## State Management: What’s Going On Behind the Scenes?

The Synchronizer tracks the **state** of both Trainer and Explorer to manage synchronization safely.

### Four Key States
### Three Key States

| State | Meaning |
|------|--------|
| `STOPPED` | Component has stopped working |
| `RUNNING` | Actively training or exploring |
| `REQUIRE_SYNC` | Explorer wants new weights |
| `WAITING_SYNC` | Explorer or Trainer is waiting for synchronization (used in NCCL mode) |
| `REQUIRE_SYNC` | Explorer / Trainer requests new weights |

These states help prevent race conditions and ensure smooth coordination.

---

### State Transitions by Style & Method

#### 🔹 Fixed Style + NCCL Sync
- Synchronizer schedules sync every `N` steps.
- Both sides pause briefly for direct GPU sync.
- The state of the trainer toggles predictably between `RUNNING` ↔ `WAITING_SYNC`, and the state of the explorer toggles among `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.
### State Transitions Across Different Modes and Methods

![FIXED_STYLE_NCCL_SYNC](../../assets/FIXED-NCCL.png)
#### 🔹 NCCL Synchronization
- Both Trainer and Explorer toggle states (`RUNNING` ↔ `REQUIRE_SYNC`).
- Synchronization uses a "two-way handshake": data transfer only begins once both sides are ready.
- After synchronization completes, both return to `RUNNING`.

#### 🔹 Fixed Style + CHECKPOINT/MEMORY
- Trainer saves or sends weights periodically.
- Explorer checks at each interval and pulls updates.
- The state of the trainer remains at `RUNNING`, and the state of the explorer toggles between `RUNNING` ↔ `REQUIRE_SYNC`.
![NCCL Synchronization](../../assets/NCCL-en.png)

![FIXED_STYLE_STATEDICT_SYNC](../../assets/FIXED-STATEDICT.png)
#### 🔹 CHECKPOINT/MEMORY Synchronization
- The Trainer typically remains in the `RUNNING` state (it only saves weights).
- The Explorer initiates the sync request (switches to `REQUIRE_SYNC`), pulls the weights, then returns to `RUNNING`.
- The Synchronizer acts as an intermediary, delivering model weights to the Explorer.


#### 🔹 Dynamic Style + NCCL
- Explorer signals `REQUIRE_SYNC` after enough data.
- Trainer sees the signal and initiates NCCL sync.
- The state of the trainer toggles predictably between `RUNNING` ↔ `WAITING_SYNC`, and the state of the explorer toggles between `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.

![DYN_STYLE_NCCL_SYNC](../../assets/DYN-NCCL.png)

#### 🔹 Dynamic Style + CHECKPOINT/MEMORY
- Explorer signals `REQUIRE_SYNC` after enough data.
- Trainer sees the signal and pushes weights to synchronizer.
- The state of the trainer remains at `RUNNING`, and the state of the explorer toggles between `RUNNING` ↔ `REQUIRE_SYNC`.

![DYN_STYLE_STATEDICT_SYNC](../../assets/DYN-STATEDICT.png)
![CHECKPOINT/MEMORY Synchronization](../../assets/STATEDICT-en.png)

---

@@ -240,9 +235,7 @@
| Use Case | Recommended Style |
|--------|------------------|
| Short episodes, quick feedback (e.g., math QA) | `FIXED` |
| Long interactions, delayed rewards (e.g., games, conversations) | `DYNAMIC_BY_EXPLORER` |

> 💡 `DYNAMIC_BY_EXPLORER` gives more control to the data-generating side, making it better for unbalanced or variable workloads.
| Multi-turn interactive tasks, such as multi-round dialogues, tool usage, or multi-step games | `EXPLORER_DRIVEN` or `TRAINER_DRIVEN` |

---

3 changes: 2 additions & 1 deletion docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -465,7 +465,8 @@ synchronizer:
- `sync_timeout`: Timeout duration for synchronization.
- `sync_style`: Style of synchronization. Options:
- `fixed`: The explorer and trainer synchronize weights every `sync_interval` steps.
- `dynamic_by_explorer`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `explorer_driven`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `trainer_driven`: The trainer notifies the explorer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the explorer has completed at this point.
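
Putting these options together, a minimal illustrative block (example values, not defaults):

```yaml
synchronizer:
  sync_method: nccl            # see the Synchronizer tutorial for available methods
  sync_style: explorer_driven
  sync_interval: 2
  sync_timeout: 1200
```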

---

4 changes: 2 additions & 2 deletions docs/sphinx_doc/source_zh/tutorial/develop_operator.md
@@ -80,12 +80,12 @@ data_processor:
threshold: 0.1
synchronizer:
sync_method: nccl
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_interval: 2
# some other configs
```

```{tip}
`RewardFilter` reduces the number of experiences, which may leave the Trainer without enough experiences to start training. To avoid this, you can use Trinity-RFT's {ref}`dynamic synchronization <Synchronizer>` feature (`dynamic_by_explorer`).
`RewardFilter` reduces the number of experiences, which may leave the Trainer without enough experiences to start training. To avoid this, you can use Trinity-RFT's {ref}`dynamic synchronization <Synchronizer>` feature (`explorer_driven`).
With the settings above, the `Explorer` synchronizes with the `Trainer` once every 2 steps and keeps running regardless of how many steps the `Trainer` has completed. This ensures that as long as the `Explorer` is running, the `Trainer` always has enough experiences to start a training step.
```
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source_zh/tutorial/example_react.md
@@ -162,7 +162,7 @@ algorithm:

```yaml
synchronizer:
sync_style: dynamic_by_explorer # The trainer starts training as soon as enough data has been generated, instead of padding to a fixed batch size, which improves training efficiency
sync_style: explorer_driven # The trainer starts training as soon as enough data has been generated, instead of padding to a fixed batch size, which improves training efficiency
sync_interval: 2 # Check whether the model parameters need to be synchronized after every two batches of tasks
```

4 changes: 2 additions & 2 deletions docs/sphinx_doc/source_zh/tutorial/example_step_wise.md
@@ -68,7 +68,7 @@ WORKFLOWS = Registry(

- `buffer.trainer_input.experience_buffer.replay_buffer`: Using `PriorityQueue` lets the model consume higher-priority experiences first (by default, newly generated experiences).

- `synchronizer.sync_style = dynamic_by_explorer`: The explorer decides when to synchronize model weights with the trainer.
- `synchronizer.sync_style = explorer_driven`: The explorer decides when to synchronize model weights with the trainer.

The example configuration is shown below:

@@ -129,7 +129,7 @@ explorer:
env_vars:
TMPDIR: ${oc.env:TMPDIR,/tmp}
synchronizer:
sync_style: dynamic_by_explorer
sync_style: explorer_driven
sync_method: 'nccl'
sync_interval: 2
sync_timeout: 3600
88 changes: 40 additions & 48 deletions docs/sphinx_doc/source_zh/tutorial/synchronizer.md
@@ -28,21 +28,27 @@
async def train(self) -> str:
while self.train_step_num < self.total_steps:
try:
metrics = {}
# sampling may block if the explorer has not yet generated enough data
self.logger.info(f"Sample data for step {self.train_step_num + 1} started.")
sample_task = asyncio.create_task(self._sample_data())
while not sample_task.done():
# sync weight to make sure the explorer can continue to explore and generate enough data
if await self.need_sync():
# Currently, we do not record the metrics of sync_weight here
await self.sync_weight()
metrics.update(await self.sync_weight())
await asyncio.sleep(1)
exps, metrics, repr_samples = await sample_task
exps, sample_metrics, repr_samples = await sample_task
metrics.update(sample_metrics)
self.logger.info(f"Sample data for step {self.train_step_num + 1} finished.")
metrics.update(await self.train_step(exps))
if await self.need_sync():
metrics.update(await self.sync_weight())
# ...
if self.need_save():
metrics.update(
await self.save_checkpoint(save_as_hf=self.save_hf_checkpoint == "always")
)
if self.config.trainer.enable_preview:
self._log_experiences(repr_samples)
self.monitor.log(metrics, self.train_step_num)
except StopAsyncIteration:
self.logger.info("No more samples to train. Stopping training.")
@@ -145,76 +151,64 @@ The Explorer checks whether it needs to synchronize at the following moments:
| `interval=10, offset=0` | Sync every 10 steps (both start together) |
| `interval=10, offset=5` | Explorer runs 5 steps first, then sync every 10 steps |

✅ **Best for**: Simple, predictable environments with short exploration steps and frequent rewards (e.g., mathematical reasoning tasks).

> 🔁 Think of it as a metronome: steady and regular.
🎯 **Best for**: Simple, predictable environments with short exploration steps and frequent rewards (e.g., mathematical reasoning tasks).
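
A minimal sketch of this style (`sync_offset` is an assumed key name for the offset in the table above):

```yaml
synchronizer:
  sync_style: fixed
  sync_method: nccl
  sync_interval: 10   # synchronize every 10 steps
  sync_offset: 5      # assumed key name: explorer leads by 5 steps
```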

---

### 2. `SyncStyle.DYNAMIC_BY_EXPLORER` – Demand-Driven Sync
### 2. `SyncStyle.EXPLORER_DRIVEN` – Explorer-Driven Synchronization
- The Explorer itself decides when it needs a new model.
- Workflow:
1. After completing `sync_interval` steps, the Explorer sends a request to the Synchronizer to update its parameters.
2. The Trainer detects this request in its next loop iteration and completes the synchronization.
3. Once synchronization finishes, the Explorer and Trainer continue running.
4. On timeout, the Explorer retries in the next cycle.

- The Explorer decides to request a sync after generating a certain amount of data.
- It tells the Synchronizer: "I'm ready for a new model!"
- The Trainer detects this request during its normal loop and responds.
🎯 **Best for**: Scenarios where the Explorer's pace is irregular, or where on-demand model updates are preferred.
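
An illustrative sketch of this style (values are examples only):

```yaml
synchronizer:
  sync_style: explorer_driven
  sync_method: nccl
  sync_interval: 2    # the explorer requests a sync every 2 steps
  sync_timeout: 3600  # seconds a pending request waits before retrying
```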

📌 **Process Flow**:
1. After the Explorer completes `N` steps → it sets its state to `REQUIRE_SYNC`.
2. It waits for the Trainer to acknowledge and perform the sync.
3. Once synced, it returns to `RUNNING`.
4. On timeout, it retries at the next step.
---

✅ **Best for**: Complex, long-horizon tasks where data generation is expensive or irregular (e.g., multi-turn dialogue, game playing).
### 3. `SyncStyle.TRAINER_DRIVEN` – Trainer-Driven Synchronization
- The Trainer decides when to release a new model.
- Workflow:
1. Every `sync_interval` steps, the Trainer requests synchronization.
2. It notifies the Synchronizer to prepare to push the new model.
3. The Explorer detects this request during its normal loop and performs the synchronization.

> 🔄 More flexible: adapts to actual data throughput.
🎯 **Best for**: Cases where the Trainer has a clear, consistent training rhythm and the Explorer passively receives updates.
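
A sketch with assumed values (`checkpoint` as the `sync_method` value is an assumption):

```yaml
synchronizer:
  sync_style: trainer_driven
  sync_method: checkpoint  # assumed value name; nccl also works with this style
  sync_interval: 2         # the trainer triggers a sync every 2 steps
```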

---

## State Management: What's Going On Behind the Scenes?

The Synchronizer tracks the **state** of both Trainer and Explorer to keep synchronization safe and under control.

### Four Key States
### Three Key States

| State | Meaning |
|------|--------|
| `STOPPED` | Component has stopped working |
| `RUNNING` | Actively training or exploring |
| `REQUIRE_SYNC` | Explorer requests new weights |
| `WAITING_SYNC` | Explorer or Trainer is waiting for synchronization (used in NCCL mode) |
| `REQUIRE_SYNC` | Explorer / Trainer requests new weights |

These states help prevent race conditions and keep coordination smooth.

---

### State Transitions Across Different Modes and Methods

#### 🔹 Fixed Style + NCCL Sync
- The Synchronizer schedules a sync every `N` steps.
- Both sides pause briefly for a direct GPU-to-GPU sync.
- The Trainer's state toggles between `RUNNING` ↔ `WAITING_SYNC`, and the Explorer's state moves through `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.

![FIXED_STYLE_NCCL_SYNC](../../assets/FIXED-NCCL.png)

#### 🔹 Fixed Style + CHECKPOINT/MEMORY
- The Trainer saves or sends weights periodically.
- The Explorer checks at each interval and pulls updates.
- The Trainer's state stays `RUNNING`, and the Explorer's state toggles between `RUNNING` ↔ `REQUIRE_SYNC`.
#### 🔹 NCCL Synchronization
- Both Trainer and Explorer toggle states (`RUNNING` ↔ `REQUIRE_SYNC`).
- Synchronization is a "two-way handshake": data transfer begins only when both sides are ready.
- After synchronization completes, both return to `RUNNING`.

![FIXED_STYLE_STATEDICT_SYNC](../../assets/FIXED-STATEDICT.png)
![NCCL Synchronization](../../assets/NCCL-zh.png)

#### 🔹 Dynamic Style + NCCL
- The Explorer signals `REQUIRE_SYNC` after accumulating enough data.
- The Trainer detects the signal and starts an NCCL sync.
- The Trainer's state toggles between `RUNNING` ↔ `WAITING_SYNC`, and the Explorer's state moves through `RUNNING` → `REQUIRE_SYNC` → `WAITING_SYNC`.
#### 🔹 CHECKPOINT/MEMORY Synchronization
- The Trainer usually stays in `RUNNING` (it only saves weights).
- The Explorer initiates the sync request (switching to `REQUIRE_SYNC`), pulls the weights, then returns to `RUNNING`.
- The Synchronizer acts as an intermediary, delivering model weights to the Explorer.

![DYN_STYLE_NCCL_SYNC](../../assets/DYN-NCCL.png)

#### 🔹 Dynamic Style + CHECKPOINT/MEMORY
- The Explorer signals `REQUIRE_SYNC` after accumulating enough data.
- The Trainer detects the signal and pushes weights to the Synchronizer.
- The Trainer's state stays `RUNNING`, and the Explorer's state toggles between `RUNNING` ↔ `REQUIRE_SYNC`.

![DYN_STYLE_STATEDICT_SYNC](../../assets/DYN-STATEDICT.png)
![CHECKPOINT/MEMORY Synchronization](../../assets/STATEDICT-zh.png)

---

@@ -239,9 +233,7 @@ The Synchronizer tracks the **state** of both Trainer and Explorer to keep synchronization
| Use Case | Recommended Style |
|--------|------------------|
| Short episodes, quick feedback (e.g., math QA) | `FIXED` |
| Long interactions, delayed rewards (e.g., games, conversations) | `DYNAMIC_BY_EXPLORER` |

> 💡 `DYNAMIC_BY_EXPLORER` gives more control to the data-generating side, making it better for unbalanced or variable workloads.
| Multi-turn interactive tasks, such as multi-round dialogue, tool use, or multi-step games | `EXPLORER_DRIVEN` or `TRAINER_DRIVEN` |

---

3 changes: 2 additions & 1 deletion docs/sphinx_doc/source_zh/tutorial/trinity_configs.md
@@ -462,7 +462,8 @@ synchronizer:
- `sync_timeout`: Timeout duration for synchronization.
- `sync_style`: Style of synchronization. Options:
- `fixed`: The explorer and trainer synchronize weights every `sync_interval` steps.
- `dynamic_by_explorer`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `explorer_driven`: The explorer notifies the trainer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the trainer has completed at this point.
- `trainer_driven`: The trainer notifies the explorer to synchronize weights after completing `sync_interval` steps, regardless of how many steps the explorer has completed at this point.
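
Combined, a minimal illustrative block (example values, not defaults):

```yaml
synchronizer:
  sync_method: nccl
  sync_style: trainer_driven
  sync_interval: 2
  sync_timeout: 1200
```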

---
