

Update MoE training in example #85

Open
chenyushuo wants to merge 1 commit into pan-x-c:release/0.3.0 from chenyushuo:release/v0.3.0-cys

Conversation

@chenyushuo

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

pan-x-c requested a review from Copilot on September 4, 2025 at 07:17

Copilot AI left a comment

Pull Request Overview

This PR updates the Megatron training configuration to add support for distributed checkpointing options, particularly for Mixture-of-Experts (MoE) models. The changes provide better guidance and configuration options for training MoE models with different checkpoint formats.

  • Adds distributed checkpointing configuration parameters across all model components
  • Updates documentation to clarify the two approaches for training MoE models
  • Provides clearer instructions for users choosing between mBridge and manual model conversion, as sketched below
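
For context, a minimal sketch of the two approaches as they might appear in the Megatron section of the config. Only `use_mbridge`, `use_dist_checkpointing`, and `dist_checkpointing_path` come from this PR; the surrounding nesting and comments are assumptions:

```yaml
# Approach 1 (assumed default): mBridge loads the Hugging Face
# checkpoint directly, so no manual conversion step is needed.
megatron:
  use_mbridge: true
  use_dist_checkpointing: false
---
# Approach 2: convert the HF model to MCore format beforehand and
# load the converted weights via distributed checkpointing.
megatron:
  use_mbridge: false
  use_dist_checkpointing: true
  dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
```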

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • examples/ppo_countdown_megatron/train_countdown.yaml: Adds `use_dist_checkpointing` and `dist_checkpointing_path` parameters to the actor, reference, and critic model configurations (see the sketch below)
  • docs/sphinx_doc/source/tutorial/example_megatron.md: Updates the MoE training documentation with clearer instructions and adds the new checkpointing parameters to its configuration examples
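
The train_countdown.yaml change might look roughly like this. This is a sketch only: the exact nesting of the actor, reference, and critic sections is an assumption, and the `null` default for `dist_checkpointing_path` is a guess; the PR diff is authoritative. The `false` default for `use_dist_checkpointing` matches the review comment below.

```yaml
actor:
  megatron:
    use_dist_checkpointing: false   # default: rely on mBridge instead
    dist_checkpointing_path: null   # point at the converted MCore checkpoint when true
reference:
  megatron:
    use_dist_checkpointing: false
    dist_checkpointing_path: null
critic:
  megatron:
    use_dist_checkpointing: false
    dist_checkpointing_path: null
```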


Comment on lines +199 to 202
If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config with:
- `use_dist_checkpointing: true`
- `dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/`


Copilot AI Sep 4, 2025


The documentation shows use_dist_checkpointing: true but the YAML configuration files consistently set this to false by default. This inconsistency could confuse users about the correct value to use when manually converting models.

Suggested change
If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config with:
- `use_dist_checkpointing: true`
- `dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/`
If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config as follows:
```yaml
megatron:
  use_mbridge: false
  use_dist_checkpointing: true
  dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
  # ... other settings ...
```

