

Update MoE training in example #85

Open
chenyushuo wants to merge 1 commit into pan-x-c:release/0.3.0 from chenyushuo:release/v0.3.0-cys

Conversation

@chenyushuo

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

pan-x-c requested a review from Copilot on September 4, 2025 at 07:17

Copilot AI left a comment

Pull Request Overview

This PR updates the Megatron training configuration to add support for distributed checkpointing options, particularly for Mixture-of-Experts (MoE) models. The changes provide better guidance and configuration options for training MoE models with different checkpoint formats.

  • Adds distributed checkpointing configuration parameters across all model components
  • Updates documentation to clarify the two approaches for training MoE models
  • Provides clearer instructions for users choosing between mBridge and manual model conversion, as sketched below
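
For context, a minimal sketch of the two approaches as they might appear in the Megatron section of the config. Only `use_mbridge`, `use_dist_checkpointing`, and `dist_checkpointing_path` come from this PR; the surrounding nesting and comments are assumptions:

```yaml
# Approach 1 (assumed default): mBridge loads the Hugging Face
# checkpoint directly, so no manual conversion step is needed.
megatron:
  use_mbridge: true
  use_dist_checkpointing: false
---
# Approach 2: convert the HF model to MCore format beforehand and
# load the converted weights via distributed checkpointing.
megatron:
  use_mbridge: false
  use_dist_checkpointing: true
  dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
```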

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • examples/ppo_countdown_megatron/train_countdown.yaml: Adds `use_dist_checkpointing` and `dist_checkpointing_path` parameters to the actor, reference, and critic model configurations (see the sketch below)
  • docs/sphinx_doc/source/tutorial/example_megatron.md: Updates the MoE training documentation with clearer instructions and adds the new checkpointing parameters to its configuration examples
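
The train_countdown.yaml change might look roughly like this. This is a sketch only: the exact nesting of the actor, reference, and critic sections is an assumption, and the `null` default for `dist_checkpointing_path` is a guess; the PR diff is authoritative. The `false` default for `use_dist_checkpointing` matches the review comment below.

```yaml
actor:
  megatron:
    use_dist_checkpointing: false   # default: rely on mBridge instead
    dist_checkpointing_path: null   # point at the converted MCore checkpoint when true
reference:
  megatron:
    use_dist_checkpointing: false
    dist_checkpointing_path: null
critic:
  megatron:
    use_dist_checkpointing: false
    dist_checkpointing_path: null
```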


Comment on lines +199 to 202
If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config with:
- `use_dist_checkpointing: true`
- `dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/`


Copilot AI Sep 4, 2025


The documentation shows use_dist_checkpointing: true but the YAML configuration files consistently set this to false by default. This inconsistency could confuse users about the correct value to use when manually converting models.

Suggested change
If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config with:
- `use_dist_checkpointing: true`
- `dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/`
If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config as follows:
```yaml
megatron:
  use_mbridge: false
  use_dist_checkpointing: true
  dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
  # ... other settings ...
```

