Update MoE training in example #85
Conversation
Pull Request Overview
This PR updates the Megatron training configuration to add distributed checkpointing options, particularly for Mixture-of-Experts (MoE) models, and improves the guidance and configuration options for training MoE models with different checkpoint formats.
- Adds distributed checkpointing configuration parameters across all model components
- Updates documentation to clarify the two approaches for training MoE models
- Provides clearer instructions for users choosing between mBridge and manual model conversion
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| examples/ppo_countdown_megatron/train_countdown.yaml | Adds use_dist_checkpointing and dist_checkpointing_path parameters to actor, reference, and critic model configurations |
| docs/sphinx_doc/source/tutorial/example_megatron.md | Updates MoE training documentation with clearer instructions and adds the new checkpointing parameters to configuration examples |
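For context, here is a minimal sketch of how the added parameters might look in `train_countdown.yaml`. The nesting and key names (`actor`/`reference`/`critic` under a `megatron` section) are assumed from the file descriptions above, not copied from the actual file; the `false` default follows the review comment below.

```yaml
# Hypothetical layout -- the actual train_countdown.yaml structure may differ.
actor:
  megatron:
    use_dist_checkpointing: false   # PR default; set true after manual HF -> MCore conversion
    dist_checkpointing_path: null   # path to the converted MCore checkpoint
reference:
  megatron:
    use_dist_checkpointing: false
    dist_checkpointing_path: null
critic:
  megatron:
    use_dist_checkpointing: false
    dist_checkpointing_path: null
```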
> If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config with:
> - `use_dist_checkpointing: true`
> - `dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/`
The documentation shows `use_dist_checkpointing: true`, but the YAML configuration files consistently set this to `false` by default. This inconsistency could confuse users about the correct value to use when manually converting models.
Suggested change:

If you prefer not to use mBridge, set `use_mbridge: false`. Before training, you must first convert your Hugging Face model to the MCore format using the [Hugging Face to MCore converter](https://github.com/volcengine/verl/blob/main/scripts/converter_hf_to_mcore.py) from the **verl** repository. After conversion, update your config as follows:

```yaml
megatron:
  use_mbridge: false
  use_dist_checkpointing: true
  dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
  # ... other settings ...
```
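For reference, the conversion step mentioned above would look roughly like this. The flag names follow the verl documentation but should be verified against the script's `--help`; the paths are placeholders.

```bash
# Sketch of the HF -> MCore conversion step (flag names assumed from verl docs;
# verify with `python scripts/converter_hf_to_mcore.py --help`).
python scripts/converter_hf_to_mcore.py \
    --hf_model_path /PATH/TO/HF/MODEL \
    --output_path /PATH/TO/CONVERTED/MODEL/
```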
Description
As the title says.
Checklist
Please check the following items before code is ready to be reviewed.