Conversation

@lin0303-siyuan (Contributor) commented Dec 11, 2025

Update the docs for the recently added FSDP backend for slime.

@PopSoda2002 (Collaborator) commented:

@Hecate0821 Let's review this together!

@PopSoda2002 (Collaborator) left a comment:

Thanks!

| Feature | Megatron | FSDP | Notes |
|---|---|---|---|
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine) | `--lr-decay-style` (only constant) | |
| **Warmup** | `--lr-warmup-iters` (steps) | Coming Soon | |
A collaborator commented on the lines above:

Hi, I think the learning-rate-related features are already supported; check this out: #1040
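
For context, a minimal sketch of how the FSDP-side flags from the quoted table might be passed at launch. Only the flags come from the table; the entry-point name and values are illustrative assumptions:

```bash
# Hypothetical FSDP-backend launch fragment (entry point and values are
# assumptions; the flags are taken from the table above).
python train.py \
  --context-parallel-size 2 \
  --lr 1e-5 \
  --lr-decay-style constant
```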

| Feature | Megatron | FSDP | Notes |
|---|---|---|---|
| **CPU Backend** | Implemented via distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specify CPU backend and use hybrid backend when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Directly passed to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically same |
| **Offload on Save** | | `--fsdp-state-dict-cpu-offload` (Default True) | **FSDP**: Offload to CPU when saving checkpoint |
@Hecate0821 (Collaborator) commented Dec 12, 2025:

I think we don't need to mention `state-dict-cpu-offload` in the user doc; it should always be True.
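
As an illustration, a sketch combining the backend and precision flags from this table. The entry point and the flag values are assumptions; only the flags themselves appear in the table:

```bash
# Hypothetical FSDP launch fragment (entry point and values are assumptions;
# the flags are from the table above, and the attention backend is passed
# straight through to HuggingFace per the table's note).
python train.py \
  --fsdp-cpu-backend gloo \
  --attn-implementation flash_attention_2 \
  --fp16
```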

| Feature | Megatron | FSDP | Notes |
|---|---|---|---|
| **Expert Parallel** | `--expert-model-parallel-size` | Coming Soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine) | `--lr-decay-style` (only constant) | |
A collaborator commented on the lines above:

I think for LR decay, not only constant is supported? You can check out that PR.
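
For comparison, a sketch of the Megatron-side equivalents from this table. Again, the entry point and values are assumptions; only the flags and decay styles appear in the source:

```bash
# Hypothetical Megatron-backend fragment for comparison (entry point and
# values are assumptions; flags and decay styles are from the table above).
python train.py \
  --expert-model-parallel-size 4 \
  --context-parallel-size 2 \
  --lr 1e-5 \
  --lr-decay-style cosine
```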

@PopSoda2002 (Collaborator) left a comment:

LGTM!

@PopSoda2002 PopSoda2002 merged commit 4eaee1e into THUDM:main Dec 18, 2025