Feat: add usage docs for fsdp #1092
Conversation
@Hecate0821 Let's review this together!
PopSoda2002
left a comment
Thanks!
docs/en/get_started/usage.md
Outdated
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine) | `--lr-decay-style` (only constant) | |
| **Warmup** | `--lr-warmup-iters` (steps) | Coming Soon | |
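For concreteness, a minimal FSDP-backend launch sketch using the flags quoted above; the entry-point script name (`train.py`) and the flag values are placeholders, not taken from this PR:

```bash
# Hypothetical slime launch with the FSDP backend.
# Only the flag names come from the table above; the script name and
# values are assumptions for illustration.
python train.py \
  --context-parallel-size 2 \
  --lr 1e-5 \
  --lr-decay-style constant
```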
Hi, I think the learning-rate-related features are already supported; check this out: #1040
docs/en/get_started/usage.md
Outdated
| **CPU Backend** | Implemented via distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specify CPU backend and use hybrid backend when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Directly passed to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically same |
| **Offload on Save** | | `--fsdp-state-dict-cpu-offload` (Default True) | **FSDP**: Offload to CPU when saving checkpoint |
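As a rough illustration of the FSDP-specific flags quoted above; the script name and the chosen values (`sdpa`, `gloo`) are assumptions, not from the PR:

```bash
# Hypothetical FSDP-backend launch exercising the backend/precision flags
# from the table above. "sdpa" and "gloo" are assumed example values.
python train.py \
  --attn-implementation sdpa \
  --fp16 \
  --fsdp-cpu-backend gloo
```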
I think we don't need to mention `--fsdp-state-dict-cpu-offload` in the user doc; it should always be True.
docs/en/get_started/usage.md
Outdated
| **Expert Parallel** | `--expert-model-parallel-size` | Coming Soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine) | `--lr-decay-style` (only constant) | |
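For comparison, a hypothetical Megatron-backend launch using the Megatron-side flags from this table; the script name and values are illustrative assumptions:

```bash
# Hypothetical Megatron-backend launch; flag names come from the table,
# the script name and values are illustrative only.
python train.py \
  --expert-model-parallel-size 2 \
  --context-parallel-size 2 \
  --lr 1e-5 \
  --lr-decay-style cosine
```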
I think lr decay supports more than just constant? You can check out that PR.
PopSoda2002
left a comment
LGTM!
Feat: add usage docs for fsdp
Update the docs for the recently supported FSDP backend in slime.