Integrate MS-AMP Support for FP8 as a separate backend #2232
Conversation
TODO: write some doc guides on MS-AMP
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for enabling MS-AMP FP8 in accelerate. Overall this looks good, but I had a couple of comments, please have a look.
In addition, I have these more general comments:
- IIUC, when using FP8 with TE, when it is detected that the device does not support it, there is an automatic fallback to fp16. Is there a similar mechanism for MS-AMP?
- Implementation-wise, the arguments for FP8 with MS-AMP vs TE are completely disjoint, right? This is a bit unfortunate, as users could e.g. set MS-AMP as the backend, change `amax_history_len`, and wonder why it has no effect. To be super user-friendly, we would have to add checks and docs so that only arguments valid for the given backend can be changed (a sketch of what such a check could look like follows below). A cleaner solution would be to use a completely separate dataclass for MS-AMP, although that might clash with the accelerate philosophy of abstracting away such implementation details.
- I guess there is no way to run CI tests for this :(
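For illustration only, a hypothetical sketch of such a check; the helper name and the argument groupings are made up for this example and are not accelerate's actual API:

```python
import warnings

# Hypothetical groupings, for illustration only: which FP8 kwargs affect which
# backend (the names come from the discussion above).
TE_ONLY_KWARGS = {"amax_history_len"}
MSAMP_ONLY_KWARGS = {"optimization_level"}


def warn_on_ignored_fp8_kwargs(backend: str, user_set_kwargs: set) -> None:
    """Warn when the user sets a kwarg that the chosen FP8 backend ignores."""
    ignored = MSAMP_ONLY_KWARGS if backend == "TE" else TE_ONLY_KWARGS
    for name in sorted(ignored & user_set_kwargs):
        warnings.warn(f"`{name}` has no effect when the FP8 backend is {backend}.")


# Example: MS-AMP is selected but a TE-only knob was changed -> emits a warning.
warn_on_ignored_fp8_kwargs("MS-AMP", {"amax_history_len", "optimization_level"})
```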
@BenjaminBossan re: eventually we will support a "mixed" backend that combines both, as MS-AMP has support for converting
@BenjaminBossan re: 1, we straight up don't allow it:

```python
if mixed_precision == "fp8" and not is_fp8_available():
    raise ValueError("Using `fp8` precision requires `transformer_engine` to be installed.")
```
Oh okay. I was looking at this part of the code: accelerate/src/accelerate/accelerator.py, lines 1287 to 1306 in 54d670b.
We should probably refactor this then as part of
LGTM, thanks! Unfortunately, I can't test it :(
LGTM! Just a few nits
Thank you @muellerzr for working on the MS-AMP FP8 support! ✨ Overall it looks good wrt the integration and the memory savings of 33% (1/3) for a 560M param model. However, it would be great to see more experiments at intermediate model scales of ~10B parameters, as the paper claims:
Experiment results show that, during the training of GPT-175B model
on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a
remarkable 42% reduction in real memory usage but also ran 64% faster than the widely
adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 17%.
Here, it is odd that we see no savings in time. Maybe it is the case that with reduced memory, they fit larger batches, leading to faster training?
@pacman100 I'd expect that's likely to be the case. I'll check FLOPS, mainly on the scaled-up training, to see what results they give :)
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
cc @MKhalusova for docs and then we can merge 🤗
The original paper shows that TE > MS-AMP > BF16 under the same batch size setting (paper Table 5). It is interesting to see MS-AMP slower than BF16.
Great work on the docs! I left a few suggestions :)
I have an inkling that model size plays a huge role in this. With TE, I've pretty much always seen it be slower unless our model size is > 3B
Co-authored-by: Maria Khalusova <kafooster@gmail.com>
I see!
I have tried this work, it's amazing. But I have some questions: GPU: L20
Integrate MS-AMP to the `Accelerator` (round 2)
What does this add?
This PR introduces an additional backend for FP8 support through MS-AMP, which has been shown to decrease memory usage when using FP8 precision while maintaining accuracy.
Who is it for?
Individuals training with FP8 (H100s, 4090s, etc.)
Issues linked to
Azure/MS-AMP#128
What parts of the API does this impact?
User-facing:
Two new arguments were added to the `FP8RecipeKwargs`:
- `backend` (`str`): Whether a user should use MS-AMP or TE (TransformerEngine). Uses `MS-AMP` by default.
- `optimization_level` (`str`): Should be one of `"O1"` or `"O2"`. `"O3"` is for DeepSpeed, and we need to wait for them to update to v0.9.3 of deepspeed to match what Accelerate supports.

General guideline to optimization levels:
- `O1`: Weight gradients and `all_reduce` communications are done in FP8, reducing GPU memory usage and communication bandwidth.
- `O2`: First-order optimizer states are in 8-bit and second-order states are in FP16. Only available when using Adam or AdamW. This maintains accuracy and can potentially save the highest memory.
- `O3`: Specifically for DeepSpeed, weights and master weights are stored in FP8. If `fp8` is selected and deepspeed is enabled, it will be used by default. (Not available currently.)
As a result, `"O2"` is the default. Here is an overview of each optimization level and what it does, taken from their docs:
Basic Usage Example(s):
A user can either do:
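A minimal sketch of this first path, assuming only the public `Accelerator` constructor and MS-AMP being the default `fp8` backend as described in this PR:

```python
from accelerate import Accelerator

# Request fp8 mixed precision; per this PR, the MS-AMP backend is used by default.
accelerator = Accelerator(mixed_precision="fp8")

# Then prepare objects as usual, e.g.:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```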
Or use the `FP8RecipeKwargs`:
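A minimal sketch of this second path, using the argument names as described above (exact names and accepted values may differ in the final API):

```python
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# Explicitly pick the MS-AMP backend and its optimization level via a kwargs handler.
fp8_handler = FP8RecipeKwargs(backend="msamp", optimization_level="O2")
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_handler])
```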
Benchmarks
When running on bloomz-560m I saw a memory decrease of ~1/3.
More experiments need to be conducted on the behavior of TE vs MS-AMP wrt performance. For instance, when running the sample script I use for speed tests (here) on the first 100 batches, I saw a stark contrast in the ending training loss between TE and MS-AMP:
BF16 (baseline): 2.4867
TE: 11.3125
MS-AMP: 2.89
I also found that overall there wasn't much of a time savings with MS-AMP; it actually added time instead (BF16 was ~0.139s/batch, while MS-AMP was 0.169s/batch). I want to run some more tests to verify, but these were my local results.
The performance difference isn't much in the case of BF16 vs MS-AMP, but it is a stark contrast when compared to TE. More work is needed to investigate why, so as a result I've opted to make MS-AMP an entirely separate backend to use, rather than combine the two.
What went wrong in the last PR
While the training speed results looked very good, I quickly noticed issues with the losses that didn't make sense. Models simply weren't training or converging, which was surprising. For now, I'm taking this more staged approach to the integration while we discover behaviors (both with TE and MS-AMP) through longer training runs.