CPU-Adam fix for scalar mode #735
Conversation
I don't observe any speedups, but this PR doesn't segfault! Thank you!
What latency do you see for the optimizer part?
To get the stats, I had to change the config to:
Note that in the following stats the sample was taken around step 30, once the optimizer actually kicked in; it skips the first 25 steps on this particular setup/task.
A bit faster: 4.2 sec vs. 4.4 sec with 1 thread.
So 5x slower. cpu-adam config:
torch's Adam config:
Thank you for checking the performance diff! :)
Can we add a unit test related to this change? Do we have a way to repro the old issue and create the test based on that?
To repro the issue, ds-adam needs to be compiled in scalar mode! Currently, we have that automated based on the architecture capability. I think one possibility is to use a parameter to select between the different compute sources and test the old issue. Do you have another opinion here?
That sounds like a good idea to me. Maybe we can add a new environment variable to this part of the code (https://github.com/microsoft/DeepSpeed/blob/master/op_builder/cpu_adam.py#L24) to allow for different SIMD_WIDTH values. If the var isn't set, then we fall back to detection like we have currently. Then in the unit test we'll need a way to force re-compiling the op with JIT, which can be done by wiping out the folder at TORCH_EXTENSIONS_DIR. Once you have the test that triggers the scalar path, I'd be happy to write the hooks needed to do the env variable pieces. A sketch of the idea is below.
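A minimal sketch of that override, assuming a `DS_SIMD_WIDTH` variable name and a `detect_simd_width` placeholder for the current detection logic (both are hypothetical, not merged DeepSpeed API):

```python
import os
import shutil

def detect_simd_width():
    # Placeholder for the existing architecture-capability detection in
    # op_builder/cpu_adam.py; returns the vector width the build should use.
    return 8  # pretend we detected AVX2

def simd_width():
    # DS_SIMD_WIDTH is a hypothetical variable name, not an existing flag.
    forced = os.environ.get("DS_SIMD_WIDTH")
    if forced is not None:
        return int(forced)  # e.g. DS_SIMD_WIDTH=1 forces the scalar path
    return detect_simd_width()  # unset: fall back to auto-detection

# In the unit test, wipe the extensions cache so the op is JIT-recompiled
# with the forced width on the next import.
ext_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR", os.path.expanduser("~/.cache/torch_extensions")
)
shutil.rmtree(ext_dir, ignore_errors=True)
```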
You could monkey-patch the function that returns the arch capability for the duration of the sub-test and lie that it's AMD ;)
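For example, with pytest's monkeypatch fixture (`cpu_arch` is a stand-in name here; the real detection function in op_builder/cpu_adam.py may be named differently):

```python
# Sketch only: force the builder down the scalar path for one test.
# "cpu_arch" is a hypothetical name for the arch-capability function.
def test_cpu_adam_scalar_mode(monkeypatch):
    import op_builder.cpu_adam as cpu_adam

    # Pretend we are on an AMD CPU so auto-detection picks scalar mode.
    monkeypatch.setattr(cpu_adam, "cpu_arch", lambda: "amd")

    # ...then rebuild the op and run a few Adam steps to repro the old segfault.
```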
I think that's a great idea for another unit test here. I think we want (at least) two unit tests here: (1) test that auto-detecting AMD triggers the scalar setup, (2) test that the new scalar logic works when enabled. Unfortunately our CI setup doesn't include any AMD CPUs currently, so this will be tricky for now, but I think we can add an A100 runner temporarily to test this. We've been experimenting with adding temporary GitHub Actions runners and it seems to be working out pretty well so far.
Thanks for the good suggestions; I will add these unit tests later.
This PR fixes the issue in the DeepSpeed CPU-Adam scalar mode, i.e., when running on CPUs that do not support the SIMD instructions it otherwise uses.
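For context, the scalar path performs the plain per-element Adam update; here is a minimal Python sketch of that math (illustrative only, the actual kernel is C++):

```python
import math

def adam_scalar_step(p, g, m, v, step,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One per-element Adam update, which is what the scalar (non-SIMD)
    # path computes; written in Python for clarity.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)  # bias correction
    v_hat = v / (1 - beta2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```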
Regarding performance, I see a 6x to 7x improvement over torch's Adam. For instance, it reduces the optimizer time from 56 seconds to 8.2 seconds for an 11.6B-parameter GPT-2 model.
TODO: CPU-Adam currently supports AVX instructions on Intel architectures. In the next PR, I will try to add AMD AVX support to speed up performance on those architectures as well.