CPU-Adam fix for scalar mode #735
Conversation
I don't observe any speedups, but this PR doesn't segfault! Thank you!
What latency do you see for the optimizer part?
To get the stats, I had to change the config to:
Note that in the following stats the sample was taken around step 30, once the optimizer actually kicked in; it skips the first 25 steps on this particular setup/task.
A bit faster: 4.2 sec vs. 4.4 sec with 1 thread.
So 5x slower. cpu-adam config:
torch's Adam config:
Thank you for checking the performance diff! :)
Can we add a unit test related to this change? Do we have a way to repro the old issue and create the test based on that?
To repro the issue, ds-adam needs to be compiled in scalar mode! Currently, we have that automated based on the architecture capability. I think one possibility is to use a parameter to select between the different compute sources and test the old issue. Do you have another opinion here?
That sounds like a good idea to me. Maybe we can add a new environment variable to this part of the code (https://github.com/microsoft/DeepSpeed/blob/master/op_builder/cpu_adam.py#L24) to allow for different SIMD_WIDTH values. If the var isn't set, then we fall back to detection like we have currently. Then in the unit test we'll need a way to force re-compiling the op with JIT, which can be done by wiping out the folder at TORCH_EXTENSIONS_DIR. Once you have the test that triggers the scalar path, I'd be happy to write the hooks needed to do the env variable pieces. A sketch of the idea is below.
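A minimal sketch of that override, assuming a `DS_SIMD_WIDTH` variable name and a `detect_simd_width` placeholder for the current detection logic (both are hypothetical, not merged DeepSpeed API):

```python
import os
import shutil

def detect_simd_width():
    # Placeholder for the existing architecture-capability detection in
    # op_builder/cpu_adam.py; returns the vector width the build should use.
    return 8  # pretend we detected AVX2

def simd_width():
    # DS_SIMD_WIDTH is a hypothetical variable name, not an existing flag.
    forced = os.environ.get("DS_SIMD_WIDTH")
    if forced is not None:
        return int(forced)  # e.g. DS_SIMD_WIDTH=1 forces the scalar path
    return detect_simd_width()  # unset: fall back to auto-detection

# In the unit test, wipe the extensions cache so the op is JIT-recompiled
# with the forced width on the next import.
ext_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR", os.path.expanduser("~/.cache/torch_extensions")
)
shutil.rmtree(ext_dir, ignore_errors=True)
```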
You could monkey-patch the function that returns the arch capability for the duration of the sub-test and lie that it's AMD ;)
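For example, with pytest's monkeypatch fixture (`cpu_arch` is a stand-in name here; the real detection function in op_builder/cpu_adam.py may be named differently):

```python
# Sketch only: force the builder down the scalar path for one test.
# "cpu_arch" is a hypothetical name for the arch-capability function.
def test_cpu_adam_scalar_mode(monkeypatch):
    import op_builder.cpu_adam as cpu_adam

    # Pretend we are on an AMD CPU so auto-detection picks scalar mode.
    monkeypatch.setattr(cpu_adam, "cpu_arch", lambda: "amd")

    # ...then rebuild the op and run a few Adam steps to repro the old segfault.
```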
I think that's a great idea for another unit test here. I think we want (at least) two unit tests here: (1) test that auto-detecting AMD triggers the scalar setup, (2) test that the new scalar logic works when enabled. Unfortunately our CI setup doesn't include any AMD CPUs currently, so this will be tricky for now, but I think we can add an A100 runner temporarily to test this. We've been experimenting with adding temporary GitHub Actions runners and it seems to be working out pretty well so far.
Thanks for the good suggestions; I will add these unit tests later.
This PR fixes the issue in the DeepSpeed CPU-Adam scalar mode, i.e., when running on CPUs that do not support the SIMD instructions it otherwise uses.
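For context, the scalar path performs the plain per-element Adam update; here is a minimal Python sketch of that math (illustrative only, the actual kernel is C++):

```python
import math

def adam_scalar_step(p, g, m, v, step,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One per-element Adam update, which is what the scalar (non-SIMD)
    # path computes; written in Python for clarity.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)  # bias correction
    v_hat = v / (1 - beta2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```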
Regarding performance, I see a 6x to 7x improvement over torch's Adam. For instance, it reduces the optimizer time from 56 seconds to 8.2 seconds for an 11.6B-parameter GPT-2 model.
TODO: CPU-Adam currently supports AVX instructions on Intel architectures. In the next PR, I will try to add AMD AVX support to speed up performance on those architectures as well.