Add CompressedBackend for Onebit optimizers #5473
tjruwase merged 19 commits into deepspeedai:master
Conversation
csrc/xpu/packbits/packing.cpp
Outdated

at::Tensor packbits(at::Tensor tensor, int input_size, int rank)
{
    /*
@Liangliang-Ma the function documentation needs to be moved to line 39 right before the function def line.
@tjruwase this PR is an approach to abstract the generic part of 1-bit Adam and implement the accelerator-dependent part with a DeepSpeed custom op builder, so 1-bit Adam does not need to depend on accelerator-specific libraries. @inkcherry I remember you investigated 1-bit Adam portability before; FYI, this PR implements a portable version of 1-bit Adam support.
Hi @tjruwase, could you please help to review this PR? Thanks!
add README.md for onebit tests
@tjruwase I have noticed that in the onebit unit test, the onebit comm backend is assigned like this:
@tjruwase Hi, may I ask if you could help to review my last comment, or merge this one first? Thanks!
@Liangliang-Ma, apologies for the delay. I am still thinking about your last comment, but will not delay this PR.
In the process of adding onebit optimizer support for XPU devices, we noticed that across accelerators, the main difference in the implementation of `compressed_allreduce` lies in `packbits` and `unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing these with XPU-only functions, we provide a CompressedBackend to do the `compressed_allreduce` work, where users can add their own packbits/unpackbits kernels. This is a general path for all kinds of accelerators. In this PR, we:

1. Add CompressedBackend for OnebitAdam, OnebitLamb and ZeroOneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built via PackbitsBuilder
3. Add tests for onebit with CompressedBackend

--------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
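To make the abstraction concrete, here is a minimal, hedged sketch of the pack/unpack step that the accelerator-specific kernels perform: 1-bit compression keeps only the sign of each gradient element and packs 8 signs per byte. This is illustrative only (using NumPy's `packbits`/`unpackbits` on the host); the function names `compress_signs`/`decompress_signs` are hypothetical and not the DeepSpeed API, which provides this via CUDA/SYCL kernels behind CompressedBackend.

```python
import numpy as np

def compress_signs(tensor: np.ndarray) -> np.ndarray:
    """1-bit compression: keep only the sign bit of each element,
    packed 8 signs per byte. (Hypothetical helper, not DeepSpeed API.)"""
    bits = (tensor > 0).astype(np.uint8)   # 1 for positive, 0 otherwise
    return np.packbits(bits)               # 8x smaller payload to allreduce

def decompress_signs(packed: np.ndarray, numel: int) -> np.ndarray:
    """Recover a +1/-1 tensor from the packed sign bits."""
    bits = np.unpackbits(packed, count=numel)
    return bits.astype(np.float32) * 2.0 - 1.0

grad = np.array([0.5, -1.2, 3.3, -0.1, 0.0, 2.0, -4.0, 1.0, 0.7],
                dtype=np.float32)
packed = compress_signs(grad)              # 9 floats -> 2 bytes
signs = decompress_signs(packed, grad.size)
```

In the real optimizers, the error between the original gradient and its sign approximation is fed back into the next step (error compensation); only the packing/unpacking shown here is accelerator-dependent, which is why it is the piece CompressedBackend lets each accelerator supply.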
This one is document supplement for #5473. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>