[v1.x][KVStore]1Bit gradient compression #17952
Conversation
Hey @shuo-ouyang , Thanks for submitting the PR
CI supported jobs: [centos-gpu, miscellaneous, clang, unix-cpu, centos-cpu, sanity, website, windows-gpu, windows-cpu, unix-gpu, edge]
Great Work! LGTM : )
Thanks for your contribution!
@mxnet-bot run ci [unix-gpu, centos-gpu]
Jenkins CI successfully triggered: [unix-gpu, centos-gpu]
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered: [unix-gpu]
@mxnet-label-bot add [KVStore, Distributed]
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered: [unix-gpu]
@wkcn Thanks for your help and support. To be honest, I am not familiar with the Jenkins workflow, and I don't know why some checks failed when only one whitespace was deleted in code comments.
@shuo-ouyang It is not related to the code. The reason is that the CI is unstable now.
@wkcn I get it, thanks!
Hi @rahul003 , could you please help take a review? Thank you!
@mxnet-bot run ci [centos-gpu]
Jenkins CI successfully triggered: [centos-gpu]
Hi @eric-haibin-lin and @szha , could you please help take a review? Thank you!
@mxnet-bot run ci [all]
Jenkins CI successfully triggered: [edge, centos-gpu, windows-cpu, centos-cpu, windows-gpu, clang, unix-cpu, miscellaneous, website, unix-gpu, sanity]
Hi @eric-haibin-lin , could you please help take a review? Thank you!
Hi @szha , could the PR be merged?
@wkcn since @eric-haibin-lin recently updated the kvstore interface, it would be best to have a review from him. @eric-haibin-lin can you help?
Great work. Shall we add tests in https://github.com/apache/incubator-mxnet/blob/cb54a4a99463b23b8abaa2629661954c4ba3c60b/tests/nightly/dist_sync_kvstore.py#L93 and https://github.com/apache/incubator-mxnet/blob/7caffa65e30f37e70796ba165ac5a4265e64974e/tests/nightly/test_kvstore.py#L34 ?
What is the end2end speedup when using the 1bit compressor?
The accuracy drop on ResNet-20 is also very significant.
force-pushed from db71b9c to 674b3b4
force-pushed from 674b3b4 to 4710211
@eric-haibin-lin
force-pushed from 4710211 to 9512cd3
The results show that 2-bit compression yields higher test perplexity than 1-bit compression, which seems surprising. Any explanation on why that is the case?
The reason may be that the one-bit quantizer can exactly capture the direction (sign) of each gradient element when the threshold is 0, whereas the two-bit quantizer cannot due to its limitation (threshold != 0). By the way, the previous results were trained with an additional code modification to the error-feedback part, which may lead to counterintuitive results. We have corrected them and uploaded the right results.
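To make the contrast concrete, here is a schematic form of the two quantizers (our notation, not taken from the PR; τ denotes the threshold):

```latex
% 1-bit quantizer: only the sign relative to the threshold is kept.
Q_{\mathrm{1bit}}(g) =
  \begin{cases} +1, & g \ge \tau \\ -1, & g < \tau \end{cases}
\qquad \tau = 0 \text{ is allowed, so } \operatorname{sign}(g) \text{ is preserved exactly.}

% 2-bit quantizer: three levels, which only make sense for a positive threshold.
Q_{\mathrm{2bit}}(g) =
  \begin{cases} +\tau, & g > \tau \\ -\tau, & g < -\tau \\ 0, & \text{otherwise} \end{cases}
\qquad \text{requires } \tau > 0.
```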
Thanks for the explanation. The change looks good to me.
Description
Added a 1-bit gradient compression implementation, which achieves a speedup similar to 2-bit compression. It works with a threshold: gradient values above the threshold are quantized to +1, while values below the threshold are quantized to -1. Unlike 2-bit compression, this 1-bit implementation supports a zero threshold. In addition, 1-bit compression seems to perform better than the current implementation of 2-bit compression as the batch size increases.
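For illustration, here is a minimal per-element sketch of this rule (our own simplification with illustrative names, not the exact code in this PR), including the error-feedback residual that 1-bit SGD relies on:

```cpp
// Sketch only: encode one gradient value as +1/-1 relative to `threshold`,
// and carry the quantization error in `residual` so it is added back before
// the next compression step (error feedback).
inline float quantize_dequantize_1bit(float grad, float *residual,
                                      float threshold) {
  float corrected = grad + *residual;                  // apply error feedback
  float q = (corrected >= threshold) ? 1.0f : -1.0f;   // the only two levels
  *residual = corrected - q;                           // error carried forward
  return q;                                            // dequantized value
}
```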
Important files to review
gradient_compression-inl.h
gradient_compression.cc
Test accuracy of ResNet-110 on CIFAR-10 with 1/2/4 workers
Each worker is equipped with 4 Tesla V100 GPUs.
training command:
1 worker
2 workers
4 workers
4 workers, update_on_kvstore=false
Result of 2-layer LSTM trained on PTB with 1 worker
train perplexity
test perplexity
Throughput
In this part, we use example/imageclassification/benchmark.py to evaluate the performance of gradient compression. We have tried our best to optimize the kernel functions of the gradient compression operations. In the original kernel function, each thread manipulates 32 bits (4 bytes) of compressed gradients, whereas in our kernel function, each thread writes 8 bits (1 byte) at a time. The differences between the original and our implementations are as follows.
original kernel implementation:
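The original code is not reproduced here; the following is a simplified stand-in for the per-thread body (function and parameter names are ours): thread i packs 32 consecutive gradient values into one 32-bit word, i.e. one 4-byte write per thread.

```cpp
#include <cstdint>

// Sketch of the original layout: thread i owns one 32-bit word of compressed
// output. `residual` is the error-feedback buffer, `threshold` the 1-bit cut-off.
void quantize_1bit_per_word(int i, const float *grad, float *residual,
                            uint32_t *out, int num_elems, float threshold) {
  uint32_t word = 0;
  for (int j = 0; j < 32; ++j) {
    int idx = i * 32 + j;
    if (idx >= num_elems) break;
    float g = grad[idx] + residual[idx];               // apply error feedback
    bool positive = (g >= threshold);
    if (positive) word |= (1u << j);                   // bit set -> +1, clear -> -1
    residual[idx] = g - (positive ? 1.0f : -1.0f);     // keep quantization error
  }
  out[i] = word;                                       // single 4-byte write
}
```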
our kernel implementation:
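Again a simplified stand-in rather than the literal PR code: each thread now packs only 8 values into a single byte, so four times as many threads run, each with a shorter inner loop and a 1-byte write.

```cpp
#include <cstdint>

// Sketch of the optimized layout: thread i owns one byte of compressed output.
void quantize_1bit_per_byte(int i, const float *grad, float *residual,
                            uint8_t *out, int num_elems, float threshold) {
  uint8_t byte = 0;
  for (int j = 0; j < 8; ++j) {
    int idx = i * 8 + j;
    if (idx >= num_elems) break;
    float g = grad[idx] + residual[idx];               // apply error feedback
    bool positive = (g >= threshold);
    if (positive) byte |= static_cast<uint8_t>(1u << j);
    residual[idx] = g - (positive ? 1.0f : -1.0f);     // keep quantization error
  }
  out[i] = byte;                                       // single 1-byte write
}
```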
Our optimized implementation performs well when using multiple GPUs on a single machine. Here is a benchmark of the VGG16 model on a machine with 4 GPUs (unit: samples/sec).
However, our implementation does not work well when the compression operator is launched on the CPU, especially when we use multiple nodes and set kvstore=dist_sync. Under such circumstances, our new implementation may lead to a small throughput reduction. IMHO, one possible solution is to adopt different kernel functions for the CPU and GPU respectively; in other words, we can keep the original kernel for the CPU compression operation and use the new kernel for the GPU operation (still a work in progress). The following are benchmark tests of AlexNet and ResNet-50 on a cluster with at most 4 nodes.
AlexNet, batch size=128 on each GPU, dist_sync_device
ResNet-50, batch size=128 on each GPU, dist_sync_device
More performance tests will be released soon.
Related issues or PRs
signum with grad compression #9558
2bit gradient compression #8662
Reference
Seide F, Fu H, Droppo J, et al. 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs. Interspeech 2014.
Acknowledgment
Thanks to HiNA group for providing the experiments testbed.
Checklist
Essentials
Changes
Comments