[WIP] distributed training #1334
base: master
Looks good. Building MXNet from source will work for distributed training.
Codecov Report
@@           Coverage Diff           @@
##           master   #1334   +/-   ##
=======================================
  Coverage   67.34%   67.34%
=======================================
  Files           2        2
  Lines          98       98
=======================================
  Hits           66       66
  Misses         32       32
I just found that the API behavior in the BytePS master branch changed recently. Not sure whether that is intended or a bug. Tracked here (bytedance/byteps#292).
@ZiyueHuang You may want to merge upstream/master, since we recently fixed the CI.
What are they?
Horovod has a similar issue: the actual broadcast happens after the first iteration. See here
To avoid the problems with Horovod, the convention in GluonNLP v1 is to not rely on deferred initialization when implementing a model. For example, for a dense layer, we should always specify in_units.
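To make the convention above concrete, here is a minimal pure-Python sketch of why an explicit input size matters. `ToyDense` and `ready_to_broadcast` are hypothetical names used only for illustration; this is not the Gluon API, and it assumes a layer can allocate its weight only once its input width is known.

```python
class ToyDense:
    """Hypothetical stand-in for a dense layer; not the Gluon API."""

    def __init__(self, units, in_units=None):
        self.units = units
        self.in_units = in_units
        self.weight = None

    def initialize(self):
        # Allocation is possible only when the input width is already known.
        if self.in_units is not None:
            self.weight = [[0.0] * self.in_units for _ in range(self.units)]

    def forward(self, x):
        if self.weight is None:
            # Deferred path: infer in_units from the first input, then allocate.
            self.in_units = len(x)
            self.initialize()
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


def ready_to_broadcast(layer):
    # A parameter can be synchronized across workers only once it exists.
    return layer.weight is not None


deferred = ToyDense(units=2)               # input width unknown
deferred.initialize()
explicit = ToyDense(units=2, in_units=3)   # input width given up front
explicit.initialize()

status_after_initialize = (ready_to_broadcast(deferred), ready_to_broadcast(explicit))

deferred.forward([1.0, 2.0, 3.0])          # first forward triggers allocation
status_after_forward = ready_to_broadcast(deferred)
```

In this toy, only the layer with an explicit `in_units` has a weight to synchronize right after `initialize`; the deferred one has nothing to broadcast until its first forward pass.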
Sorry for the late reply. @szha The core dump due to undefined symbols is fixed after the 0820 wheel, and I didn't record the undefined symbols. For the segfault, below is the stack trace
@sxjscience @szhengac Horovod doesn't have a similar issue, since in
You can report it as a bug in the MXNet issue tracker.
Description
Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the ELECTRA model (and other models). Tested on both a single worker and two workers, each with multiple GPU cards. However, there are two issues:

1. We should call `trainer._init_params()` before the first iteration (forward/backward) to synchronize the parameters across all workers. Otherwise, `trainer` will call `_init_params` inside `allreduce_grads` (see mxnet/python/mxnet/gluon/trainer.py), so the synchronization of the parameters actually happens after the first forward/backward computation. This means that in the first iteration the gradients on different workers are computed w.r.t. different parameters. In practice this may not be a severe problem, as only the first gradient descent step is not totally correct.

   When the model in gluon-nlp conforms to the new coding standard (removing defer_init for all parameters, @sxjscience), we can directly call `trainer._init_params` after `model.initialize`.

   Actually, I am a little confused about the semantic meaning of defer_init in the numpy mode, because a shape containing zeros (such as `(0, 5)`) is treated in the numpy mode as the shape of a scalar tensor or a zero-size tensor, instead of as unknown as in the legacy mode.

2. I found that we have to build MXNet and BytePS from source on our target machine. Using `pip install mxnet` (I have tried several wheels from August) does not work for BytePS. It either core dumps immediately after `import bps.mxnet` due to undefined symbols (seemingly because some symbols are not exported into the binary; fixed after the 0820 wheel), or hits a mysterious segfault (maybe due to C++ ABI issues or other compiler issues, as the MXNet wheels and BytePS, which is set up on our target machine, may not be compiled with the same version of gcc. Besides, there are no related compiler flags such as D_GLIBCXX_USE_CXX11_ABI for MXNet in BytePS's setup.py).

Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team
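
To illustrate the first issue in the description, here is a toy pure-Python simulation of the two orderings. It does not call MXNet or BytePS; `grad`, `broadcast`, `allreduce_mean`, and `first_step` are hypothetical helpers, the loss is assumed to be 0.5 * w^2 (so the gradient equals w), and the broadcast is assumed to copy worker 0's parameters to everyone.

```python
def grad(w):
    # Gradient of the assumed toy loss 0.5 * w**2.
    return w


def broadcast(params, root=0):
    # Every worker receives the root worker's parameter (assumed behavior).
    return [params[root]] * len(params)


def allreduce_mean(values):
    # Every worker receives the average of all workers' values.
    mean = sum(values) / len(values)
    return [mean] * len(values)


def first_step(params, lr, sync_before_forward):
    if sync_before_forward:
        # Eager path: synchronize parameters before forward/backward.
        params = broadcast(params)
    grads = [grad(w) for w in params]
    if not sync_before_forward:
        # Lazy path: parameters are synchronized only when gradients are reduced,
        # so the gradients above were computed on unsynchronized parameters.
        params = broadcast(params)
    grads = allreduce_mean(grads)
    return [w - lr * g for w, g in zip(params, grads)], grads


init = [1.0, 3.0]  # two workers with different random initializations

eager_params, eager_grads = first_step(init, lr=0.5, sync_before_forward=True)
lazy_params, lazy_grads = first_step(init, lr=0.5, sync_before_forward=False)

print(eager_grads, eager_params)  # [1.0, 1.0] [0.5, 0.5]
print(lazy_grads, lazy_params)    # [2.0, 2.0] [0.0, 0.0]
```

In both orderings the workers end the step with identical parameters, but in the lazy ordering the averaged gradient mixes gradients taken at different parameter values, so only that first update differs.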