[BUG] Training hangs when each process is trained on different number of batches #2223
Comments
The common practice in data-parallel training is for all GPUs to process the same number of batches per step. The hang you are experiencing is unsurprising, since an imbalance in the amount of work done by the GPUs will break the tightly coordinated synchronizations. The computation would also be incorrect, since gradient reduction assumes each GPU processes the same amount of data. Can you please explain the motivation for wanting the GPUs to process different amounts of data?
Thanks for your quick response. I (and other people) encountered this issue when trying to reproduce a result. I don't see a strong motivation for each GPU doing different amounts of work. Thanks again for the clarification.
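The hang described above can be illustrated with a toy model (this is not DeepSpeed's actual implementation, just a sketch of the coordination logic): every `engine.step()` issues collectives that only complete once every rank has issued the matching call, so the number of steps that can complete is bounded by the rank with the fewest batches, and any rank with more batches blocks forever on its next collective.

```python
# Toy model of synchronized training steps (not real torch.distributed).
# Each step's gradient all-reduce completes only when *every* rank
# reaches that step, so completed steps == min(per-rank batch counts),
# and ranks with surplus batches are left waiting in the next collective.
def completed_steps(per_rank_batches):
    counts = list(per_rank_batches)
    done = min(counts)  # last step all ranks can jointly finish
    stuck = [rank for rank, c in enumerate(counts) if c > done]
    return done, stuck

# Batch counts from the reproduction below: rank 2 has 8, the rest have 4.
done, stuck = completed_steps([4, 4, 8, 4])
print(done, stuck)  # 4 steps complete; rank 2 hangs in its 5th step
```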
Describe the bug
I am trying to reproduce the work "Training BERT on an Academic Budget" using the codebase provided by the authors: https://github.com/IntelLabs/academic-budget-bert.
However, during training the model gets stuck after a few epochs. The root cause is that each process trains on a different number of batches, and training gets stuck during the ".step()" call on the DeepSpeed engine object. Specifically, this line: https://github.com/IntelLabs/academic-budget-bert/blob/04f6da685acf4dfc47b85b42307e17340e87fde3/run_pretraining.py#L219
Similar behaviour is also observed in this issue Lightning-AI/pytorch-lightning#13498
The issue disappears completely if the same number of batches is provided to all processes.
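One way to guarantee that, sketched below under the assumption of a drop-remainder sharding strategy (this is the same idea as `torch.utils.data.DistributedSampler` with `drop_last=True`, not the repo's actual data-loading code): give every rank `total // world_size` batches and discard the remainder, so the collectives inside `engine.step()` stay matched.

```python
# Hypothetical fix sketch: equal sharding with the remainder dropped,
# so every rank performs exactly the same number of training steps.
def shard_equally(num_batches_total, world_size):
    per_rank = num_batches_total // world_size  # drop the remainder
    return [per_rank] * world_size

# 20 total batches across 4 ranks -> each rank sees exactly 5 batches.
print(shard_equally(20, 4))  # [5, 5, 5, 5]
```

With uneven shards (e.g. 4/4/8/4 as in the reproduction below), equalizing like this removes the hang, at the cost of skipping the surplus batches.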
To Reproduce
Steps to reproduce the behavior:
Rank = 0, number of batches: 4
Rank = 1, number of batches: 4
Rank = 2, number of batches: 8
Rank = 3, number of batches: 4
Expected behavior
Training should not hang
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meets the required dependencies to JIT install the op.
JIT compiled ops require ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
Launcher context
deepspeed run_pretraining.py