Training hangs with DeepSpeed when DDP workers have different number of training batches #13498
Comments
This is not supported currently. We don't support uneven dataset sizes. Besides, the following would lead to a different set of problems too. I highly recommend you reconsider your approach. Sorry if this wasn't clear from the docs/tutorials.
Thank you for the clarification.
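For readers who hit the same hang: one way to avoid the unsupported uneven case is to cap every rank at the smallest per-rank batch count. The helper below is only a sketch of that idea, not an official Lightning or DeepSpeed API; it assumes the process group is already initialized, the per-rank batch count is known up front, and the rank's CUDA device has been set.

```python
import torch
import torch.distributed as dist


def common_batch_count(local_count: int) -> int:
    """Return the smallest number of batches any rank will see,
    so every DDP worker can stop after the same number of steps."""
    # All-reduce with MIN across the process group (NCCL supports MIN).
    count = torch.tensor(local_count, device=torch.cuda.current_device())
    dist.all_reduce(count, op=dist.ReduceOp.MIN)
    return int(count.item())
```

Each rank would then truncate its stream (for example with `itertools.islice`) to that count before handing it to the `DataLoader`.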
🐛 Bug
My use case involves streaming a large dataset for distributed training. During this process, each distributed worker may receive a different number of training batches. See the boring model example below for an equivalent case.
When the DeepSpeed integration is turned on, the code hangs after one full epoch. All GPUs sit at 100% utilization while GPU power draw remains low. I cannot pinpoint the error because keyboard interrupt does not work and I have to kill the processes.
Training does not hang when DeepSpeed is turned off. I am not sure whether this is a Lightning bug or a DeepSpeed bug.
To Reproduce
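The original boring-model script is not preserved in this thread; the following is a minimal sketch of an equivalent setup. It uses an `IterableDataset` so that no `DistributedSampler` is injected and each rank really does see a different number of batches; names such as `RandomIterableDataset` and the per-rank lengths are illustrative, not the reporter's actual code.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset
from pytorch_lightning import LightningModule, Trainer


class RandomIterableDataset(IterableDataset):
    """Streams random samples; the length can differ per DDP rank."""

    def __init__(self, size: int, length: int):
        self.size = size
        self.length = length

    def __iter__(self):
        for _ in range(self.length):
            yield torch.randn(self.size)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def train_dataloader(self):
        # Uneven on purpose: rank 0 streams more batches than rank 1,
        # mimicking a large streamed dataset of unknown length.
        length = 64 if self.global_rank == 0 else 32
        return DataLoader(RandomIterableDataset(32, length), batch_size=2)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = Trainer(
        accelerator="gpu",
        devices=2,
        strategy="deepspeed_stage_2",  # hangs after the first epoch; "ddp" finishes
        max_epochs=2,
    )
    trainer.fit(model)
```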
Expected behavior
Training finishes without hanging.
Environment
- CUDA:
  - GPU:
    - A100-SXM4-40GB
    - A100-SXM4-40GB
  - available: True
  - version: 11.3
- Packages:
  - numpy: 1.21.6
  - pyTorch_debug: False
  - pyTorch_version: 1.11.0+cu113
  - pytorch-lightning: 1.6.4
  - tqdm: 4.64.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.7.12
Additional context
cc @justusschock @awaelchli @ninginthecloud @rohitgr7 @otaj @SeanNaren @akihironitta