Returning a different number of batches on each rank would lead to a different set of problems too.
Make sure you return the same dataset on all ranks so that the distributed sampler can shard the data equally across devices. The data should be split evenly between the ranks: there is no reason one GPU should do more work while the others sit idle.
I highly recommend you reconsider your approach. Sorry if this wasn't clear from the docs/tutorials.
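The setup the comment recommends can be sketched as follows. This is a minimal stand-alone example, not the reporter's code: the dataset, its size, and the two-rank configuration are made up for illustration. The key point is that every rank constructs the identical map-style dataset and lets `DistributedSampler` do the sharding, so all workers run the same number of batches.

```python
import torch
from torch.utils.data import Dataset, DistributedSampler

class SameEverywhere(Dataset):
    """Map-style dataset that is identical on every rank; the
    DistributedSampler then shards it evenly across devices."""
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.randn(32)

# Lightning normally injects the DistributedSampler for you; it is shown
# explicitly here. With num_replicas=2, each rank receives ceil(100 / 2) = 50
# indices, so every worker runs the same number of steps and the distributed
# collectives stay in lockstep.
sampler0 = DistributedSampler(SameEverywhere(), num_replicas=2, rank=0, shuffle=False)
sampler1 = DistributedSampler(SameEverywhere(), num_replicas=2, rank=1, shuffle=False)
print(len(sampler0), len(sampler1))  # prints: 50 50
```

With `shuffle=False` the two samplers simply interleave indices (rank 0 gets the even ones, rank 1 the odd ones), so no sample is seen twice and no rank falls behind.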
🐛 Bug
My use case involves streaming a large dataset for distributed training. During this process, each distributed worker may receive a different number of training batches. Please see the BoringModel example below for an equivalent case.
When DeepSpeed integration is turned on, the code hangs after one full epoch. All GPUs sit at 100% utilization while GPU power draw remains low. I cannot pinpoint the error because a keyboard interrupt has no effect and I have to kill all the processes.
Training does not hang when DeepSpeed is turned off. I'm not sure whether this is a Lightning bug or a DeepSpeed bug.
To Reproduce
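The original repro script was not captured in this extract. A minimal sketch of the kind of setup described above, assuming an `IterableDataset` whose length depends on the rank (the class name, sizes, and batch size here are made up for illustration), might look like:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class UnevenStream(IterableDataset):
    """Streams a rank-dependent number of samples, so each distributed
    worker sees a different number of batches per epoch."""
    def __init__(self, rank: int, base: int = 8):
        self.length = base + rank  # rank 1 yields one extra sample

    def __iter__(self):
        for _ in range(self.length):
            yield torch.randn(32)

# Under Lightning, a loader like this would be returned from the module's
# train_dataloader() on each rank and run with something like
#   Trainer(accelerator="gpu", devices=2, strategy="deepspeed_stage_2")
# After the first epoch, the rank with fewer batches finishes while the other
# blocks inside a collective, matching the reported 100%-utilization hang.
loader0 = DataLoader(UnevenStream(rank=0), batch_size=4)
loader1 = DataLoader(UnevenStream(rank=1), batch_size=4)
print(sum(1 for _ in loader0), sum(1 for _ in loader1))  # prints: 2 3
```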
Expected behavior
Training finishes without hanging.
Environment
- GPU:
  - A100-SXM4-40GB
  - A100-SXM4-40GB
- available: True
- version: 11.3
- numpy: 1.21.6
- pyTorch_debug: False
- pyTorch_version: 1.11.0+cu113
- pytorch-lightning: 1.6.4
- tqdm: 4.64.0
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.7.12
Additional context
cc @justusschock @awaelchli @ninginthecloud @rohitgr7 @otaj @SeanNaren @akihironitta