--iter_per_step eventually hangs with 8x V100 16GB #532
Comments
(1) Is it reaching the end of training and then hanging?
Hi, same here. When training QuartzNet15x5 on 4 GPUs, the procedure stalls after some time, and always at an evaluation. By "an" evaluation I mean that usually a couple of evals are fine, but around the tenth eval or so (my eval frequency is set to 500 steps), the training/eval stalls. Ctrl-C interrupts the distributed launch process, but the main computing Python processes keep running at 100% CPU and 100% GPU, and I have to do killall python. This is difficult to debug, so I have placed dummy log messages all around the _eval method of the PtActions class, since that is where the stall happens. The interesting thing is that the rank 0 process's last message comes from around the "tensor_on_worker = registered_e_tensors[key]" line in _eval. Other ranks, however, hang at other places in the train method, and interestingly their last message (or rather a set of consecutive messages) is:
I see this message a couple of times at other places as well, but what it tells me is that when the eval procedure stalls, some of the ranks don't even make it to _eval. I will keep observing and will eventually write more details. My guess is that it is some kind of unresolved barrier, with processes waiting for each other. I saw a similar behavior long ago when using MPI for training i-vectors: waiting at a barrier would consume 100% of CPU. Thanks
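For readers trying the same kind of debugging, here is a minimal sketch (not NeMo code; the helper name `rank_log` is hypothetical) of the per-rank logging described above: tag every message with the rank and flush immediately, so that when the job stalls, the last line printed by each rank shows where it got stuck.

```python
# Hypothetical per-rank logging helper, assuming torch.distributed may or
# may not be initialized in the process that calls it.
import datetime
import torch.distributed as dist

def rank_log(msg: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    ts = datetime.datetime.now().strftime("%H:%M:%S.%f")
    # flush=True so the message is visible even if the process hangs right after
    print(f"[{ts}][rank {rank}] {msg}", flush=True)

# Usage inside the training/eval code, e.g.:
# rank_log("entering _eval")
# rank_log(f"fetching registered tensor {key}")
# rank_log("after all_reduce in backward")
```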
Hi, thanks for the info! I've been looking at this deadlock and I believe it's caused by a bug in actions.py: if one worker has a NaN loss, it de-syncs from the other workers, which causes a deadlock during a later backward() all_reduce. We're currently discussing and testing a potential fix.
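For illustration only, here is a minimal sketch of the failure mode described above (not NeMo's actual code; it assumes a torch.distributed process group is already initialized across several workers): if one rank skips its gradient all_reduce because its loss is NaN while the other ranks proceed, the collective on the other ranks never completes and every process spins waiting.

```python
# Sketch of the de-sync bug pattern, not a fix. Assumes dist.init_process_group()
# has already been called on every worker.
import torch
import torch.distributed as dist

def risky_step(loss: torch.Tensor, grad: torch.Tensor) -> None:
    if torch.isnan(loss).any():
        # Bug pattern: this rank bails out and never joins the all_reduce
        # below, so every other rank blocks in it forever (at 100% CPU/GPU).
        return
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # gradient averaging step
```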
Thanks for the update. Speaking of NaN/inf loss, what is the main cause of getting this? I'd assume too little data in the batch?
Hopefully that PR should fix this issue! Please be aware, though, that with this change, if you are not using apex.amp O1 or higher, training will terminate upon seeing a NaN or inf loss on any worker. We are considering doing something more intelligent, such as skipping the culprit worker's batch or skipping that one step across all workers, but no promises as to when/if this will be implemented, since that would be a much more involved fix. (I made a few small attempts but don't currently have the bandwidth to look into it further.) As for NaN/inf loss, too small a batch or too high a learning rate may be the cause. I've also seen this happen with mismatches between the model's vocabulary and the labels (usually a normalization issue).
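As a hedged sketch of the "skip that one step across all workers" idea mentioned above (this is not what the referenced PR implements, and the function name is ours): all ranks can exchange a "loss is finite" flag via an all_reduce before backward(), so every rank takes the same branch and no rank is left alone in a later collective.

```python
# Sketch only; assumes an initialized torch.distributed process group and a
# scalar loss tensor already on the right device.
import torch
import torch.distributed as dist

def loss_is_globally_finite(loss: torch.Tensor) -> bool:
    flag = torch.tensor(
        [0.0 if torch.isfinite(loss).all() else 1.0], device=loss.device
    )
    # Every rank participates, so this collective itself cannot de-sync.
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)
    return flag.item() == 0.0

# In the training loop (pseudocode):
# if loss_is_globally_finite(loss):
#     loss.backward(); optimizer.step()
# else:
#     optimizer.zero_grad()  # all ranks skip this step together
```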
When training the jasper10x5dr ASR model, we have access to 8x V100 16GB GPUs from Google Cloud, as opposed to the 32GB models used in the documentation. We lowered the batch size from 64 to 32 and set iter_per_step to 2 to compensate, but after roughly 8 hours of training the process consistently hangs.
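For context, a quick check of the intent behind that change (values taken from this issue; the variable names below are ours, not NeMo parameters, and we assume the documented 32GB setup used a single accumulation step):

```python
# Effective-batch-size arithmetic behind lowering batch size and raising
# iter_per_step (gradient accumulation) together.
num_gpus = 8
batch_per_gpu_32gb = 64   # documented setup on 32 GB V100s
batch_per_gpu_16gb = 32   # reduced to fit in 16 GB
iter_per_step = 2         # accumulation steps per optimizer step

effective_reference = batch_per_gpu_32gb * num_gpus               # 512
effective_ours = batch_per_gpu_16gb * num_gpus * iter_per_step    # 512
assert effective_reference == effective_ours
```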
The symptoms:
nvidia-smi reports all 8 V100s at 100% usage, but at idle power draw. We have tried NVIDIA drivers 440.33.01 and 418.87.00_1, but the same result happens each time.