--iter_per_step eventually hangs with 8x V100 16GB #532
Comments
(1) Is it reaching the end of training and then hanging?
Hi, same here. When training QuartzNet15x5 on 4 GPUs, the procedure stalls after some time, and always at an evaluation. By "an" evaluation I mean that usually a couple of evals are fine, but around the tenth eval or so (my eval frequency is set to 500 steps), the training/eval stalls. Ctrl-C interrupts the distributed launch process, but the main computing Python processes keep running at 100% CPU and 100% GPU, and I have to do killall python. This is difficult to debug, so I have placed dummy log messages all around the _eval method of the PtActions class, since that is where the stall happens. The interesting thing is that the rank 0 process's last message comes from around the "tensor_on_worker = registered_e_tensors[key]" line in _eval. Other ranks, however, hang at other places in the train method, and interestingly their last message (or rather a set of consecutive messages) is:
I see this message a couple of times at other places as well, but what it tells me is that when the eval procedure stalls, some of the ranks don't even make it to _eval. I will keep observing and will eventually write more details. My guess is that it is some kind of unresolved barrier, with processes waiting for each other. I saw a similar behavior long ago when using MPI for training i-vectors: waiting at a barrier would consume 100% of CPU. Thanks
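For readers trying the same kind of debugging, here is a minimal sketch (not NeMo code; the helper name `rank_log` is hypothetical) of the per-rank logging described above: tag every message with the rank and flush immediately, so that when the job stalls, the last line printed by each rank shows where it got stuck.

```python
# Hypothetical per-rank logging helper, assuming torch.distributed may or
# may not be initialized in the process that calls it.
import datetime
import torch.distributed as dist

def rank_log(msg: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    ts = datetime.datetime.now().strftime("%H:%M:%S.%f")
    # flush=True so the message is visible even if the process hangs right after
    print(f"[{ts}][rank {rank}] {msg}", flush=True)

# Usage inside the training/eval code, e.g.:
# rank_log("entering _eval")
# rank_log(f"fetching registered tensor {key}")
# rank_log("after all_reduce in backward")
```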
Hi, thanks for the info! I've been looking at this deadlock and I believe it's caused by a bug in actions.py: if one worker has a NaN loss, it de-syncs from the other workers, which causes a deadlock during a later backward() all_reduce. We're currently discussing and testing a potential fix.
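For illustration only, here is a minimal sketch of the failure mode described above (not NeMo's actual code; it assumes a torch.distributed process group is already initialized across several workers): if one rank skips its gradient all_reduce because its loss is NaN while the other ranks proceed, the collective on the other ranks never completes and every process spins waiting.

```python
# Sketch of the de-sync bug pattern, not a fix. Assumes dist.init_process_group()
# has already been called on every worker.
import torch
import torch.distributed as dist

def risky_step(loss: torch.Tensor, grad: torch.Tensor) -> None:
    if torch.isnan(loss).any():
        # Bug pattern: this rank bails out and never joins the all_reduce
        # below, so every other rank blocks in it forever (at 100% CPU/GPU).
        return
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # gradient averaging step
```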
Thanks for the update. Speaking of NaN/inf loss, what is the main cause of getting this? I'd assume too little data in the batch?
Hopefully that PR should fix this issue! Please be aware, though, that with this change, if you are not using apex.amp O1 or higher, training will terminate upon seeing a NaN or inf loss on any worker. We are considering doing something more intelligent, such as skipping the culprit worker's batch or skipping that one step across all workers, but no promises as to when/if this will be implemented, since that would be a much more involved fix. (I made a few small attempts but don't currently have the bandwidth to look into it further.) As for NaN/inf loss, too small a batch or too high a learning rate may be the cause. I've also seen this happen with mismatches between the model's vocabulary and the labels (usually a normalization issue).
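As a hedged sketch of the "skip that one step across all workers" idea mentioned above (this is not what the referenced PR implements, and the function name is ours): all ranks can exchange a "loss is finite" flag via an all_reduce before backward(), so every rank takes the same branch and no rank is left alone in a later collective.

```python
# Sketch only; assumes an initialized torch.distributed process group and a
# scalar loss tensor already on the right device.
import torch
import torch.distributed as dist

def loss_is_globally_finite(loss: torch.Tensor) -> bool:
    flag = torch.tensor(
        [0.0 if torch.isfinite(loss).all() else 1.0], device=loss.device
    )
    # Every rank participates, so this collective itself cannot de-sync.
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)
    return flag.item() == 0.0

# In the training loop (pseudocode):
# if loss_is_globally_finite(loss):
#     loss.backward(); optimizer.step()
# else:
#     optimizer.zero_grad()  # all ranks skip this step together
```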
When training the jasper10x5dr ASR model, we have access to 8x V100 16GB GPUs from Google Cloud, as opposed to the 32GB models used in the documentation. We lowered the batch size from 64 to 32 and set iter_per_step to 2 to compensate, but after roughly 8 hours of training the process consistently hangs.
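For context, a quick check of the intent behind that change (values taken from this issue; the variable names below are ours, not NeMo parameters, and we assume the documented 32GB setup used a single accumulation step):

```python
# Effective-batch-size arithmetic behind lowering batch size and raising
# iter_per_step (gradient accumulation) together.
num_gpus = 8
batch_per_gpu_32gb = 64   # documented setup on 32 GB V100s
batch_per_gpu_16gb = 32   # reduced to fit in 16 GB
iter_per_step = 2         # accumulation steps per optimizer step

effective_reference = batch_per_gpu_32gb * num_gpus               # 512
effective_ours = batch_per_gpu_16gb * num_gpus * iter_per_step    # 512
assert effective_reference == effective_ours
```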
The symptoms:
nvidia-smi reports all 8 V100s at 100% usage, but at idle power draw. We have tried NVIDIA drivers 440.33.01 and 418.87.00_1, but the same result happens each time.