Hi, I've noticed that while training the ASR FastConformer model on my data, after a few epochs a batch can occasionally cause a NaN error. I tried setting skip_nan_grad = True, which makes the train_loss curve look normal (no spikes to NaN), but the error log shows that after the message "detected inf or nan values in gradients! Setting gradients to zero.", all predictions in the following validation rounds are "??" and the validation WER increases to 1. I set accumulate_grad_batches to 4 for this training run, so I wonder whether that might be the culprit. Can skip_nan_grad = True still work with gradient accumulation? I'd really appreciate any comments on this issue!
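For context, here is a minimal, self-contained PyTorch sketch (illustrative only, not NeMo's actual implementation) of the mechanics the question hinges on: with gradient accumulation, the gradients of several micro-batches share the same `p.grad` buffers, so a single non-finite micro-batch poisons the accumulated gradients and a skip-NaN check can only recover by zeroing the entire accumulation window.

```python
# Hypothetical sketch of a skip-NaN-grad check combined with gradient
# accumulation. All names here are illustrative, not NeMo's real code.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulate_grad_batches = 4

def grads_are_finite(module: nn.Module) -> bool:
    """Return False if any parameter gradient contains inf or NaN."""
    for p in module.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return False
    return True

for step in range(accumulate_grad_batches):
    x = torch.randn(4, 8)
    y = torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(model(x), y)
    # Simulate one bad micro-batch late in the accumulation window.
    if step == 3:
        loss = loss * float("inf")
    loss.backward()  # gradients from all micro-batches accumulate in p.grad

    if not grads_are_finite(model):
        # Once inf/NaN has been added into p.grad, the accumulated
        # gradients of the three healthy micro-batches are poisoned too
        # (inf + finite = inf), so the only safe recovery is to zero the
        # whole buffer -- the entire accumulation window is lost and the
        # optimizer step below becomes a no-op.
        print("detected non-finite gradients, zeroing")
        optimizer.zero_grad()

optimizer.step()   # steps with whatever is left in p.grad
optimizer.zero_grad()
```

Under this assumption, skipping NaN gradients is not incompatible with accumulate_grad_batches = 4; it just discards the full window whenever any micro-batch in it goes non-finite, which by itself should not drive validation WER to 1.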