using skip_nan_grad with gradient accumulation for ASR #11272

Open
qhoangdl opened this issue Nov 13, 2024 · 0 comments
@qhoangdl

Hi, I noticed that while training the ASR FastConformer model on my data, after a few epochs there can be a batch that causes a NaN error. So I tried setting skip_nan_grad = True, which makes the train_loss curve look normal (no more spikes to NaN), but I noticed in the log that after the message "detected inf or nan values in gradients! Setting gradients to zero.", all predictions in the subsequent validation rounds are "??" and the validation WER increases to 1. I set accumulate_grad_batches to 4 for this training run, so I wonder whether that might be the culprit. Does skip_nan_grad = True still work correctly with gradient accumulation? I'd really appreciate any comments you can give on this issue!
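
For context, here is a minimal sketch of how I combine the two settings, assuming the standard NeMo + PyTorch Lightning training flow; the config file name and the exact model class are placeholders for my actual setup:

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf, open_dict
import nemo.collections.asr as nemo_asr

# Placeholder path; in practice this is my FastConformer training config.
cfg = OmegaConf.load("fastconformer_train.yaml")

with open_dict(cfg.model):
    # Zero out gradients whenever inf/NaN is detected after backward
    # (the flag that triggers "detected inf or nan values in gradients!").
    cfg.model.skip_nan_grad = True

with open_dict(cfg.trainer):
    # Accumulate gradients over 4 micro-batches before each optimizer step.
    cfg.trainer.accumulate_grad_batches = 4

trainer = pl.Trainer(**cfg.trainer)
# Model class shown for illustration; my actual model is a FastConformer ASR model.
asr_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)
trainer.fit(asr_model)
```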
