Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dataloader issue #7505
Changes from all commits
cabc2f4
6ea5a2d
1da54a4
f87f54d
43c80b2
@@ -554,9 +554,11 @@ def validation_step(self, batch, batch_idx, dataloader_idx=0):
         self.reset_registry()
         del self._in_validation_step
 
-        return {
-            'val_loss': loss_value,
-        }
+        val_log_dict = {'val_loss': loss_value}
+
+        self.log_dict(val_log_dict)
+
+        return val_log_dict
 
     def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):
         val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
Is

The logging has to be done at validation step itself, akin to the change introduced in this PR for PTL upgrade - https://github.com/NVIDIA/NeMo/pull/6433/files#diff-b2780d88910b132d177fb0081453ad276c5e4aefe47a87f219e96f38af0625be

If we do the logging at multi_validation_epoch_end and multi_test_epoch_end, we still get the current error -

I'm refactoring the PR currently to make the logging similar to how we do it for ctc_models. I'll make the change for RNNT and Hybrid models for now; maybe we can open another PR next to address these issues for the SLU, SSL and label models.

New PR - #7531
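The pattern discussed in this thread (log inside `validation_step` and still return the dict so it can be aggregated later) can be sketched without any framework dependency. `RecordingModule` and `compute_loss` below are hypothetical stand-ins, and this `log_dict` only stores values to mimic the role of the real method; none of this is NeMo's or Lightning's actual implementation.

```python
from typing import Dict, List


class RecordingModule:
    """Hypothetical stand-in for a LightningModule; it only records logged values."""

    def __init__(self) -> None:
        self.logged: List[Dict[str, float]] = []

    def log_dict(self, metrics: Dict[str, float]) -> None:
        # Assumption: the real log_dict hands metrics to the trainer's logger.
        # Here we just keep a copy so the behaviour can be inspected.
        self.logged.append(dict(metrics))

    def compute_loss(self, batch: List[float]) -> float:
        # Placeholder loss (hypothetical): the mean of the batch values.
        return sum(batch) / len(batch)

    def validation_step(
        self, batch: List[float], batch_idx: int, dataloader_idx: int = 0
    ) -> Dict[str, float]:
        loss_value = self.compute_loss(batch)
        val_log_dict = {'val_loss': loss_value}
        # Log at step time, then return the same dict for epoch-end aggregation.
        self.log_dict(val_log_dict)
        return val_log_dict


module = RecordingModule()
batches = [[1.0, 3.0], [2.0, 4.0]]
outputs = [module.validation_step(b, i) for i, b in enumerate(batches)]
```

Every value returned from a step is also logged, which is the behaviour the diff introduces; whether all of those values should be logged is the question raised in the review comment that follows.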
Not all variables in `eval_dict` need logging. Please move `self.log()` into `multi_evaluation_epoch_end`, where it calculates the averaged metrics.
I think all these variables were logged previously too (example run - https://wandb.ai/nvidia/titanet-chime7-training?workspace=user-kdhawan). Please let me know if you want me to remove some of these from the log.
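For illustration, the reviewer's alternative (log only selected, averaged metrics at epoch end) might look roughly like the sketch below. It uses plain floats instead of tensors, and `average_selected`, `keys_to_log`, `val_acc` and `debug_counter` are hypothetical names chosen for the example, not identifiers from the PR.

```python
from statistics import fmean
from typing import Dict, Iterable, List


def average_selected(
    outputs: List[Dict[str, float]], keys_to_log: Iterable[str]
) -> Dict[str, float]:
    """Average only the chosen keys across the per-step output dicts."""
    return {key: fmean(step[key] for step in outputs) for key in keys_to_log}


# Per-step outputs, as an epoch-end hook would receive or collect them.
step_outputs = [
    {'val_loss': 0.5, 'val_acc': 0.75, 'debug_counter': 1.0},
    {'val_loss': 1.5, 'val_acc': 0.25, 'debug_counter': 2.0},
]

# Only these keys would be passed on to the logger; 'debug_counter' is skipped.
epoch_metrics = average_selected(step_outputs, keys_to_log=('val_loss', 'val_acc'))
```

In a real module the resulting dict would presumably be passed to the logging call once per epoch, which keeps unneeded per-step variables out of the logs.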