Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dataloader issue #7505
Conversation
LGTM. Did you check for other models as well?
@@ -601,6 +601,9 @@ def validation_pass(self, batch, batch_idx, dataloader_idx):
        if AccessMixin.is_access_enabled():
            AccessMixin.reset_registry(self)

        # adding this as return values are no longer logged automatically in PTL 2.0
remove the comments
Fixed in the latest commit. Also added the necessary logging changes for label_models, slu_models, and ssl_models.
Thanks Kunal. @athitten how is this different from adding to test_step_outputs or validation_step_outputs?
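(For context, a rough sketch of the PTL 2.0 pattern that validation_step_outputs refers to; the class name and the placeholder loss below are illustrative, not the actual NeMo implementation.)

```python
import torch
import pytorch_lightning as pl


class ASRModuleSketch(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # PTL 2.0 dropped the `outputs` argument from the epoch-end hooks,
        # so per-batch results have to be collected manually.
        self.validation_step_outputs = []

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        val_loss = torch.tensor(0.0)  # placeholder for the real per-batch loss
        self.validation_step_outputs.append({'val_loss': val_loss})
        return val_loss

    def on_validation_epoch_end(self):
        # Aggregate the collected per-batch values into epoch-level metrics.
        val_loss_mean = torch.stack(
            [x['val_loss'] for x in self.validation_step_outputs]
        ).mean()
        self.log('val_loss', val_loss_mean, sync_dist=True)
        self.validation_step_outputs.clear()  # free memory for the next epoch
```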
        self.log_dict(val_log_dict)

        return val_log_dict

    def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):
        val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
Is loss_val_mean the loss value that should be logged? I mean, should we add the logging in the multi_validation_epoch_end function instead? The one in validation_step is per-batch, but we need to monitor the mean loss over all the validation data. Similarly for test_step and multi_test_epoch_end.
The logging has to be done at the validation step itself, akin to the change introduced in this PR for the PTL upgrade: https://github.com/NVIDIA/NeMo/pull/6433/files#diff-b2780d88910b132d177fb0081453ad276c5e4aefe47a87f219e96f38af0625be
If we do the logging at multi_validation_epoch_end and multi_test_epoch_end, we still get the current error: ModelCheckpoint(monitor='val_wer') could not find the monitored key in the returned metrics: ['train_loss', 'learning_rate', 'global_step', 'train_backward_timing in s', 'train_step_timing in s', 'training_batch_wer', 'epoch', 'step']. HINT: Did you call log('val_wer', value) in the LightningModule?
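(For illustration: ModelCheckpoint resolves its monitor key against metrics recorded via self.log / self.log_dict, not against values merely returned from a hook, so val_wer has to be logged somewhere in the module. A minimal sketch, assuming a hypothetical _compute_batch_wer helper.)

```python
def validation_step(self, batch, batch_idx, dataloader_idx=0):
    # `_compute_batch_wer` is a hypothetical helper standing in for the real WER update.
    wer = self._compute_batch_wer(batch)
    # Logging with on_epoch=True makes Lightning aggregate the per-batch values into
    # an epoch-level 'val_wer' that ModelCheckpoint(monitor='val_wer') can find.
    self.log('val_wer', wer, on_epoch=True, sync_dist=True)
```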
I'm currently refactoring the PR to make the logging similar to how we do it for ctc_models. I'll make the change for the RNNT and Hybrid models for now; maybe we can open another PR next to address these issues for the SLU, SSL, and label models.
New PR - #7531
            f'{tag}_loss': loss_value,
            f'{tag}_correct_counts': correct_counts,
            f'{tag}_total_counts': total_counts,
            f'{tag}_acc_micro_top_k': acc_top_k,
            f'{tag}_acc_macro_stats': stats,
        }

        self.log_dict(eval_dict)
Not all variables in eval_dict need logging; please move self.log() into multi_evaluation_epoch_end, where it calculates the averaged metrics.
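(A rough sketch of that suggestion, assuming outputs carries the per-batch entries shown in the diff above; the exact signature and metric names are simplified, not the NeMo API.)

```python
def multi_evaluation_epoch_end(self, outputs, dataloader_idx=0, tag='val'):
    # Aggregate the per-batch counts first, then log only the averaged metrics;
    # the raw correct/total counts stay out of the logger.
    loss_mean = torch.stack([x[f'{tag}_loss'] for x in outputs]).mean()
    correct = torch.stack([x[f'{tag}_correct_counts'] for x in outputs]).sum()
    total = torch.stack([x[f'{tag}_total_counts'] for x in outputs]).sum()
    self.log_dict({
        f'{tag}_loss': loss_mean,
        f'{tag}_acc': correct.float() / total.float(),
    }, sync_dist=True)
```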
I think all of these variables were logged previously too (example run: https://wandb.ai/nvidia/titanet-chime7-training?workspace=user-kdhawan). Please let me know if you want me to remove some of them from the log.
Opened a new PR for this issue - #7531
What does this PR do?
This PR adds fixes for PTL 2.0-related ASR bugs in r1.21.0: validation metrics logging and the None dataloader issue.
Collection:
ASR, Core
Changelog
Before your PR is "Ready for review"
Pre checks:
PR Type:
Additional Information