Allow disabling sanity checking when num_sanity_val_steps=0 #7413

Merged · 4 commits · Sep 12, 2023
@@ -211,8 +211,9 @@ def _reconfigure_val_batches(self):
         """
         # Override limit_val_batches to be a multiple of num microbatches and so there are limit_val_batches//num_micro_batches num of global batches
         self.trainer.limit_val_batches *= get_num_microbatches()
-        # Override num sanity steps equal to num of microbatches and perform one val_step
-        self.trainer.num_sanity_val_steps = get_num_microbatches()
+        # Override num sanity steps equal to num of microbatches and perform one val_step if num_sanity_val_steps is not 0
+        if self.trainer.num_sanity_val_steps:
+            self.trainer.num_sanity_val_steps = get_num_microbatches()
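The patched behavior can be sketched in isolation. This is a minimal, self-contained stand-in (the Trainer stub, its default values, and the hard-coded microbatch count are assumptions for illustration, not the real PyTorch Lightning / Megatron objects):

```python
# Minimal sketch of the patched _reconfigure_val_batches logic,
# using simplified stand-ins for the real Trainer and microbatch helper.

class Trainer:
    def __init__(self, limit_val_batches=1.0, num_sanity_val_steps=2):
        self.limit_val_batches = limit_val_batches
        self.num_sanity_val_steps = num_sanity_val_steps

def get_num_microbatches():
    return 4  # assumed value for illustration

def reconfigure_val_batches(trainer):
    # Scale limit_val_batches so it is measured in microbatches.
    trainer.limit_val_batches *= get_num_microbatches()
    # Only override the sanity steps when they are non-zero, so that
    # num_sanity_val_steps=0 still disables sanity checking entirely.
    if trainer.num_sanity_val_steps:
        trainer.num_sanity_val_steps = get_num_microbatches()

disabled = Trainer(num_sanity_val_steps=0)
reconfigure_val_batches(disabled)
print(disabled.num_sanity_val_steps)  # 0: sanity checking stays off
print(disabled.limit_val_batches)     # 4.0: val batches still scaled
```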
Collaborator:

This should also be *=, like limit_val_batches, since everything is measured in global batches.

Collaborator:
We should not ignore num_sanity_val_steps; we use it on the NeMo side in a lot of use cases, like doing validation before training/tuning on a finetuned model.

We should rather set it to:
self.trainer.num_sanity_val_steps *= get_num_microbatches()
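The *= form the reviewer suggests also preserves the disabled case without an explicit check, since multiplying 0 by any microbatch count leaves it 0. A quick sketch (the microbatch count of 4 is an assumed value):

```python
# Sketch: with *=, a value of 0 stays 0, so the `if` guard becomes
# unnecessary while non-zero settings scale into microbatch units.
num_microbatches = 4  # assumed value for illustration

for num_sanity_val_steps in (0, 2):
    scaled = num_sanity_val_steps * num_microbatches
    print(num_sanity_val_steps, "->", scaled)
# 0 -> 0   (sanity checking stays disabled)
# 2 -> 8   (2 global sanity batches of 4 microbatches each)
```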

Collaborator Author:

Yeah, we can have that and avoid the if-condition check. Let me modify it. Initially, with self.trainer.num_sanity_val_steps = get_num_microbatches(), my intention was to do just one sanity validation_step. But we can keep it as self.trainer.num_sanity_val_steps *= get_num_microbatches() to stay in sync with limit_val_batches.

Collaborator Author:

Fixed.

Collaborator:

qq: if num_sanity_val_steps is not set, what is the default value? Just want to ensure we don't get a multiplying-by-None error.
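For reference, PyTorch Lightning's default for num_sanity_val_steps is the integer 2 (with 0 disabling the check), so the *= form multiplies an int rather than None. A small sketch of the failure mode being guarded against (scale is a hypothetical helper, not part of either codebase):

```python
# Sketch of the concern: an int default multiplies cleanly,
# while a None value would raise a TypeError.
def scale(num_sanity_val_steps, num_microbatches=4):
    try:
        return num_sanity_val_steps * num_microbatches
    except TypeError as exc:
        return f"error: {exc}"

print(scale(2))     # 8 -- an int default scales fine
print(scale(None))  # TypeError, reported here as an "error: ..." string
```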


def _enable_nvidia_optimizations(self):
"These optimizations are present in NVIDIA NGC PyTorch Containers"