[release-test] Disable drop_last flag to fix division by zero in torch dataloader baselines
#56395
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request effectively addresses the ZeroDivisionError by setting drop_last=False for the PyTorch DataLoader. This prevents the validation dataloader from becoming empty when a worker's shard of the dataset is smaller than the batch size. The assert added in runner.py is also a good defensive measure to ensure the validation set is not empty. My main feedback concerns a subtle issue that this change exposes: the num_rows calculation for the validation loss is now inaccurate for the last batch. I've left a specific comment on how to address this. Overall, this is a good fix for the reported bug.
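For context, a minimal repro sketch of the failure mode (a toy tensor dataset stands in for one worker's validation shard; the names and sizes are illustrative, not taken from the release test):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

shard = TensorDataset(torch.randn(6, 3))  # 6 validation rows on this worker

loader = DataLoader(shard, batch_size=32, drop_last=True)
print(len(loader))  # 0 -- the lone partial batch is dropped

total_loss, num_rows = 0.0, 0
for (features,) in loader:  # never entered
    total_loss += features.sum().item()
    num_rows += features.size(0)
# mean_loss = total_loss / num_rows  # ZeroDivisionError before this PR

loader = DataLoader(shard, batch_size=32, drop_last=False)
print(len(loader))  # 1 -- the partial batch is kept, so num_rows stays > 0
```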
  self._metrics["validation/rows_processed"].add(
      self.benchmark_config.dataloader_config.validation_batch_size
  )
+ assert num_rows > 0, "Validation dataset yielded no batches."
This assertion is a good safeguard against the ZeroDivisionError. However, the change to drop_last=False in torch_dataloader_factory.py surfaces a subtle bug in the calculation of num_rows. On line 200, num_rows is incremented by the fixed validation_batch_size. With drop_last=False, the last batch may be partial, causing num_rows to be overestimated and the calculated validation/loss to be incorrect. To fix this, you should use the actual size of the batch. For example, you could change line 200 to num_rows += batch[0].size(0) (assuming batch is a tuple where the first element is the input tensor). A similar issue, already noted with a TODO, exists in _train_epoch.
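A hedged sketch of the suggested per-batch counting, assuming the collated batch is an (inputs, targets) tuple of tensors; the toy dataloader below is a stand-in for the real validation dataloader in runner.py:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 rows with batch_size=4 collate into batches of 4, 4, and 2.
val_dataloader = DataLoader(
    TensorDataset(torch.randn(10, 3), torch.randint(0, 2, (10,))),
    batch_size=4,
    drop_last=False,
)

total_loss, num_rows = 0.0, 0
for inputs, targets in val_dataloader:
    total_loss += inputs.sum().item()  # placeholder for the real loss
    # Count the rows actually yielded; with drop_last=False the last
    # batch may be partial, so adding the fixed batch size would overcount.
    num_rows += inputs.size(0)

assert num_rows > 0, "Validation dataset yielded no batches."
mean_loss = total_loss / num_rows  # num_rows == 10, not 3 * 4 == 12
```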
Is this a real bug?
Yeah, but I already have it as a TODO in the code. It's a bit hard to get the real size of the batch (it's usually batch_size except for the last one) because the collated format is task-specific right now (image vs. torchrec training).
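If it helps, one possible shape for that TODO, sketched against hypothetical collated formats (a tensor tuple for image training and a dict of feature tensors for torchrec); the benchmark's actual formats may differ:

```python
import torch

def infer_batch_size(batch) -> int:
    """Best-effort row count for a collated batch. The formats handled here
    are assumptions, not the benchmark's actual collate contract."""
    if torch.is_tensor(batch):
        return batch.size(0)
    if isinstance(batch, (tuple, list)) and torch.is_tensor(batch[0]):
        # e.g. image training: (images, labels)
        return batch[0].size(0)
    if isinstance(batch, dict):
        # e.g. a mapping of feature tensors sharing the batch dimension
        return next(iter(batch.values())).size(0)
    raise TypeError(f"Unrecognized collated batch type: {type(batch)!r}")
```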
  prefetch_factor=prefetch_factor,
  timeout=timeout,
- drop_last=True,
+ drop_last=False,
Do we need to change this elsewhere in train_tests/benchmark?
Should just be here, since it's a torch-dataloader-specific change.
Summary
#56343 refactored some code for torch dataloader creation but introduced a bug in the validation dataset throughput calculation: `drop_last=True` became the default setting, which caused the validation dataloader to be empty, since the validation set is small enough, and spread across enough workers, that no single full batch could be formed. This PR fixes the issue by setting `drop_last=False`.
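To make the arithmetic concrete (the numbers below are hypothetical, not taken from the release test):

```python
# A small validation set sharded evenly across many workers.
num_validation_rows = 200
num_workers = 16
batch_size = 32

rows_per_worker = num_validation_rows // num_workers      # 12 rows per shard
batches_if_drop_last = rows_per_worker // batch_size      # 0 -> empty dataloader
batches_if_keep_last = -(-rows_per_worker // batch_size)  # ceil(12 / 32) = 1

print(batches_if_drop_last, batches_if_keep_last)  # 0 1
```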