Conversation

@justinvyu
Contributor

Summary

#56343 refactored some code for torch dataloader creation but introduced a bug when it came to the validation dataset throughput calculation.

File "/.../runner.py", line 205, in _validate_epoch
    return {"validation/loss": total_loss.item() / num_rows}
ZeroDivisionError: float division by zero

This happened because `drop_last=True` became the default setting, which caused the validation dataloader to yield nothing: the dataset is small enough, and spread across enough workers, that no single worker could form a full batch. This PR fixes the issue by setting `drop_last=False`.

Validation dataset size = 50_000
Num train workers * num dataloader workers = 16 * 16 = 256
Num rows per dataloader worker = 50_000 // 256 = 195
Validation batch size = 256

195 < 256! Each dataloader worker operates on fewer rows than the batch size, so it cannot construct a single full batch.
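
To make the failure mode concrete, here is a minimal repro sketch (the feature width and the single-process loader are illustrative, not taken from the benchmark code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Each dataloader worker's shard holds 195 rows, fewer than the 256-row batch.
rows_per_worker = 50_000 // (16 * 16)  # 195
shard = TensorDataset(torch.zeros(rows_per_worker, 8))

broken = DataLoader(shard, batch_size=256, drop_last=True)
fixed = DataLoader(shard, batch_size=256, drop_last=False)

print(len(list(broken)))  # 0 batches -> num_rows stays 0 -> ZeroDivisionError downstream
print(len(list(fixed)))   # 1 partial batch of 195 rows
```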

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the ZeroDivisionError by setting drop_last=False for the PyTorch DataLoader. This prevents the validation dataloader from becoming empty when the dataset size isn't a multiple of the batch size distributed across workers. The addition of an assert in runner.py is also a good defensive measure to ensure the validation set is not empty. My main feedback is regarding a subtle issue that this change exposes: the calculation of num_rows for validation loss now becomes inaccurate for the last batch. I've left a specific comment on how to address this. Overall, this is a good fix for the reported bug.

      self._metrics["validation/rows_processed"].add(
          self.benchmark_config.dataloader_config.validation_batch_size
      )
+     assert num_rows > 0, "Validation dataset yielded no batches."
Contributor

Severity: high

This assertion is a good safeguard against the ZeroDivisionError. However, the change to drop_last=False in torch_dataloader_factory.py surfaces a subtle bug in the calculation of num_rows. On line 200, num_rows is incremented by the fixed validation_batch_size. With drop_last=False, the last batch may be partial, causing num_rows to be overestimated and the calculated validation/loss to be incorrect. To fix this, you should use the actual size of the batch. For example, you could change line 200 to num_rows += batch[0].size(0) (assuming batch is a tuple where the first element is the input tensor). A similar issue, already noted with a TODO, exists in _train_epoch.

Contributor

Is this a real bug?

Contributor Author

Yeah, but I already have it as a TODO in the code. It's a bit hard to get the real size of the batch (it's usually batch_size except for the last one) because the collated format is task-specific right now (image vs. torchrec training).
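
A hypothetical sketch combining the bot's `batch[0].size(0)` suggestion with a small dispatch helper for the task-specific collated formats mentioned above (all names are illustrative; this is not what the benchmark code does):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def infer_batch_size(batch) -> int:
    """Best-effort row count for a collated batch of unknown structure."""
    if isinstance(batch, torch.Tensor):
        return batch.size(0)
    if isinstance(batch, (tuple, list)) and batch:
        return infer_batch_size(batch[0])
    if isinstance(batch, dict) and batch:
        return infer_batch_size(next(iter(batch.values())))
    raise TypeError(f"Unrecognized collated batch type: {type(batch)!r}")

# With drop_last=False the final batch is partial, so count the rows actually seen:
loader = DataLoader(TensorDataset(torch.zeros(195, 8)), batch_size=64)
num_rows = sum(infer_batch_size(b) for b in loader)
print(num_rows)  # 195 (batches of 64, 64, 64, and 3)
```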

      prefetch_factor=prefetch_factor,
      timeout=timeout,
-     drop_last=True,
+     drop_last=False,
Contributor

Do we need to change this elsewhere in train_tests/benchmark?

Contributor Author

Should just be here, since this is a torch-dataloader-specific change.
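
For reference, a minimal sketch of how such a torch-specific factory might construct the validation loader after the fix; the function name and parameter plumbing are hypothetical, and only the `drop_last=False` detail comes from the diff above:

```python
from torch.utils.data import DataLoader

def build_validation_dataloader(dataset, batch_size, num_workers, prefetch_factor, timeout):
    # Keeping the trailing partial batch (drop_last=False) guarantees that a
    # small, heavily sharded validation set still yields at least one batch.
    # Note: PyTorch requires num_workers > 0 when prefetch_factor is passed.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        prefetch_factor=prefetch_factor,
        timeout=timeout,
        drop_last=False,
    )
```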

@justinvyu enabled auto-merge (squash) September 10, 2025 00:01
@github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Sep 10, 2025
@justinvyu merged commit 6ff42bf into ray-project:master Sep 10, 2025
6 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Sep 10, 2025
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025