Conversation

@justinvyu
Contributor

Summary

#56343 refactored some code for torch dataloader creation but introduced a bug when it came to the validation dataset throughput calculation.

File "/.../runner.py", line 205, in _validate_epoch
    return {"validation/loss": total_loss.item() / num_rows}
ZeroDivisionError: float division by zero

This happened because `drop_last=True` became the default setting, which caused the validation dataloader to yield nothing: the dataset is small enough, and spread across enough workers, that no single worker could form a full batch. This PR fixes the issue by setting `drop_last=False`.

Validation dataset size = 50_000
Num train workers * num dataloader workers = 16 * 16 = 256
Num rows per dataloader worker = 50_000 // 256 = 195
Validation batch size = 256

195 < 256! Each dataloader worker operates on fewer rows than the batch size, so it cannot construct a single full batch.
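
To make the failure mode concrete, here is a minimal repro sketch (the feature width and the single-process loader are illustrative, not taken from the benchmark code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Each dataloader worker's shard holds 195 rows, fewer than the 256-row batch.
rows_per_worker = 50_000 // (16 * 16)  # 195
shard = TensorDataset(torch.zeros(rows_per_worker, 8))

broken = DataLoader(shard, batch_size=256, drop_last=True)
fixed = DataLoader(shard, batch_size=256, drop_last=False)

print(len(list(broken)))  # 0 batches -> num_rows stays 0 -> ZeroDivisionError downstream
print(len(list(fixed)))   # 1 partial batch of 195 rows
```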

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the ZeroDivisionError by setting drop_last=False for the PyTorch DataLoader. This prevents the validation dataloader from becoming empty when the dataset size isn't a multiple of the batch size distributed across workers. The addition of an assert in runner.py is also a good defensive measure to ensure the validation set is not empty. My main feedback is regarding a subtle issue that this change exposes: the calculation of num_rows for validation loss now becomes inaccurate for the last batch. I've left a specific comment on how to address this. Overall, this is a good fix for the reported bug.

      self._metrics["validation/rows_processed"].add(
          self.benchmark_config.dataloader_config.validation_batch_size
      )
+     assert num_rows > 0, "Validation dataset yielded no batches."
Contributor

Severity: high

This assertion is a good safeguard against the ZeroDivisionError. However, the change to drop_last=False in torch_dataloader_factory.py surfaces a subtle bug in the calculation of num_rows. On line 200, num_rows is incremented by the fixed validation_batch_size. With drop_last=False, the last batch may be partial, causing num_rows to be overestimated and the calculated validation/loss to be incorrect. To fix this, you should use the actual size of the batch. For example, you could change line 200 to num_rows += batch[0].size(0) (assuming batch is a tuple where the first element is the input tensor). A similar issue, already noted with a TODO, exists in _train_epoch.

Contributor

Is this a real bug?

Contributor Author

Yeah, but I already have it as a TODO in the code. It's a bit hard to get the real size of the batch (it's usually batch_size except for the last one) because the collated format is task-specific right now (image vs. torchrec training).
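
A hypothetical sketch combining the bot's `batch[0].size(0)` suggestion with a small dispatch helper for the task-specific collated formats mentioned above (all names are illustrative; this is not what the benchmark code does):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def infer_batch_size(batch) -> int:
    """Best-effort row count for a collated batch of unknown structure."""
    if isinstance(batch, torch.Tensor):
        return batch.size(0)
    if isinstance(batch, (tuple, list)) and batch:
        return infer_batch_size(batch[0])
    if isinstance(batch, dict) and batch:
        return infer_batch_size(next(iter(batch.values())))
    raise TypeError(f"Unrecognized collated batch type: {type(batch)!r}")

# With drop_last=False the final batch is partial, so count the rows actually seen:
loader = DataLoader(TensorDataset(torch.zeros(195, 8)), batch_size=64)
num_rows = sum(infer_batch_size(b) for b in loader)
print(num_rows)  # 195 (batches of 64, 64, 64, and 3)
```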

      prefetch_factor=prefetch_factor,
      timeout=timeout,
-     drop_last=True,
+     drop_last=False,
Contributor

Do we need to change this elsewhere in train_tests/benchmark?

Contributor Author

Should just be here, since this is a torch-dataloader-specific change.
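
For reference, a minimal sketch of how such a torch-specific factory might construct the validation loader after the fix; the function name and parameter plumbing are hypothetical, and only the `drop_last=False` detail comes from the diff above:

```python
from torch.utils.data import DataLoader

def build_validation_dataloader(dataset, batch_size, num_workers, prefetch_factor, timeout):
    # Keeping the trailing partial batch (drop_last=False) guarantees that a
    # small, heavily sharded validation set still yields at least one batch.
    # Note: PyTorch requires num_workers > 0 when prefetch_factor is passed.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        prefetch_factor=prefetch_factor,
        timeout=timeout,
        drop_last=False,
    )
```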

@justinvyu enabled auto-merge (squash) September 10, 2025 00:01
@github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) Sep 10, 2025
@justinvyu merged commit 6ff42bf into ray-project:master Sep 10, 2025
6 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Sep 10, 2025
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025