
Conversation

@iamjustinhsu
Contributor

Description

Fixes #58603. map_groups assumes that each partition fits in one block. However, my PR broke this behavior by splitting blocks during hash shuffle finalization. To address this, I added a data context variable that defaults to False, in which case blocks may still be split; if set to True, the partition-to-block mapping is preserved.
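The flag described above can be sketched as follows. This is a minimal illustrative model, not Ray's actual implementation: the class and attribute names (`Ctx`, `preserve_finalize_blocks`, `finalize`) are stand-ins, and blocks are modeled as plain lists of rows.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ctx:
    """Stand-in for a data context (illustrative names, not Ray's API)."""
    target_max_block_size: Optional[int] = 128
    preserve_finalize_blocks: bool = False


def finalize(partition: List[int], ctx: Ctx) -> List[List[int]]:
    """Emit a finalized partition as one or more blocks.

    When the preserve flag is False, the partition may be split into
    chunks of at most target_max_block_size rows; when True (or when
    target_max_block_size is None), the whole partition stays in one
    block, so a downstream map_groups sees one block per partition.
    """
    if ctx.preserve_finalize_blocks or ctx.target_max_block_size is None:
        return [partition]
    n = ctx.target_max_block_size
    return [partition[i:i + n] for i in range(0, len(partition), n)]
```

For example, a 5-row partition with `target_max_block_size=2` splits into three blocks by default, but stays whole when the preserve flag is set.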


Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner November 25, 2025 23:29
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The PR introduces a new context flag _preserve_hash_shuffle_finalize_blocks to prevent block breakdown during hash shuffle finalization, which is necessary for map_groups. The overall approach is sound, but I've found a critical logic issue in the implementation and some inconsistencies in the documentation and tests that need to be addressed.

  1. There's a logical error in hash_shuffle.py where an or is used instead of an and, which could cause the fix to not work as intended when target_max_block_size is set.
  2. The documentation and comments for the new flag in context.py contradict the implementation's behavior for map_groups.
  3. The new test for map_groups has a misleading name and comments.

I've left specific comments with suggestions for these points. Once these are addressed, the PR should be in good shape.
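The `or`-vs-`and` issue in point 1 reduces to the following. This is an illustrative reduction of the condition, not the actual code from hash_shuffle.py: splitting should happen only when a max block size is set AND splitting is allowed.

```python
from typing import Optional


def should_split(target_max_block_size: Optional[int],
                 disallow_block_splitting: bool) -> bool:
    """Correct: both conditions must hold for splitting to occur."""
    return target_max_block_size is not None and not disallow_block_splitting


def buggy_should_split(target_max_block_size: Optional[int],
                       disallow_block_splitting: bool) -> bool:
    """With `or`, splitting happens whenever target_max_block_size is set,
    even when splitting is explicitly disallowed, defeating the fix."""
    return target_max_block_size is not None or not disallow_block_splitting
```

With `target_max_block_size=128` and splitting disallowed, the correct predicate returns False while the buggy one returns True.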

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
aggregator_ray_remote_args_override: Optional[Dict[str, Any]] = None,
shuffle_progress_bar_name: Optional[str] = None,
finalize_progress_bar_name: Optional[str] = None,
preserve_finalize_blocks: bool = False,

disallow_block_splitting

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

@srinathk10 srinathk10 left a comment


LGTM

ctx = DataContext.get_current()
# Very small to force splitting if enabled
ctx.target_max_block_size = 1
yield

Bug: Test fixture doesn't restore modified DataContext value

The setup fixture modifies ctx.target_max_block_size = 1 but never restores the original value after the test. The established pattern in conftest.py for such fixtures is to save the original value before modifying, then restore it after yield. Without restoration, the modified value of 1 persists and could cause test pollution affecting subsequent tests that depend on the default target_max_block_size value.
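The save/restore pattern the comment describes can be sketched like this. The stand-in context class and the fixture name are illustrative (Ray's real `DataContext` is not imported here); the point is capturing the original value before mutating shared state and restoring it in a `finally` block so the change cannot leak into later tests.

```python
from contextlib import contextmanager


class FakeDataContext:
    """Stand-in for Ray's DataContext, used only to illustrate the pattern."""
    _current = None

    def __init__(self):
        self.target_max_block_size = 128  # assumed default for the sketch

    @classmethod
    def get_current(cls):
        if cls._current is None:
            cls._current = cls()
        return cls._current


@contextmanager
def tiny_target_max_block_size():
    # Save the original value, mutate, then restore after the test body
    # runs -- even if the test raises -- so no pollution reaches later tests.
    ctx = FakeDataContext.get_current()
    original = ctx.target_max_block_size
    ctx.target_max_block_size = 1  # very small to force splitting if enabled
    try:
        yield ctx
    finally:
        ctx.target_max_block_size = original
```

A pytest `yield` fixture follows the same shape: the code before `yield` is setup, the code after (or in `finally`) is teardown.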


@ray-gardener ray-gardener bot added the data Ray Data-related issues label Nov 26, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@srinathk10 srinathk10 added the go add ONLY when ready to merge, run all tests label Nov 26, 2025
Comment on lines 1582 to 1596
target_max_block_size = self._data_context.target_max_block_size
# None means the user wants to preserve the block distribution,
# so we do not break the block down further.
if target_max_block_size is not None:
# Also check _disallow_block_splitting parameter.
if target_max_block_size is not None and not self._disallow_block_splitting:

Instead, pass target_max_block_size as None when block splitting is not allowed.

Comment on lines 1233 to 1235
# NOTE: This is set to True because num_partitions (aka, # of output blocks)
# must be preserved.
disallow_block_splitting=True,

Suggested change
# NOTE: This is set to True because num_partitions (aka, # of output blocks)
# must be preserved.
disallow_block_splitting=True,
# NOTE: In cases like ``groupby`` blocks can't be split as this might violate an invariant that all rows
# with the same key are in the same group (block)
disallow_block_splitting=True,
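The invariant the suggested comment names can be demonstrated with a toy example (plain tuples stand in for Ray blocks; `groups_per_block` is an illustrative helper, not Ray code): splitting a finalized partition at an arbitrary row boundary can place rows with the same key into different blocks, so a per-block map_groups UDF would see the same group twice.

```python
from itertools import groupby
from typing import List, Tuple

# One hash partition: all rows for keys 'a' and 'b', sorted by key.
partition: List[Tuple[str, int]] = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]


def groups_per_block(blocks: List[List[Tuple[str, int]]]) -> List[List[str]]:
    """Which group keys appear in each block."""
    return [[k for k, _ in groupby(b, key=lambda r: r[0])] for b in blocks]


# Unsplit: each key's rows live in exactly one block.
whole = groups_per_block([partition])
# Split mid-group: key 'b' now spans two blocks.
split = groups_per_block([partition[:3], partition[3:]])
```

Here `whole` is `[['a', 'b']]` while `split` is `[['a', 'b'], ['b']]`: the group `'b'` appears in two blocks, which is exactly what disallowing splitting prevents.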

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Comment on lines 1309 to 1311
self._data_context = data_context.copy()
if disallow_block_splitting:
self._data_context.target_max_block_size = None

Sorry, I should have been clearer:

  • Don't patch the DataContext
  • Instead, pass target_max_block_size directly into the aggregator (to avoid two overlapping configs: the DataContext and the disallow flag)
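The suggested shape can be sketched as a single resolution point, computed once and handed to the aggregator, instead of mutating a copied context. The function name is illustrative, not Ray's API:

```python
from typing import Optional


def effective_target_max_block_size(
    ctx_target_max_block_size: Optional[int],
    disallow_block_splitting: bool,
) -> Optional[int]:
    """Resolve the two overlapping settings into one value.

    None already means "do not split" downstream, so it wins whenever
    splitting is disallowed; otherwise the context's value passes through.
    """
    if disallow_block_splitting:
        return None
    return ctx_target_max_block_size
```

Downstream code then checks only one value, with no risk of the flag and the context disagreeing.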

…llow_block_splitting=True

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@alexeykudinkin alexeykudinkin merged commit 94ef5ff into ray-project:master Dec 2, 2025
6 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/preserve-block-size branch December 2, 2025 21:02
bveeramani added a commit that referenced this pull request Dec 8, 2025
## Description

`test_preserve_hash_shuffle_blocks` has been flaking consistently. To
mitigate the flakiness, this PR bumps the test size from "small" to
"medium".

```
[2025-12-06T07:06:32Z] //python/ray/data:test_preserve_hash_shuffle_blocks                     TIMEOUT in 3 out of 3 in 63.4s
--
[2025-12-06T07:06:32Z]   Stats over 3 runs: max = 63.4s, min = 60.1s, avg = 62.3s, dev = 1.6s
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_1.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_2.log
[2025-12-06T07:06:32Z]
```

See also #58988

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
bveeramani added a commit that referenced this pull request Dec 12, 2025
## Description

`test_preserve_hash_shuffle_blocks` has been flaking consistently. To
mitigate the flakiness, this PR bumps the test size from "small" to
"medium".

This is a follow-up to #59256,
which accidentally bumped the wrong test.

```
[2025-12-06T07:06:32Z] //python/ray/data:test_preserve_hash_shuffle_blocks                     TIMEOUT in 3 out of 3 in 63.4s
--
[2025-12-06T07:06:32Z]   Stats over 3 runs: max = 63.4s, min = 60.1s, avg = 62.3s, dev = 1.6s
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_1.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_2.log
[2025-12-06T07:06:32Z]
```

See also #58988

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
…ect#59399)

## Description

`test_preserve_hash_shuffle_blocks` has been flaking consistently. To
mitigate the flakiness, this PR bumps the test size from "small" to
"medium".

This is a follow-up to ray-project#59256,
which accidentally bumped the wrong test.

```
[2025-12-06T07:06:32Z] //python/ray/data:test_preserve_hash_shuffle_blocks                     TIMEOUT in 3 out of 3 in 63.4s
--
[2025-12-06T07:06:32Z]   Stats over 3 runs: max = 63.4s, min = 60.1s, avg = 62.3s, dev = 1.6s
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_1.log
[2025-12-06T07:06:32Z]   /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray/bazel-out/k8-opt/testlogs/python/ray/data/test_preserve_hash_shuffle_blocks/test_attempts/attempt_2.log
[2025-12-06T07:06:32Z]
```

See also ray-project#58988

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants