Skip to content

Conversation

@alexeykudinkin
Copy link
Contributor

@alexeykudinkin alexeykudinkin commented Aug 12, 2025

Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing (legacy) range-sort based one to a hash-shuffle.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner August 12, 2025 00:45
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Aug 12, 2025
@alexeykudinkin alexeykudinkin changed the title [Data] Swapped default shuffle strategy from sort-based to hash-based [Data] Switched default shuffle strategy from sort-based to hash-based Aug 12, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request changes the default shuffle strategy in Ray Data from a sort-based approach to a hash-based one. This is a significant change to the default behavior that could have wide-ranging performance implications. While reviewing this change, I identified a critical bug that would cause a crash if a user tried to configure the shuffle strategy using the RAY_DATA_DEFAULT_SHUFFLE_STRATEGY environment variable. I've provided a fix for this issue in my review comments.

@alexeykudinkin alexeykudinkin enabled auto-merge (squash) August 12, 2025 00:58
@github-actions github-actions bot disabled auto-merge August 12, 2025 23:13
if num_partitions is not None and num_partitions <= 0:
if num_partitions is None:
# TODO replace w/ size-based estimate
num_partitions = self._logical_plan.dag.estimated_num_outputs()
Copy link
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a heuristic of the number of blocks outputted by this groupby step?
A couple of questions/concerns

  1. I'm a little worried this might result in extreme partition values (too small or too large, especially if a repartition precedes this operation)
  2. It seems estimated_num_outputs() can return None if _num_outputs is not set. What happens in this case?

# - 5% of total available CPUs but
# - No more than 4 CPUs per aggregator
#
return min(4.0, total_available_cluster_resources.cpu * 0.05 / num_aggregators)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious how we arrived at these defaults. Is there a simulation?

@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues labels Aug 15, 2025
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added stale The issue is stale. It will be closed within 7 days unless there are further conversation unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Aug 29, 2025
@alexeykudinkin alexeykudinkin force-pushed the ak/hsh-shfl-def branch 2 times, most recently from dae14de to b197f5f Compare September 19, 2025 02:57
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 19, 2025 02:58
@github-actions github-actions bot disabled auto-merge September 19, 2025 06:25
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 19, 2025 06:26
@github-actions github-actions bot disabled auto-merge September 19, 2025 06:46
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 19, 2025 18:49
@github-actions github-actions bot disabled auto-merge September 19, 2025 18:49
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…to `ray.cluster_resources()` when no cluster-configuration is available

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@github-actions github-actions bot disabled auto-merge September 20, 2025 05:59
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 20, 2025 19:44
cursor[bot]

This comment was marked as outdated.

… by # of CPUs

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ior stage is known

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Updated fixtures

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 20, 2025 22:48
cursor[bot]

This comment was marked as outdated.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@github-actions github-actions bot disabled auto-merge September 21, 2025 04:49
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 21, 2025 04:49
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@github-actions github-actions bot disabled auto-merge September 21, 2025 16:39
@alexeykudinkin alexeykudinkin merged commit 28fb6c3 into ray-project:master Sep 21, 2025
6 checks passed
@alexeykudinkin alexeykudinkin linked an issue Sep 22, 2025 that may be closed by this pull request
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
ray-project#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Sep 24, 2025
#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
ray-project#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
ray-project#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
ray-project#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
ray-project#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
ray-project#55510)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Hash-based shuffle has been around for some time now bringing clear
performance advantages in our internal benchmarks and tests.

Therefore we're switching default shuffle-strategy from existing
(legacy) range-sort based one to a hash-shuffle.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues docs An issue or change related to documentation go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Make hash-shuffle default shuffle algorithm

2 participants