[Data] Switched default shuffle strategy from sort-based to hash-based #55510

alexeykudinkin · 2025-08-12T00:45:41Z

Why are these changes needed?

Hash-based shuffle has been available for some time now and has shown clear performance advantages in our internal benchmarks and tests.

Therefore, we're switching the default shuffle strategy from the existing (legacy) range-sort-based one to hash shuffle.
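
To make the distinction between the two strategies concrete, here is a minimal sketch of hash partitioning versus range partitioning. It is not Ray Data's implementation: the function names, the use of `zlib.crc32` as a stable hash, and the sample boundaries are all assumptions chosen for illustration.

```python
import bisect
import zlib
from collections import defaultdict


def hash_partition(keys, num_partitions):
    """Assign each key to a partition by hashing it (illustrative only)."""
    parts = defaultdict(list)
    for key in keys:
        # crc32 is deterministic across processes, unlike the built-in hash()
        # for strings, so equal keys always land in the same partition.
        idx = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        parts[idx].append(key)
    return dict(parts)


def range_partition(keys, boundaries):
    """Assign each key to a partition by comparing it to sorted boundaries.

    A range-sort shuffle has to know the boundaries up front (typically by
    sampling the data), which a hash-based shuffle does not need.
    """
    parts = defaultdict(list)
    for key in keys:
        idx = bisect.bisect_right(boundaries, key)
        parts[idx].append(key)
    return dict(parts)


if __name__ == "__main__":
    print(hash_partition(["a", "b", "c", "a", "d"], num_partitions=3))
    print(range_partition([1, 5, 10, 15], boundaries=[5, 10]))
```

The hash variant needs only the partition count, while the range variant also produces globally ordered partitions, which is why sorting workloads still rely on it.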

Related issue number

Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

gemini-code-assist

Code Review

This pull request changes the default shuffle strategy in Ray Data from a sort-based approach to a hash-based one. This is a significant change to the default behavior that could have wide-ranging performance implications. While reviewing this change, I identified a critical bug that would cause a crash if a user tried to configure the shuffle strategy using the RAY_DATA_DEFAULT_SHUFFLE_STRATEGY environment variable. I've provided a fix for this issue in my review comments.
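
The thread does not show the bug itself or the proposed fix. As a hedged illustration of the general problem, the sketch below parses an environment variable into an enum without crashing on a missing value and with a readable error on an invalid one. The `ShuffleStrategy` enum here is a hypothetical stand-in (its members and values are assumptions, not copied from `python/ray/data/context.py`), as is the `resolve_shuffle_strategy` helper; only the variable name `RAY_DATA_DEFAULT_SHUFFLE_STRATEGY` comes from the review.

```python
import os
from enum import Enum


class ShuffleStrategy(str, Enum):
    """Hypothetical stand-in for the real enum; members are assumptions."""

    SORT_SHUFFLE_PULL_BASED = "sort_shuffle_pull_based"
    HASH_SHUFFLE = "hash_shuffle"


ENV_VAR = "RAY_DATA_DEFAULT_SHUFFLE_STRATEGY"


def resolve_shuffle_strategy(environ=None, default=ShuffleStrategy.HASH_SHUFFLE):
    """Read the strategy from the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None or not raw.strip():
        # Variable unset or empty: use the default instead of failing.
        return default
    try:
        # Look members up by name, case-insensitively.
        return ShuffleStrategy[raw.strip().upper()]
    except KeyError:
        valid = ", ".join(member.name for member in ShuffleStrategy)
        raise ValueError(
            f"Invalid {ENV_VAR}={raw!r}; expected one of: {valid}"
        ) from None


if __name__ == "__main__":
    print(resolve_shuffle_strategy({}))
    print(resolve_shuffle_strategy({ENV_VAR: "sort_shuffle_pull_based"}))
```

Passing the environment as a mapping keeps the helper easy to unit-test without mutating `os.environ`.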

python/ray/data/context.py

goutamvenkat-anyscale · 2025-08-14T08:26:38Z

python/ray/data/dataset.py

-        if num_partitions is not None and num_partitions <= 0:
+        if num_partitions is None:
+            # TODO replace w/ size-based estimate
+            num_partitions = self._logical_plan.dag.estimated_num_outputs()


So this is a heuristic for the number of blocks output by this groupby step?
A couple of questions/concerns:

- I'm a little worried this might result in extreme partition counts (too small or too large), especially if a repartition precedes this operation.

- It seems estimated_num_outputs() can return None if _num_outputs is not set. What happens in that case?
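
One way to address both concerns is to guard the estimate before using it. The sketch below is only an illustration of that idea, not what the PR implements: `resolve_num_partitions`, the fallback of 200, and the clamp bounds are all hypothetical values picked for the example.

```python
from typing import Optional


def resolve_num_partitions(
    requested: Optional[int],
    estimated: Optional[int],
    *,
    fallback: int = 200,   # hypothetical default when no estimate exists
    lower: int = 1,        # hypothetical lower clamp
    upper: int = 10_000,   # hypothetical upper clamp
) -> int:
    """Pick a partition count, tolerating a missing or extreme estimate."""
    if requested is not None:
        # An explicit user value wins, but must be positive.
        if requested <= 0:
            raise ValueError(f"num_partitions must be positive, got {requested}")
        return requested
    if estimated is None:
        # The upstream operator could not estimate its output count.
        return fallback
    # Clamp so that a preceding repartition cannot push the value to extremes.
    return max(lower, min(upper, estimated))


if __name__ == "__main__":
    print(resolve_num_partitions(None, None))    # no estimate available
    print(resolve_num_partitions(None, 50_000))  # estimate clamped
    print(resolve_num_partitions(8, 50_000))     # explicit value wins
```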

goutamvenkat-anyscale · 2025-08-14T08:31:27Z

python/ray/data/_internal/execution/operators/hash_shuffle.py

+        #   - 5% of total available CPUs but
+        #   - No more than 4 CPUs per aggregator
+        #
+        return min(4.0, total_available_cluster_resources.cpu * 0.05 / num_aggregators)


Just curious how we arrived at these defaults. Is there a simulation?
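
For reference, the quoted expression can be evaluated in isolation to see how the two limits interact. The wrapper below reproduces only the `min(4.0, cpu * 0.05 / num_aggregators)` line from the diff; the function name, the input validation, and the sample cluster sizes are assumptions for illustration.

```python
def cpus_per_aggregator(total_cluster_cpus: float, num_aggregators: int) -> float:
    """Evaluate the allocation rule quoted in the diff.

    All aggregators together get 5% of the cluster's CPUs, split evenly,
    and each aggregator is capped at 4 CPUs.
    """
    if num_aggregators <= 0:
        raise ValueError("num_aggregators must be positive")
    return min(4.0, total_cluster_cpus * 0.05 / num_aggregators)


if __name__ == "__main__":
    # (total CPUs, number of aggregators) pairs chosen only as examples.
    for cpus, aggregators in [(16, 16), (64, 8), (1000, 2)]:
        share = cpus_per_aggregator(cpus, aggregators)
        print(f"{cpus} CPUs, {aggregators} aggregators -> {share:.2f} CPUs each")
```

On small clusters the 5% budget dominates and each aggregator gets a fraction of a CPU; the 4-CPU cap only applies when the cluster is large relative to the number of aggregators.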

github-actions · 2025-08-29T12:25:36Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

alexeykudinkin requested a review from a team as a code owner August 12, 2025 00:45

alexeykudinkin added the go label (add ONLY when ready to merge, run all tests) Aug 12, 2025

alexeykudinkin requested a review from goutamvenkat-anyscale August 12, 2025 00:45

alexeykudinkin changed the title ~~[Data] Swapped default shuffle strategy from sort-based to hash-based~~ [Data] Switched default shuffle strategy from sort-based to hash-based Aug 12, 2025

gemini-code-assist bot reviewed Aug 12, 2025

goutamvenkat-anyscale approved these changes Aug 12, 2025

alexeykudinkin enabled auto-merge (squash) August 12, 2025 00:58

github-actions bot disabled auto-merge August 12, 2025 23:13

alexeykudinkin force-pushed the ak/hsh-shfl-def branch from 0e439c5 to 4473775 Compare August 13, 2025 22:22

goutamvenkat-anyscale reviewed Aug 14, 2025

ray-gardener bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Aug 15, 2025

alexeykudinkin mentioned this pull request Sep 18, 2025

[fix][data]Ray Data's shuffle operation should use hash-shuffle as the default algorithm #56712

Closed

8 tasks

alexeykudinkin force-pushed the ak/hsh-shfl-def branch 2 times, most recently from dae14de to b197f5f Compare September 19, 2025 02:57

alexeykudinkin enabled auto-merge (squash) September 19, 2025 02:58

github-actions bot disabled auto-merge September 19, 2025 06:25

alexeykudinkin enabled auto-merge (squash) September 19, 2025 06:26

github-actions bot disabled auto-merge September 19, 2025 06:46

alexeykudinkin force-pushed the ak/hsh-shfl-def branch from 80765dd to 3619ede Compare September 19, 2025 18:49

alexeykudinkin enabled auto-merge (squash) September 19, 2025 18:49

github-actions bot disabled auto-merge September 19, 2025 18:49

alexeykudinkin added 2 commits September 19, 2025 19:05

Swapped default shuffle strategy from sort-based to hash-based

e9e8469

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revisited resource allocation for hash-shuffle operation to fallback to `ray.cluster_resources()` when no cluster-configuration is available

153ef43

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

github-actions bot disabled auto-merge September 20, 2025 05:59

Updated test fixture

31cef95

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin enabled auto-merge (squash) September 20, 2025 19:44

alexeykudinkin added 4 commits September 20, 2025 15:20

Fixed derivation of max num of hash-shuffle aggregators to be bounded by # of CPUs

da42fa6

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Improved dataset estimation in cases when number of outputs of the prior stage is known

16d6f16

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added new test; Updated fixtures

53ac624

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

3073e6b

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin disabled auto-merge September 20, 2025 22:48

alexeykudinkin enabled auto-merge (squash) September 20, 2025 22:48

Reverted estimation based on the number of outputs

f902b2f

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

github-actions bot disabled auto-merge September 21, 2025 04:49

alexeykudinkin enabled auto-merge (squash) September 21, 2025 04:49

alexeykudinkin added 5 commits September 21, 2025 09:22

Improving logging

33c47d4

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Lowered default fallback to 1Gb

515e01e

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Revised shuffle task memory allocation

dadfa44

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

f2df2c5

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated fixtures

087ec07

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

github-actions bot disabled auto-merge September 21, 2025 16:39

alexeykudinkin merged commit 28fb6c3 into ray-project:master Sep 21, 2025
6 checks passed

alexeykudinkin linked an issue Sep 22, 2025 that may be closed by this pull request

[Data] Make hash-shuffle default shuffle algorithm #56704

Closed
