[Data] - Groupby benchmark - sort shuffle pull based #57014

goutamvenkat-anyscale · 2025-09-29T20:12:51Z

Why are these changes needed?

The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull-based sort shuffle was running out of memory (OOM). To address this, I reduced the number of blocks to 100 in the read layer by passing override_num_blocks=100 to read_parquet.

Related issue number

DATA-1399

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Note

Conditionally set override_num_blocks=100 for read_parquet when using SORT_SHUFFLE_PULL_BASED in the groupby benchmark.

Benchmarks (nightly)
- release/nightly_tests/dataset/groupby_benchmark.py:
  - When shuffle_strategy is SORT_SHUFFLE_PULL_BASED, call ray.data.read_parquet(..., override_num_blocks=100); otherwise leave unset.
  - Subsequent groupby(args.group_by) flow unchanged.
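The conditional read described above could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the helper names `read_parquet_overrides` and `build_grouped_dataset` are hypothetical, and the strategy is compared as a plain string here, whereas the real script may use an enum value. The only Ray calls used are `ray.data.read_parquet` (with its `override_num_blocks` parameter) and `Dataset.groupby`.

```python
from typing import Any, Dict, Optional


def read_parquet_overrides(shuffle_strategy: Optional[str]) -> Dict[str, Any]:
    """Return extra keyword arguments for ray.data.read_parquet.

    Hypothetical helper: only the pull-based sort shuffle gets a fixed
    block count; every other strategy leaves the argument unset.
    """
    if shuffle_strategy == "SORT_SHUFFLE_PULL_BASED":
        return {"override_num_blocks": 100}
    return {}


def build_grouped_dataset(path: str, group_by: str, shuffle_strategy: Optional[str]):
    """Read the Parquet input and group it; the groupby flow is unchanged."""
    import ray  # imported lazily so the helper above works without Ray installed

    ds = ray.data.read_parquet(path, **read_parquet_overrides(shuffle_strategy))
    return ds.groupby(group_by)
```

Building the keyword arguments separately keeps the `read_parquet` call identical for all strategies, so the default block-count behavior is untouched unless the pull-based strategy is selected.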

Written by Cursor Bugbot for commit e567de5. This will update automatically on new commits.

Signed-off-by: Goutam V. <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request updates the groupby benchmark to conditionally set override_num_blocks for the SORT_SHUFFLE_PULL_BASED strategy. My review focuses on improving the maintainability of this change by addressing a magic number. I've suggested replacing the hardcoded value with a named constant for better readability and easier modification in the future.
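The reviewer's suggestion, replacing the hardcoded 100 with a named constant, might look something like the sketch below. The constant name `NUM_BLOCKS_FOR_PULL_BASED_SORT_SHUFFLE` and the function `pick_override_num_blocks` are assumptions made for illustration; the thread does not show the exact suggested code or whether it was adopted.

```python
from typing import Optional

# Hypothetical constant name; the value 100 comes from the PR description.
NUM_BLOCKS_FOR_PULL_BASED_SORT_SHUFFLE = 100


def pick_override_num_blocks(shuffle_strategy: str) -> Optional[int]:
    """Return the block-count override for the given strategy, or None.

    None matches the default of read_parquet's override_num_blocks
    parameter, so passing it is equivalent to leaving the argument unset.
    """
    if shuffle_strategy == "SORT_SHUFFLE_PULL_BASED":
        return NUM_BLOCKS_FOR_PULL_BASED_SORT_SHUFFLE
    return None
```

With this shape, the value can be tuned in one place if the release test's memory profile changes.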

release/nightly_tests/dataset/groupby_benchmark.py

aslonnie · 2025-09-30T03:39:15Z

ready to merge?

goutamvenkat-anyscale · 2025-09-30T05:47:54Z

> ready to merge?

yes

The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer.

Related issue: DATA-1399

Signed-off-by: Goutam V. <goutam@anyscale.com>

[WIP][Data] - Groupby benchmark - sort shuffle pull based

564d354

Signed-off-by: Goutam V. <goutam@anyscale.com>

gemini-code-assist bot reviewed Sep 29, 2025

release/nightly_tests/dataset/groupby_benchmark.py (resolved)

This comment was marked as outdated.

goutamvenkat-anyscale changed the title ~~[WIP][Data] - Groupby benchmark - sort shuffle pull based~~ [Data] - Groupby benchmark - sort shuffle pull based Sep 29, 2025

Fix bug

a17c862

Signed-off-by: Goutam V. <goutam@anyscale.com>

goutamvenkat-anyscale added the go (add ONLY when ready to merge, run all tests) and data (Ray Data-related issues) labels Sep 29, 2025

alexeykudinkin reviewed Sep 30, 2025

release/nightly_tests/dataset/groupby_benchmark.py (resolved)

alexeykudinkin approved these changes Sep 30, 2025

TODO comment

e567de5

Signed-off-by: Goutam V. <goutam@anyscale.com>

aslonnie merged commit 08c0d31 into ray-project:master Sep 30, 2025
6 checks passed
