@goutamvenkat-anyscale goutamvenkat-anyscale commented Sep 29, 2025

Why are these changes needed?

The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull-based sort shuffle was running out of memory (OOM). To address this, I reduced the number of blocks to 100 in the read layer by setting override_num_blocks=100.

Related issue number

DATA-1399

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Conditionally set override_num_blocks=100 for read_parquet when using SORT_SHUFFLE_PULL_BASED in the groupby benchmark.

  • Benchmarks (nightly)
    • release/nightly_tests/dataset/groupby_benchmark.py:
      • When shuffle_strategy is SORT_SHUFFLE_PULL_BASED, call ray.data.read_parquet(..., override_num_blocks=100); otherwise leave unset.
      • Subsequent groupby(args.group_by) flow unchanged.

Written by Cursor Bugbot for commit e567de5. This will update automatically on new commits.

Signed-off-by: Goutam V. <goutam@anyscale.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the groupby benchmark to conditionally set override_num_blocks for the SORT_SHUFFLE_PULL_BASED strategy. My review focuses on improving the maintainability of this change by addressing a magic number. I've suggested replacing the hardcoded value with a named constant for better readability and easier modification in the future.
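
A minimal sketch of the conditional read described above, with the named-constant suggestion applied. The helper name `read_parquet_kwargs`, the constant name, and the use of a plain string for the strategy are assumptions for illustration; the actual benchmark script may structure this differently.

```python
from typing import Any, Dict

# The value 100 comes from the PR description; the constant name is hypothetical.
PULL_BASED_SORT_SHUFFLE_NUM_BLOCKS = 100


def read_parquet_kwargs(shuffle_strategy: str) -> Dict[str, Any]:
    """Build extra keyword arguments for the parquet read (hypothetical helper).

    Only the pull-based sort shuffle gets a fixed block count; for every
    other strategy the argument is left unset so the default applies.
    """
    if shuffle_strategy == "SORT_SHUFFLE_PULL_BASED":
        return {"override_num_blocks": PULL_BASED_SORT_SHUFFLE_NUM_BLOCKS}
    return {}


if __name__ == "__main__":
    print(read_parquet_kwargs("SORT_SHUFFLE_PULL_BASED"))
    print(read_parquet_kwargs("HASH_SHUFFLE"))
```

The returned dictionary could then be passed along as `ray.data.read_parquet(path, **read_parquet_kwargs(strategy))`, where `path` and `strategy` stand in for the benchmark's own arguments.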

@goutamvenkat-anyscale goutamvenkat-anyscale changed the title from "[WIP][Data] - Groupby benchmark - sort shuffle pull based" to "[Data] - Groupby benchmark - sort shuffle pull based" Sep 29, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale added the labels "go" (add ONLY when ready to merge, run all tests) and "data" (Ray Data-related issues) Sep 29, 2025
@aslonnie
Collaborator

ready to merge?

@goutamvenkat-anyscale
Contributor Author

ready to merge?

yes

@aslonnie aslonnie merged commit 08c0d31 into ray-project:master Sep 30, 2025
6 checks passed
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025