-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Data] - Groupby benchmark - sort shuffle pull based #57014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] - Groupby benchmark - sort shuffle pull based #57014
Conversation
Signed-off-by: Goutam V. <goutam@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request updates the groupby benchmark to conditionally set override_num_blocks for the SORT_SHUFFLE_PULL_BASED strategy. My review focuses on improving the maintainability of this change by addressing a magic number. I've suggested replacing the hardcoded value with a named constant for better readability and easier modification in the future.
|
read to merge? |
Signed-off-by: Goutam V. <goutam@anyscale.com>
yes |
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer. Related issue: DATA-1399 --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer. Related issue: DATA-1399 --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Josh Kodi <joshkodi@gmail.com>
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer. Related issue: DATA-1399 --------- Signed-off-by: Goutam V. <goutam@anyscale.com>
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer. Related issue: DATA-1399 --------- Signed-off-by: Goutam V. <goutam@anyscale.com>
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer. Related issue: DATA-1399 --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
The release test (aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer. Related issue: DATA-1399 --------- Signed-off-by: Goutam V. <goutam@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
The release test (
aggregate_groups_fixed_size_sort_shuffle_pull_based_column02 column14) for pull based sort shuffle was OOMing. To address this, I reduced the number of blocks to 100 in the read layer.Related issue number
DATA-1399
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.Note
Conditionally set override_num_blocks=100 for read_parquet when using SORT_SHUFFLE_PULL_BASED in the groupby benchmark.
release/nightly_tests/dataset/groupby_benchmark.py:shuffle_strategyisSORT_SHUFFLE_PULL_BASED, callray.data.read_parquet(..., override_num_blocks=100); otherwise leave unset.groupby(args.group_by)flow unchanged.Written by Cursor Bugbot for commit e567de5. This will update automatically on new commits. Configure here.