[Data] Sample finalized partitions randomly to avoid lensing finalization on a single node #58456
Conversation
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Code Review
This pull request refactors the finalization logic in the hash shuffle operator to randomly sample partitions instead of processing them sequentially. This is a valuable change that should help distribute the finalization load more evenly across the cluster and avoid potential node hotspots. My review includes one suggestion to optimize the new sampling logic for better performance.
    target_partition_ids = random.sample(
        list(self._pending_finalization_partition_ids), next_batch_size
    )
The random.sample function can operate directly on sets, so converting self._pending_finalization_partition_ids to a list is unnecessary. Removing the list() conversion will improve performance by avoiding the creation of a new list in each call, which can be expensive if the number of pending partitions is large.
Suggested change:
-    target_partition_ids = random.sample(
-        list(self._pending_finalization_partition_ids), next_batch_size
-    )
+    target_partition_ids = random.sample(
+        self._pending_finalization_partition_ids, next_batch_size
+    )
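One caveat worth verifying before applying the suggestion: `random.sample` accepted sets only with a `DeprecationWarning` in Python 3.9–3.10, and raises `TypeError` for sets from Python 3.11 onward, so the explicit conversion is actually required on newer interpreters. A minimal sketch (the `pending` set below is a stand-in for `_pending_finalization_partition_ids`):

```python
import random

pending = {3, 7, 11, 19, 42}  # stand-in for _pending_finalization_partition_ids
next_batch_size = 3

# Since Python 3.11, random.sample() requires a sequence; passing a set
# raises TypeError. Converting explicitly keeps the call portable.
batch = random.sample(list(pending), next_batch_size)

assert len(batch) == next_batch_size
assert set(batch) <= pending
```

If deterministic iteration order matters for reproducibility, `sorted(pending)` would be a safer conversion than `list()`, at the cost of an O(n log n) sort per call.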
    # and avoid the "sliding lens" effect where we finalize a batch of
    # N *adjacent* partitions that may be co-located on the same node:
    #
    # - Adjacent partitions i and i+1 are handled by adjacent
Wait, is this true? If modulo N = num actors, then partitions i and i+1 must necessarily be on different actors. Oh wait, nvm, I see what you're saying.
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
    # - Adjacent aggregators have high likelihood of running on the
    #   same node (when num aggregators > num nodes)
Is this necessarily true? Your default strategy is SPREAD, and each aggregator is scheduled with the same amount of resources, so aggregators i and i+1 have as much of a chance of landing on the same node as aggregators i and j. Please correct my assumptions if I'm wrong.
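To make the co-location concern concrete, here is a toy model of the discussion above (all names are illustrative, not the operator's actual code). It combines the round-robin membership `j = i % num_aggregators` from the PR description with a pessimistic placement assumption in which aggregators are packed onto nodes in launch order; under that assumption, a window of adjacent partitions concentrates on one node:

```python
num_aggregators = 8
num_nodes = 2

def aggregator_for(partition_id: int) -> int:
    # Round-robin membership, as described in the PR: j = i % num_aggregators
    return partition_id % num_aggregators

def node_for(aggregator_id: int) -> int:
    # Pessimistic placement assumed for illustration: aggregators packed onto
    # nodes in launch order, so aggregators 0..3 share node 0 and 4..7 node 1.
    # (With a SPREAD strategy, as the comment notes, this need not hold.)
    return aggregator_id * num_nodes // num_aggregators

# A sliding window of 4 *adjacent* partitions...
window = [0, 1, 2, 3]
nodes = {node_for(aggregator_for(p)) for p in window}
print(nodes)  # -> {0}: all four finalizations land on a single node
```

Whether the packed-placement assumption holds in practice is exactly the question raised in this thread; the sketch only shows why adjacency is harmful *if* it does.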
    #
    # NOTE: This doesn't affect determinism, since this only impacts order
    # of finalization (hence not required to be seeded)
    target_partition_ids = random.sample(
So wouldn't a better strategy be to check how much each aggregator actor is currently consuming relative to its node's capacity, and schedule the finalization only where there's remaining capacity?
I just find the randomization strategy harder to reason about in this case.
Also, it's a function of partition size, so ideally, if we could get metadata about the partition before scheduling finalize(), that would be even better.
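The capacity-aware alternative proposed in this comment could look roughly like the sketch below. This is entirely hypothetical (the helper names and the greedy policy are not part of the PR); it just illustrates preferring partitions whose target node currently has the most headroom:

```python
def pick_finalization_batch(pending, batch_size, node_load, node_capacity, node_of):
    """Prefer partitions whose target node has the most spare capacity.

    Hypothetical sketch of the capacity-aware strategy suggested in review;
    `node_of` maps a partition id to the node its aggregator runs on.
    """
    def spare(partition_id):
        node = node_of(partition_id)
        return node_capacity[node] - node_load[node]

    # Greedily take the partitions with the most headroom on their node.
    return sorted(pending, key=spare, reverse=True)[:batch_size]

# Node 0 is nearly full (3 of 4 slots used), node 1 is idle, so the
# batch prefers partitions 2 and 3, which finalize on node 1.
batch = pick_finalization_batch(
    pending=[0, 1, 2, 3],
    batch_size=2,
    node_load={0: 3, 1: 0},
    node_capacity={0: 4, 1: 4},
    node_of=lambda p: 0 if p < 2 else 1,
)
print(batch)  # -> [2, 3]
```

A real implementation would need live per-node metrics (and, per the follow-up comment, partition-size metadata), which is likely why the simpler uniform sampling was chosen here.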
[Data] Sample finalized partitions randomly to avoid lensing finalization on a single node (ray-project#58456)

Currently, finalization is scheduled in batches sequentially, i.e. a batch of N adjacent partitions is finalized at once (in a sliding window). This creates a lensing effect since:
1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators)
2. Adjacent aggregators have a high likelihood of getting scheduled on the same node (due to being launched at about the same time, in sequence)

To address that, this change applies random sampling when choosing the next partitions to finalize, so that partitions are chosen uniformly, reducing concurrent finalization of adjacent partitions.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
Description
Currently, finalization is scheduled in batches sequentially, i.e. a batch of N adjacent partitions is finalized at once (in a sliding window).
This creates a lensing effect since:
1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators)
2. Adjacent aggregators have a high likelihood of getting scheduled on the same node (due to being launched at about the same time, in sequence)

To address that, this change applies random sampling when choosing the next partitions to finalize, so that partitions are chosen uniformly, reducing concurrent finalization of adjacent partitions.
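The before/after behavior described here can be sketched as follows (simplified; the function names are illustrative, not the actual operator code):

```python
import random

def next_batch_sequential(pending: list, batch_size: int) -> list:
    # Old behavior: a sliding window of N adjacent partition ids.
    return pending[:batch_size]

def next_batch_random(pending: list, batch_size: int) -> list:
    # New behavior: sample uniformly, so adjacent partitions (which map to
    # adjacent, possibly co-located aggregators) rarely finalize together.
    return random.sample(pending, batch_size)

pending = list(range(100))
print(next_batch_sequential(pending, 4))  # [0, 1, 2, 3] -- adjacent
print(next_batch_random(pending, 4))      # e.g. [17, 83, 4, 56] -- spread out
```

Note that, as the in-code comment in the diff states, this only changes the *order* of finalization, not the result, so determinism of the shuffle output is preserved and the sampler does not need to be seeded.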