@alexeykudinkin
Contributor

Why are these changes needed?

Cherry-pick of #57572

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin requested a review from a team as a code owner on October 8, 2025, 22:12
@alexeykudinkin added the "go" label (add ONLY when ready to merge, run all tests) on Oct 8, 2025
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to improve the robustness of the hash shuffle operation by enabling indefinite retries for its constituent tasks. The changes correctly apply max_retries=-1 for the shuffle map tasks and max_task_retries=-1 for the aggregator actor tasks, which is a solid approach. However, I've identified one high-severity issue where a fixed timeout on a ray.get() call could undermine the benefit of indefinite retries. Please see my detailed comment below.
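
To illustrate the concern about the fixed timeout, here is a minimal standard-library sketch. It is not Ray code and not the code in this PR. It assumes a hypothetical flaky task that a retry loop keeps re-running until it succeeds (a stand-in for max_retries=-1), and a caller that waits on the result either with a fixed timeout or with none (a stand-in for a ray.get() call with and without a timeout). All function names below are invented for illustration.

```python
import concurrent.futures
import time


def run_with_indefinite_retries(fn, delay_s=0.1):
    """Call fn until it succeeds, sleeping delay_s after each failure.

    Hypothetical stand-in for a task submitted with unlimited retries.
    """
    while True:
        try:
            return fn()
        except RuntimeError:
            time.sleep(delay_s)


def make_flaky_task(failures_before_success):
    """Return a task that fails a fixed number of times, then returns 'done'."""
    state = {"attempts": 0}

    def task():
        state["attempts"] += 1
        if state["attempts"] <= failures_before_success:
            raise RuntimeError("simulated transient failure")
        return "done"

    return task


def wait_for_result(failures_before_success, timeout_s):
    """Submit a retried task and wait for it, optionally with a fixed timeout.

    timeout_s=None means wait as long as the retries take.
    """
    task = make_flaky_task(failures_before_success)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(run_with_indefinite_retries, task)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            # The caller gave up, even though the retry loop is still running
            # and would eventually have produced a result.
            return "timed out"
```

With five simulated failures and 0.1 s between attempts, `wait_for_result(5, 0.1)` gives up before the retries finish, while `wait_for_result(5, None)` waits and receives the result. In this analogy the fixed wait, not the retry policy, decides whether the caller sees a failure.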

@aslonnie merged commit 276c75c into releases/2.50.0 on Oct 8, 2025
4 of 6 checks passed
@aslonnie deleted the ak/hsh-shfl-max-rtr-fix-cp branch on October 8, 2025, 23:05
4 participants