[Improvement] Optimize retry logic in ShuffleServerGrpcClient#sendShuffleData #339

@xianjingfeng

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Currently, org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient#sendShuffleData keeps retrying against a single shuffle server for a long time and only fails once rss.client.send.check.timeout.ms is reached. The exception is as follows:

Timeout: Task[2852_0] failed because 200 blocks can't be sent to shuffle server in 600000 ms.

As a result, the client does not send data to the other servers while it is stuck retrying.

How should we improve?

  1. Don't retry inside requirePreAllocation; retry only at the upper level.
  2. Set the default value of rss.client.retry.max to a smaller value, such as 10.
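
To illustrate point 1, here is a rough sketch of what retrying at the upper level could look like. It is not the actual Uniffle code: the class UpperLevelRetrySketch, the method sendToAll, and the trySendOnce callback are all hypothetical names, and trySendOnce stands in for a single "require pre-allocation, then send" call that does not retry internally. The maxRounds argument plays the role of a small rss.client.retry.max.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class UpperLevelRetrySketch {

    /**
     * Makes up to maxRounds passes over the servers that still have unsent data.
     * In each pass every pending server gets exactly one attempt, so a busy
     * server cannot hold up the others. Returns the servers that never succeeded.
     *
     * trySendOnce is a hypothetical stand-in for one non-retrying
     * "require pre-allocation, then send" call.
     */
    public static List<String> sendToAll(
            List<String> servers, int maxRounds, Predicate<String> trySendOnce) {
        List<String> pending = new ArrayList<>(servers);
        for (int round = 0; round < maxRounds && !pending.isEmpty(); round++) {
            List<String> stillFailing = new ArrayList<>();
            for (String server : pending) {
                if (!trySendOnce.test(server)) {
                    stillFailing.add(server); // retry this one in the next pass
                }
            }
            pending = stillFailing;
        }
        return pending;
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        // Assumption for the demo: server-a never has buffer space,
        // while server-b and server-c accept the data immediately.
        List<String> failed = sendToAll(
                Arrays.asList("server-a", "server-b", "server-c"),
                10,
                server -> {
                    if (server.equals("server-a")) {
                        return false;
                    }
                    sent.add(server);
                    return true;
                });
        System.out.println("sent=" + sent + " failed=" + failed);
    }
}
```

With this shape, the data for the healthy servers goes out in the first pass, and the busy server is reported as failed after a bounded number of attempts instead of blocking until the timeout.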

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
