What would you like to be improved?
Currently, org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient#sendShuffleData keeps retrying against a single shuffle server for a long time and only fails once rss.client.send.check.timeout.ms is reached. The exception is as follows:
Timeout: Task[2852_0] failed because 200 blocks can't be sent to shuffle server in 600000 ms.
As a result, the client spends the whole timeout on one server and never tries to send the data to other servers.
How should we improve?
- Don't retry in requirePreAllocation; retry only at the upper level.
- Set the default value of rss.client.retry.max to a smaller value, such as 10.
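To illustrate the intent of the two proposals, here is a minimal sketch of bounded per-server retries with failover handled at the upper level. It is not Uniffle code: the class FailoverSendSketch, the method sendWithFailover, the server names, and the trySend callback are all hypothetical, and the retry limit of 10 only mirrors the value suggested above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not part of Uniffle: the upper level owns the retry loop.
final class FailoverSendSketch {

  /**
   * Tries each server in order, making at most maxRetriesPerServer attempts per server.
   * Returns the first server that accepted the data, or throws if every server failed.
   * trySend stands in for a single, non-retrying send call (an assumption for this sketch).
   */
  static String sendWithFailover(
      List<String> servers, int maxRetriesPerServer, Predicate<String> trySend) {
    for (String server : servers) {
      for (int attempt = 0; attempt < maxRetriesPerServer; attempt++) {
        if (trySend.test(server)) {
          return server;
        }
      }
      // Retry budget for this server is exhausted: move on to the next server
      // instead of waiting for a long global timeout.
    }
    throw new IllegalStateException("All servers rejected the data");
  }

  public static void main(String[] args) {
    // Hypothetical scenario: server-1 is always unavailable, server-2 accepts the data.
    String accepted =
        sendWithFailover(Arrays.asList("server-1", "server-2"), 10, s -> s.equals("server-2"));
    System.out.println("Data accepted by " + accepted);
  }
}
```

With a small per-server limit, a persistently unavailable server costs only a bounded number of attempts before the client moves on, rather than consuming the whole rss.client.send.check.timeout.ms window.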