Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the feature
When the ShuffleServer load is high, the existing metrics do not directly show whether client reads and writes have been significantly affected.
Motivation
Accurately determine whether the current service load is causing significant delays to client reads and writes.
Describe the solution
Delay monitoring is divided into two parts. The first part is the delay of the ShuffleServer processing logic, for which we can add metrics directly. The second part is the delay before the ShuffleServer processing logic runs, including network delay and RPC queue waiting time.
For the second part, the client could record a timestamp just before it initiates a read or write request and include that timestamp in the request. When ShuffleServer receives the request, it can compute how long the request was delayed and record that value in its metrics. gRPC may also support a related mechanism.
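The timestamp idea can be sketched roughly as follows. This is only an illustration in Python, not the project's code: `ShuffleRequest`, `build_request`, and `TransportDelayRecorder` are hypothetical names, and the sketch assumes client and server clocks are reasonably synchronized (for example via NTP), since the measured value includes any clock skew.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ShuffleRequest:
    """Hypothetical request that carries the client's send time."""
    payload: bytes
    send_timestamp_ms: int


def build_request(payload, now_ms=None):
    """Client side: stamp the request just before it is sent."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return ShuffleRequest(payload=payload, send_timestamp_ms=now_ms)


@dataclass
class TransportDelayRecorder:
    """Server side: collects network + RPC queue delays, in milliseconds."""
    samples_ms: list = field(default_factory=list)

    def on_request_start(self, request, now_ms=None):
        """Call when the handler starts processing the request."""
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        # Clamp at zero: clock skew between hosts can make the difference negative.
        delay_ms = max(0, now_ms - request.send_timestamp_ms)
        self.samples_ms.append(delay_ms)
        return delay_ms


# Usage with fixed clocks so the arithmetic is easy to follow.
recorder = TransportDelayRecorder()
request = build_request(b"block-data", now_ms=1_000)
delay = recorder.on_request_start(request, now_ms=1_035)
```

Calling `on_request_start` at the point where the handler begins executing, rather than when the bytes arrive, means the sample covers both the network transfer and the time spent waiting in the RPC queue.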
We can then measure the processing delay of the current ShuffleServer through percentile indicators such as p95 and p99.
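To make the p95/p99 indicators concrete, here is a minimal nearest-rank percentile sketch over recorded delay samples. The function name `percentile` and the sample values are made up for illustration; a real server would more likely rely on the summary or histogram type of its metrics library than keep every sample in memory.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it.
    `p` is expected to be an integer in the range 1..100."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up delays in ms: most requests are fast, a few are very slow.
delays_ms = [10] * 90 + [50] * 8 + [500] * 2
p95 = percentile(delays_ms, 95)
p99 = percentile(delays_ms, 99)
```

With this sample set the average stays low while p99 lands on the slow tail, which is why percentiles are a better signal than the mean for deciding whether load is hurting clients.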
Additional context
No
Are you willing to submit PR?
- Yes I am willing to submit a PR!