[ISSUE-309][FEATURE] Support ShuffleServer latency metrics. #327
roryqi merged 7 commits into apache:master from
Conversation
  return counterGrpcTotal;
}

public Map<String, Summary> getSendTimeSummaryMap() {
What's the meaning of sendTime?
Could we have a better name? Transport time?
It is the time interval from when the client sends the request to when the ShuffleServerGrpcService receives it.
> Could we have a better name? Transport time?
Sounds great!
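For readers following along, here is a minimal sketch of what the renamed transport-time metric measures, assuming the client stamps the request with `System.currentTimeMillis()`; the class and method names below are hypothetical, not code from this PR:

```java
// A minimal sketch of the "transport time" idea, not this PR's actual code.
// The client stamps the outgoing request with its wall-clock time; on receipt,
// the server subtracts that stamp from its own clock to get the transport time.
public class TransportTimeSketch {

  // Client side: taken right before the request is sent over gRPC.
  public static long stampSendTime() {
    return System.currentTimeMillis();
  }

  // Server side: elapsed time between the client's stamp and "now" on the server.
  public static long computeTransportTime(long sendTime) {
    return System.currentTimeMillis() - sendTime;
  }
}
```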
long requireBufferId = req.getRequireBufferId();
long sendTime = req.getSendTime();
if (sendTime > 0) {
  shuffleServer.getGrpcMetrics().recordSendTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
Do we need to consider the data size when we calculate the metrics?
I don't think the amount of data will cause great fluctuations in latency. For example, 100K costs 1ms and 1M costs 10ms, which looks like a normal fluctuation, but the latency may rise to 10s when the server load is high (according to observations in our production environment). Of course, if we wanted to account for the amount of data, we could divide the transport time by the data size. Do you have any better suggestions?
Makes sense. Could we add some comments to explain why we don't use the size of the data?
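For illustration only, here is a rough sketch of the alternative raised above, normalizing the transport time by the payload size, which this PR deliberately does not adopt; the class name and the per-MB granularity are assumptions:

```java
// A rough sketch of the size-normalized alternative discussed above; the PR records
// the plain transport time instead. Class name and MB granularity are assumptions.
public class SizeNormalizedLatencySketch {

  private static final double BYTES_PER_MB = 1024.0 * 1024.0;

  // Transport time divided by payload size, so larger requests are not penalized
  // simply for carrying more data.
  public static double latencyMsPerMb(long transportTimeMs, long payloadBytes) {
    double sizeInMb = Math.max(payloadBytes / BYTES_PER_MB, 1e-6); // avoid dividing by zero
    return transportTimeMs / sizeInMb;
  }
}
```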
Codecov Report
@@             Coverage Diff              @@
##             master     #327      +/-   ##
============================================
- Coverage     61.21%   61.08%   -0.14%
- Complexity     1506     1507       +1
============================================
  Files           185      185
  Lines          9360     9405      +45
  Branches        908      914       +6
============================================
+ Hits           5730     5745      +15
- Misses         3325     3355      +30
  Partials        305      305
…uniffle into latency_metrics merge.
transportTimeSummaryMap.putIfAbsent(GET_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_SHUFFLE_DATA_SEND_LATENCY));
transportTimeSummaryMap.putIfAbsent(GET_IN_MEMORY_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_IN_MEMORY_SHUFFLE_DATA_SEND_LATENCY));
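As a sketch of how such a per-method Summary map can be wired up with the Prometheus Java client (the metric names and helper class below are illustrative, not the project's actual constants):

```java
import io.prometheus.client.Summary;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A sketch of keeping one transport-time Summary per gRPC method and recording
// observations against it, assuming the Prometheus Java client. Metric and class
// names are illustrative, not the project's actual constants.
public class GrpcTransportTimeMetricsSketch {

  private final Map<String, Summary> transportTimeSummaryMap = new ConcurrentHashMap<>();

  // Register a Summary for a gRPC method (call once per method; the Prometheus
  // registry rejects duplicate metric names).
  public void register(String method, String metricName) {
    transportTimeSummaryMap.putIfAbsent(method,
        Summary.build().name(metricName).help("gRPC transport time in ms").register());
  }

  // Record one observation for the given method, ignoring unknown methods.
  public void recordTransportTime(String method, long transportTimeMs) {
    Summary summary = transportTimeSummaryMap.get(method);
    if (summary != null) {
      summary.observe(transportTimeMs);
    }
  }
}
```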
proto/src/main/proto/Rss.proto
Outdated
int32 partitionNum = 5;
int64 offset = 6;
int32 length = 7;
int64 sendTime = 8;
Could we give a better name?
proto/src/main/proto/Rss.proto
Outdated
int32 partitionId = 3;
int64 lastBlockId = 4;
int32 readBufferSize = 5;
int64 sendTime = 6;
Time is a duration. This should be a timestamp.
 * The amount of data will not cause great fluctuations in latency. For example, 100K costs 1ms,
 * and 1M costs 10ms. This seems like a normal fluctuation, but it may rise to 10s when the server load is high.
 */
shuffleServer.getGrpcMetrics().recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
System.currentTimeMillis() - sendTime may be less than 0, because the two timestamps are generated on different machines.
Does the negative number influence our metrics?
Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be kept in sync. For values less than 0 we can filter them out, but clock skew will still affect the metrics.
> Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be kept in sync. For values less than 0 we can filter them out, but clock skew will still affect the metrics.
OK. Actually, we should have a document that tells users how to use the metrics and what to be aware of. Since we don't have such a document yet, we can only add some comments.
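To make the agreed-upon guard concrete, here is a sketch that assumes a recordTransportTime(method, millis) hook like the one shown in the diff; the wrapper class and method names are illustrative, not this PR's code:

```java
// A sketch of the negative-value guard discussed above, assuming a
// recordTransportTime(method, millis) hook like the one in the diff; the wrapper
// class and method names here are illustrative.
public class TransportTimeGuardSketch {

  /**
   * sendTime is taken on the client machine. If the client's clock runs ahead of the
   * server's, the difference can be negative; such samples are dropped here. Clock skew
   * still distorts the recorded latency, so client and server clocks should be kept in
   * sync (e.g. via NTP) for this metric to be trustworthy.
   */
  public static void recordIfValid(ShuffleServerGrpcMetrics metrics, long sendTime) {
    long transportTime = System.currentTimeMillis() - sendTime;
    if (transportTime >= 0) {
      metrics.recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD, transportTime);
    }
  }
}
```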
What changes were proposed in this pull request?
For #309, support ShuffleServer latency metrics.
Why are the changes needed?
These metrics make it possible to accurately determine whether the current server load is causing large delays for the client's reads and writes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
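For context, a sketch of the kind of unit test "UT" might refer to, assuming the Prometheus Java client and JUnit 4; the metric name and test class are hypothetical, not the PR's actual test code:

```java
import static org.junit.Assert.assertEquals;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Summary;
import org.junit.Test;

// A hedged sketch of a latency-metric unit test: observe a value on a transport-time
// Summary and assert that its count and sum samples are exported. Names are illustrative.
public class TransportTimeMetricsSketchTest {

  @Test
  public void recordedTransportTimeShowsUpInRegistry() {
    CollectorRegistry registry = new CollectorRegistry();
    Summary transportTime = Summary.build()
        .name("grpc_send_shuffle_data_transport_latency")
        .help("Time from client send to server receive, in ms")
        .register(registry);

    transportTime.observe(42);

    assertEquals(1.0, registry.getSampleValue(
        "grpc_send_shuffle_data_transport_latency_count"), 0.001);
    assertEquals(42.0, registry.getSampleValue(
        "grpc_send_shuffle_data_transport_latency_sum"), 0.001);
  }
}
```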