[ISSUE-309][FEATURE] Support ShuffleServer latency metrics. #327
roryqi merged 7 commits into apache:master from
Conversation
  return counterGrpcTotal;
}

public Map<String, Summary> getSendTimeSummaryMap() {
What's the meaning of sendTime?
Could we have a better name? Transport time?
It is the time interval from when the client sends the request to when the ShuffleServerGrpcService receives it.
> Could we have a better name? Transport time?
Sounds great!
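For readers following along, here is a minimal sketch of what the renamed transport-time metric measures, assuming the client stamps the request with `System.currentTimeMillis()`; the class and method names below are hypothetical, not code from this PR:

```java
// A minimal sketch of the "transport time" idea, not this PR's actual code.
// The client stamps the outgoing request with its wall-clock time; on receipt,
// the server subtracts that stamp from its own clock to get the transport time.
public class TransportTimeSketch {

  // Client side: taken right before the request is sent over gRPC.
  public static long stampSendTime() {
    return System.currentTimeMillis();
  }

  // Server side: elapsed time between the client's stamp and "now" on the server.
  public static long computeTransportTime(long sendTime) {
    return System.currentTimeMillis() - sendTime;
  }
}
```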
long requireBufferId = req.getRequireBufferId();
long sendTime = req.getSendTime();
if (sendTime > 0) {
  shuffleServer.getGrpcMetrics().recordSendTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
Do we need to consider the data size when we calculate the metrics?
I don't think the amount of data will cause great fluctuations in latency. For example, 100K costs 1ms and 1M costs 10ms, which looks like a normal fluctuation, but the latency may rise to 10s when the server load is high (according to observations in our production environment). Of course, if we wanted to account for the amount of data, we could divide the transport time by the data size. Do you have any better suggestions?
Makes sense. Could we add some comments to explain why we don't use the size of the data?
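For illustration only, here is a rough sketch of the alternative raised above, normalizing the transport time by the payload size, which this PR deliberately does not adopt; the class name and the per-MB granularity are assumptions:

```java
// A rough sketch of the size-normalized alternative discussed above; the PR records
// the plain transport time instead. Class name and MB granularity are assumptions.
public class SizeNormalizedLatencySketch {

  private static final double BYTES_PER_MB = 1024.0 * 1024.0;

  // Transport time divided by payload size, so larger requests are not penalized
  // simply for carrying more data.
  public static double latencyMsPerMb(long transportTimeMs, long payloadBytes) {
    double sizeInMb = Math.max(payloadBytes / BYTES_PER_MB, 1e-6); // avoid dividing by zero
    return transportTimeMs / sizeInMb;
  }
}
```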
Codecov Report
@@             Coverage Diff              @@
##             master     #327      +/-   ##
============================================
- Coverage     61.21%   61.08%   -0.14%
- Complexity     1506     1507       +1
============================================
  Files           185      185
  Lines          9360     9405      +45
  Branches        908      914       +6
============================================
+ Hits           5730     5745      +15
- Misses         3325     3355      +30
  Partials        305      305
…uniffle into latency_metrics merge.
transportTimeSummaryMap.putIfAbsent(GET_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_SHUFFLE_DATA_SEND_LATENCY));
transportTimeSummaryMap.putIfAbsent(GET_IN_MEMORY_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_IN_MEMORY_SHUFFLE_DATA_SEND_LATENCY));
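As a sketch of how such a per-method Summary map can be wired up with the Prometheus Java client (the metric names and helper class below are illustrative, not the project's actual constants):

```java
import io.prometheus.client.Summary;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A sketch of keeping one transport-time Summary per gRPC method and recording
// observations against it, assuming the Prometheus Java client. Metric and class
// names are illustrative, not the project's actual constants.
public class GrpcTransportTimeMetricsSketch {

  private final Map<String, Summary> transportTimeSummaryMap = new ConcurrentHashMap<>();

  // Register a Summary for a gRPC method (call once per method; the Prometheus
  // registry rejects duplicate metric names).
  public void register(String method, String metricName) {
    transportTimeSummaryMap.putIfAbsent(method,
        Summary.build().name(metricName).help("gRPC transport time in ms").register());
  }

  // Record one observation for the given method, ignoring unknown methods.
  public void recordTransportTime(String method, long transportTimeMs) {
    Summary summary = transportTimeSummaryMap.get(method);
    if (summary != null) {
      summary.observe(transportTimeMs);
    }
  }
}
```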
proto/src/main/proto/Rss.proto
Outdated
int32 partitionNum = 5;
int64 offset = 6;
int32 length = 7;
int64 sendTime = 8;
Could we give a better name?
proto/src/main/proto/Rss.proto
Outdated
int32 partitionId = 3;
int64 lastBlockId = 4;
int32 readBufferSize = 5;
int64 sendTime = 6;
Time is a duration. This should be a timestamp.
 * The amount of data will not cause great fluctuations in latency. For example, 100K costs 1ms,
 * and 1M costs 10ms. This seems like a normal fluctuation, but it may rise to 10s when the server load is high.
 */
shuffleServer.getGrpcMetrics().recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
System.currentTimeMillis() - sendTime may be less than 0, because the two timestamps are generated on different machines.
Does the negative number influence our metrics?
Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be kept in sync. For values less than 0 we can filter them out, but clock skew will still affect the metrics.
> Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be kept in sync. For values less than 0 we can filter them out, but clock skew will still affect the metrics.
OK. Actually, we should have a document that tells users how to use the metrics and what to be aware of. Since we don't have such a document yet, we can only add some comments.
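To make the agreed-upon guard concrete, here is a sketch that assumes a recordTransportTime(method, millis) hook like the one shown in the diff; the wrapper class and method names are illustrative, not this PR's code:

```java
// A sketch of the negative-value guard discussed above, assuming a
// recordTransportTime(method, millis) hook like the one in the diff; the wrapper
// class and method names here are illustrative.
public class TransportTimeGuardSketch {

  /**
   * sendTime is taken on the client machine. If the client's clock runs ahead of the
   * server's, the difference can be negative; such samples are dropped here. Clock skew
   * still distorts the recorded latency, so client and server clocks should be kept in
   * sync (e.g. via NTP) for this metric to be trustworthy.
   */
  public static void recordIfValid(ShuffleServerGrpcMetrics metrics, long sendTime) {
    long transportTime = System.currentTimeMillis() - sendTime;
    if (transportTime >= 0) {
      metrics.recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD, transportTime);
    }
  }
}
```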
What changes were proposed in this pull request?
For #309, support ShuffleServer latency metrics.
Why are the changes needed?
These metrics make it possible to accurately determine whether the current server load is causing large delays for the client's reads and writes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
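For context, a sketch of the kind of unit test "UT" might refer to, assuming the Prometheus Java client and JUnit 4; the metric name and test class are hypothetical, not the PR's actual test code:

```java
import static org.junit.Assert.assertEquals;

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Summary;
import org.junit.Test;

// A hedged sketch of a latency-metric unit test: observe a value on a transport-time
// Summary and assert that its count and sum samples are exported. Names are illustrative.
public class TransportTimeMetricsSketchTest {

  @Test
  public void recordedTransportTimeShowsUpInRegistry() {
    CollectorRegistry registry = new CollectorRegistry();
    Summary transportTime = Summary.build()
        .name("grpc_send_shuffle_data_transport_latency")
        .help("Time from client send to server receive, in ms")
        .register(registry);

    transportTime.observe(42);

    assertEquals(1.0, registry.getSampleValue(
        "grpc_send_shuffle_data_transport_latency_count"), 0.001);
    assertEquals(42.0, registry.getSampleValue(
        "grpc_send_shuffle_data_transport_latency_sum"), 0.001);
  }
}
```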