
Add simple latency histogram metrics #89

Merged (5 commits, Feb 28, 2022)

Conversation

YongGang (Contributor) commented Feb 2, 2022

Changes:

  • Upgrade codebase to Java 11
  • Switch to SpotBugs, as FindBugs doesn't work on Java 9+
  • Add histogram latency metrics

pdavidson100 (Contributor) left a comment

Thanks @YongGang. Quick first-pass review. It generally looks good. I have some comments and suggestions. I would also like to see some tests, even if it means refactoring a little to support it.

@@ -15,11 +17,25 @@

  private static final Logger logger = LoggerFactory.getLogger(MirrorJmxReporter.class);

  public static Map<Integer, String> LATENCY_BUCKETS =
      Map.of(
Contributor

Could we use TimeUnit.MINUTES.toMillis(60) to be more explicit about units? By convention, milliseconds are usually stored as Longs anyway.
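
A minimal sketch of the suggested shape, using Long keys and the bucket labels discussed in this thread:

  // Bucket edge in millis -> bucket label; TimeUnit (java.util.concurrent.TimeUnit)
  // keeps the units explicit, and the keys become Longs as suggested.
  public static final Map<Long, String> LATENCY_BUCKETS =
      Map.of(
          TimeUnit.MINUTES.toMillis(0), "0m",
          TimeUnit.MINUTES.toMillis(5), "5m",
          TimeUnit.MINUTES.toMillis(10), "10m",
          TimeUnit.MINUTES.toMillis(30), "30m",
          TimeUnit.MINUTES.toMillis(60), "60m");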

Contributor Author

updated

Contributor

It would be nice to make this configurable, but I think that's something we could come back to.

  private String replicationLatencySensorName(String topic) {
    return topic + "-" + "replication-latency";
  }

  private String histogramLatencySensorName(String topic, String bucket) {
    return topic + "-" + bucket + "-" + "histogram-latency";
Contributor

I find the order strange here; we generally go from less to more specific in naming. How about topic + "-" + "histogram-latency" + "-" + bucket?

Contributor Author

I think this name aligns with the sensor naming pattern in the S3 codebase, which uses topic and connector.

Contributor

@YongGang OK, I guess the sensor name doesn't matter too much anyway - the sensor names and tags look good.

        .stream()
        .forEach(
            sensorEntry -> {
              if (millis > sensorEntry.getKey()) {
Contributor

I think this should be >= so we are guaranteed to catch all records in the "0m" bucket.

Contributor

Actually, I think you can simplify a bit here:

bucketSensors
            .forEach((edgeMillis, bucket) -> {
              if (millis >= edgeMillis) {
                bucket.record(1);
              }
            });

          30 * 60 * 1000,
          "30m",
          60 * 60 * 1000,
          "60m");
pdavidson100 (Contributor) commented Feb 3, 2022

Before we commit to these buckets we should discuss. We definitely need 0, 5, and 10 minutes. Perhaps we should add one more for very old records, maybe 12 or 24 hours?

YongGang (Contributor Author) commented Feb 5, 2022

> Thanks @YongGang. Quick first-pass review. It generally looks good. I have some comments and suggestions. I would also like to see some tests, even if it means refactoring a little to support it.

I added a test class, but it's a simple one. Due to the use of WindowedSum for the metrics, a precise test case would basically require re-engineering the logic in this class, which may not be worth the effort.
https://github.com/apache/kafka/blob/2.6/clients/src/main/java/org/apache/kafka/common/metrics/stats/Rate.java#L71-L91

pdavidson100 (Contributor) left a comment

Some more questions and comments.


@@ -38,16 +56,25 @@
          "replication-latency-ms-avg", SOURCE_CONNECTOR_GROUP,
          "Average time it takes records to replicate from source to target cluster.", TOPIC_TAGS);

  protected static final MetricNameTemplate HISTOGRAM_LATENCY =
      new MetricNameTemplate(
          "histogram-bucket-latency",
Contributor

Suggest: replication-latency-histogram, to tie it in with the other replication latency metrics (max and avg).
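
For illustration only, the renamed template might read as below (group and description copied from the surrounding diff; the tag argument is cut off in this view, so TOPIC_TAGS here is an assumption, and the real template presumably also carries a bucket tag):

  protected static final MetricNameTemplate HISTOGRAM_LATENCY =
      new MetricNameTemplate(
          "replication-latency-histogram", // suggested name, replacing "histogram-bucket-latency"
          SOURCE_CONNECTOR_GROUP,
          "Metrics counting the number of records produced in each of a small set of latency buckets.",
          TOPIC_TAGS); // assumed tag set; not visible in this diff excerpt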

      new MetricNameTemplate(
          "histogram-bucket-latency",
          SOURCE_CONNECTOR_GROUP,
          "Metrics counting the number of records produced in each of a small set of latency buckets.",
pdavidson100 (Contributor) commented Feb 8, 2022

You're using a rate metric with a time unit of SECONDS here, so this is reporting the number of records per second for each "bucket". And this is actually a kind of cumulative histogram, not a normal histogram, so I suggest: "Cumulative histogram counting records delivered per second with latency exceeding a set of fixed bucket thresholds."
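
For context, a per-bucket sensor set up along those lines would look roughly like the sketch below (not the exact PR code; metrics and topicTags are assumed names for the Metrics registry and a tag-building helper):

    // Each bucket sensor records 1 per matching record; Rate(SECONDS, WindowedSum)
    // then divides the windowed sum by elapsed seconds, so each "bucket" metric
    // reads as records per second whose latency exceeds that bucket's threshold.
    Sensor bucketSensor = metrics.sensor(histogramLatencySensorName(topic, bucket));
    bucketSensor.add(
        metrics.metricInstance(HISTOGRAM_LATENCY, topicTags(topic, bucket)), // topicTags(...) is an assumed helper
        new Rate(TimeUnit.SECONDS, new WindowedSum()));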

@@ -112,6 +149,14 @@ public synchronized void recordMirrorLatency(String topic, long millis) {
    if (sensor != null) {
      sensor.record((double) millis);
    }

    Map<Long, Sensor> bucketSensors = histogramLatencySensors.get(topic);
Contributor

While I like how we're explicitly reporting zeros for empty buckets for every topic, I'm concerned it might generate too many metrics and slow down our queries. Perhaps we should only report non-zero values? I'm not sure if that would be easy, or even possible, but something to consider. Perhaps we can filter out zeros in jmxtrans instead?

Contributor Author

@pdavidson100 By non-zero values, do you mean we don't report records in the 0m bucket? As these metrics are only reported every second, I don't think this will generate many metrics.

Contributor

I mean that we report metrics with a count of zero in most buckets, because latency is usually < 10 mins; that adds up to a lot of useless data sent to Argus (one metric per minute per topic per bucket per worker). I'm wondering if we can/should reduce that by not sending metrics for empty buckets.

Contributor Author

My understanding is that we only record entries whose lag exceeds the bucket boundaries, so we won't report zero values.
We created sensors for each bucket, but some of them may never or only seldom be used.
That's also what I see when testing: the metrics graph shows no data for the 12h bucket, for example.
https://github.com/salesforce/mirus/pull/89/files?diff=unified&w=1#diff-0c6c96fc358c3316aa5312ae1ce3c23580d68722f1db56874ac4366994bedd1eR156-R157

Contributor

Strange - I made this comment after noticing a lot of zeros in Argus for the PRD test cluster.

d4v1de (Collaborator) commented Feb 22, 2022

Instead of populating all of the buckets that are smaller than the reported latency, what do you think of the following approach?

  • store the bucketSensors in a SortedMap, in descending order
  • loop through them like we do here
  • break the loop right after we have recorded the entry once (see the sketch after this list)
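
A minimal sketch of that alternative, assuming the same bucketSensors map as in the diff above (variable names are illustrative):

    // Buckets keyed by edge millis, sorted descending via java.util.TreeMap with
    // Comparator.reverseOrder(); record only into the largest matching bucket.
    SortedMap<Long, Sensor> bucketSensors = new TreeMap<>(Comparator.reverseOrder());
    // ... bucketSensors populated with edgeMillis -> bucket sensor ...
    for (Map.Entry<Long, Sensor> entry : bucketSensors.entrySet()) {
      if (millis >= entry.getKey()) {
        entry.getValue().record(1);
        break; // stop after the first (largest) matching bucket edge
      }
    }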

Collaborator

(unless we need this to be a cumulative histogram)

Collaborator

After chatting with Paul I realized that there are advantages in using a cumulative histogram, specifically that adding new buckets in the future won't affect existing queries, so I see now why the current approach is preferable.

                MirrorJmxReporter.HISTOGRAM_LATENCY.description(),
                tags))
            .metricValue();
    Assert.assertNotNull(value);
Collaborator

What do you think of also checking that there is exactly one occurrence per bucket?

Collaborator

Regardless of whether we stick with the cumulative histogram implementation, shall we also check that the other buckets are empty?
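
Purely as a sketch of what those additional checks might look like (valueFor is a hypothetical helper standing in for however the test resolves a bucket's metric value, and LATENCY_BUCKETS is assumed to use Long keys after the TimeUnit change above):

    // Illustrative only: after a single 500 ms record (as in the test below), the
    // cumulative histogram should count it in the "0m" bucket and leave the larger
    // buckets empty. Exact values depend on the Rate/WindowedSum windowing noted earlier.
    for (Map.Entry<Long, String> bucket : MirrorJmxReporter.LATENCY_BUCKETS.entrySet()) {
      double value = valueFor(TEST_TOPIC, bucket.getValue()); // hypothetical lookup helper
      if (bucket.getKey() <= 500L) {
        Assert.assertTrue("expected a count in bucket " + bucket.getValue(), value > 0.0);
      } else {
        Assert.assertEquals("expected empty bucket " + bucket.getValue(), 0.0, value, 0.0);
      }
    }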

    TopicPartition topicPartition = new TopicPartition(TEST_TOPIC, 1);
    mirrorJmxReporter.addTopics(List.of(topicPartition));

    mirrorJmxReporter.recordMirrorLatency(TEST_TOPIC, 500);
Collaborator

Have you considered adding a comment here, to declare in which buckets these entries are expected to land?
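
For example (bucket thresholds taken from the discussion above):

    // 500 ms is below every non-zero bucket threshold, so with the cumulative
    // histogram this record is expected to be counted only in the "0m" bucket.
    mirrorJmxReporter.recordMirrorLatency(TEST_TOPIC, 500);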
