
Add simple latency histogram metrics #89

Merged (5 commits, Feb 28, 2022)

Conversation

YongGang (Contributor) commented Feb 2, 2022

Changes:

  • Upgrade codebase to Java 11
  • Switch to SpotBugs, as FindBugs doesn't work on Java 9+
  • Add histogram latency metrics

pdavidson100 (Contributor) left a comment

Thanks @YongGang. Quick first-pass review. It generally looks good. I have some comments and suggestions. I would also like to see some tests, even if it means refactoring a little to support it.

@@ -15,11 +17,25 @@

  private static final Logger logger = LoggerFactory.getLogger(MirrorJmxReporter.class);

  public static Map<Integer, String> LATENCY_BUCKETS =
      Map.of(
Contributor

Could we use TimeUnit.MINUTES.toMillis(60) to be more explicit about units? By convention, milliseconds are usually stored as Longs anyway.
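
A minimal sketch of the suggested shape, using Long keys and the bucket labels discussed in this thread:

  // Bucket edge in millis -> bucket label; TimeUnit (java.util.concurrent.TimeUnit)
  // keeps the units explicit, and the keys become Longs as suggested.
  public static final Map<Long, String> LATENCY_BUCKETS =
      Map.of(
          TimeUnit.MINUTES.toMillis(0), "0m",
          TimeUnit.MINUTES.toMillis(5), "5m",
          TimeUnit.MINUTES.toMillis(10), "10m",
          TimeUnit.MINUTES.toMillis(30), "30m",
          TimeUnit.MINUTES.toMillis(60), "60m");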

Contributor Author

updated

Contributor

It would be nice to make this configurable, but I think that's something we could come back to.

  private String replicationLatencySensorName(String topic) {
    return topic + "-" + "replication-latency";
  }

  private String histogramLatencySensorName(String topic, String bucket) {
    return topic + "-" + bucket + "-" + "histogram-latency";
Contributor

I find the order strange here; we generally go from less to more specific in naming. How about topic + "-" + "histogram-latency" + "-" + bucket?

Contributor Author

I think this name aligns with the sensor naming pattern in the S3 codebase, which uses topic and connector.

Contributor

@YongGang OK, I guess the sensor name doesn't matter too much anyway - the sensor names and tags look good.

        .stream()
        .forEach(
            sensorEntry -> {
              if (millis > sensorEntry.getKey()) {
Contributor

I think this should be >= so we are guaranteed to catch all records in the "0m" bucket.

Contributor

Actually, I think you can simplify a bit here:

bucketSensors
            .forEach((edgeMillis, bucket) -> {
              if (millis >= edgeMillis) {
                bucket.record(1);
              }
            });

          30 * 60 * 1000,
          "30m",
          60 * 60 * 1000,
          "60m");
pdavidson100 (Contributor) commented Feb 3, 2022

Before we commit to these buckets we should discuss. We definitely need 0, 5, and 10 minutes. Perhaps we should add one more for very old records, maybe 12 or 24 hours?

YongGang (Contributor Author) commented Feb 5, 2022

> Thanks @YongGang. Quick first-pass review. It generally looks good. I have some comments and suggestions. I would also like to see some tests, even if it means refactoring a little to support it.

I added a test class, but it's a simple one. Due to the use of WindowedSum for the metrics, a precise test case would basically require re-engineering the logic in this class, which may not be worth the effort.
https://github.com/apache/kafka/blob/2.6/clients/src/main/java/org/apache/kafka/common/metrics/stats/Rate.java#L71-L91

pdavidson100 (Contributor) left a comment

Some more questions and comments.


@@ -38,16 +56,25 @@
          "replication-latency-ms-avg", SOURCE_CONNECTOR_GROUP,
          "Average time it takes records to replicate from source to target cluster.", TOPIC_TAGS);

  protected static final MetricNameTemplate HISTOGRAM_LATENCY =
      new MetricNameTemplate(
          "histogram-bucket-latency",
Contributor

Suggest: replication-latency-histogram, to tie it in with the other replication latency metrics (max and avg).
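
For illustration only, the renamed template might read as below (group and description copied from the surrounding diff; the tag argument is cut off in this view, so TOPIC_TAGS here is an assumption, and the real template presumably also carries a bucket tag):

  protected static final MetricNameTemplate HISTOGRAM_LATENCY =
      new MetricNameTemplate(
          "replication-latency-histogram", // suggested name, replacing "histogram-bucket-latency"
          SOURCE_CONNECTOR_GROUP,
          "Metrics counting the number of records produced in each of a small set of latency buckets.",
          TOPIC_TAGS); // assumed tag set; not visible in this diff excerpt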

      new MetricNameTemplate(
          "histogram-bucket-latency",
          SOURCE_CONNECTOR_GROUP,
          "Metrics counting the number of records produced in each of a small set of latency buckets.",
pdavidson100 (Contributor) commented Feb 8, 2022

You're using a rate metric with a time unit of SECONDS here, so this is reporting the number of records per second for each "bucket". And this is actually a kind of cumulative histogram, not a normal histogram, so I suggest: "Cumulative histogram counting records delivered per second with latency exceeding a set of fixed bucket thresholds."
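
For context, a per-bucket sensor set up along those lines would look roughly like the sketch below (not the exact PR code; metrics and topicTags are assumed names for the Metrics registry and a tag-building helper):

    // Each bucket sensor records 1 per matching record; Rate(SECONDS, WindowedSum)
    // then divides the windowed sum by elapsed seconds, so each "bucket" metric
    // reads as records per second whose latency exceeds that bucket's threshold.
    Sensor bucketSensor = metrics.sensor(histogramLatencySensorName(topic, bucket));
    bucketSensor.add(
        metrics.metricInstance(HISTOGRAM_LATENCY, topicTags(topic, bucket)), // topicTags(...) is an assumed helper
        new Rate(TimeUnit.SECONDS, new WindowedSum()));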

@@ -112,6 +149,14 @@ public synchronized void recordMirrorLatency(String topic, long millis) {
    if (sensor != null) {
      sensor.record((double) millis);
    }

    Map<Long, Sensor> bucketSensors = histogramLatencySensors.get(topic);
Contributor

While I like how we're explicitly reporting zeros for empty buckets for every topic, I'm concerned it might generate too many metrics and slow down our queries. Perhaps we should only report non-zero values? I'm not sure if that would be easy, or even possible, but something to consider. Perhaps we can filter out zeros in jmxtrans instead?

Contributor Author

@pdavidson100 By non-zero values, do you mean we don't report records in the 0m bucket? As these metrics are only reported every second, I don't think this will generate many metrics.

Contributor

I mean that we report metrics with a count of zero in most buckets, because latency is usually < 10 mins; that adds up to a lot of useless data sent to Argus (one metric per minute per topic per bucket per worker). I'm wondering if we can/should reduce that by not sending metrics for empty buckets.

Contributor Author

My understanding is that we only record entries whose lag exceeds the bucket boundaries, so we won't report zero values.
We created sensors for each bucket, but some of them may never or only seldom be used.
That's also what I see when testing: the metrics graph shows no data for the 12h bucket, for example.
https://github.com/salesforce/mirus/pull/89/files?diff=unified&w=1#diff-0c6c96fc358c3316aa5312ae1ce3c23580d68722f1db56874ac4366994bedd1eR156-R157

Contributor

Strange - I made this comment after noticing a lot of zeros in Argus for the PRD test cluster.

d4v1de (Collaborator) commented Feb 22, 2022

Instead of populating all of the buckets that are smaller than the reported latency, what do you think of the following approach?

  • store the bucketSensors in a SortedMap, in descending order
  • loop through them like we do here
  • break the loop right after we have recorded the entry once (see the sketch after this list)
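
A minimal sketch of that alternative, assuming the same bucketSensors map as in the diff above (variable names are illustrative):

    // Buckets keyed by edge millis, sorted descending via java.util.TreeMap with
    // Comparator.reverseOrder(); record only into the largest matching bucket.
    SortedMap<Long, Sensor> bucketSensors = new TreeMap<>(Comparator.reverseOrder());
    // ... bucketSensors populated with edgeMillis -> bucket sensor ...
    for (Map.Entry<Long, Sensor> entry : bucketSensors.entrySet()) {
      if (millis >= entry.getKey()) {
        entry.getValue().record(1);
        break; // stop after the first (largest) matching bucket edge
      }
    }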

Collaborator

(unless we need this to be a cumulative histogram)

Collaborator

After chatting with Paul I realized that there are advantages in using a cumulative histogram, specifically that adding new buckets in the future won't affect existing queries, so I see now why the current approach is preferable.

                MirrorJmxReporter.HISTOGRAM_LATENCY.description(),
                tags))
            .metricValue();
    Assert.assertNotNull(value);
Collaborator

What do you think of also checking that there is exactly one occurrence per bucket?

Collaborator

Regardless of whether we stick with the cumulative histogram implementation, shall we also check that the other buckets are empty?
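
Purely as a sketch of what those additional checks might look like (valueFor is a hypothetical helper standing in for however the test resolves a bucket's metric value, and LATENCY_BUCKETS is assumed to use Long keys after the TimeUnit change above):

    // Illustrative only: after a single 500 ms record (as in the test below), the
    // cumulative histogram should count it in the "0m" bucket and leave the larger
    // buckets empty. Exact values depend on the Rate/WindowedSum windowing noted earlier.
    for (Map.Entry<Long, String> bucket : MirrorJmxReporter.LATENCY_BUCKETS.entrySet()) {
      double value = valueFor(TEST_TOPIC, bucket.getValue()); // hypothetical lookup helper
      if (bucket.getKey() <= 500L) {
        Assert.assertTrue("expected a count in bucket " + bucket.getValue(), value > 0.0);
      } else {
        Assert.assertEquals("expected empty bucket " + bucket.getValue(), 0.0, value, 0.0);
      }
    }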

    TopicPartition topicPartition = new TopicPartition(TEST_TOPIC, 1);
    mirrorJmxReporter.addTopics(List.of(topicPartition));

    mirrorJmxReporter.recordMirrorLatency(TEST_TOPIC, 500);
Collaborator

Have you considered adding a comment here, to declare in which buckets these entries are expected to land?
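
For example (bucket thresholds taken from the discussion above):

    // 500 ms is below every non-zero bucket threshold, so with the cumulative
    // histogram this record is expected to be counted only in the "0m" bucket.
    mirrorJmxReporter.recordMirrorLatency(TEST_TOPIC, 500);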
