Add simple latency histogram metrics #89
Conversation
Thanks @YongGang. Quick first-pass review. It generally looks good. I have some comments and suggestions. I would also like to see some tests, even if it means refactoring a little to support it.
@@ -15,11 +17,25 @@
  private static final Logger logger = LoggerFactory.getLogger(MirrorJmxReporter.class);

  public static Map<Integer, String> LATENCY_BUCKETS =
      Map.of(
Could we use TimeUnit.MINUTES.toMillis(60) to be more explicit about units? By convention, milliseconds are usually stored as Longs anyway.
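A minimal sketch of what that suggestion could look like, assuming the keys become Longs in milliseconds and using the bucket labels discussed later in this review (the exact bucket set is still under discussion):

  import java.util.Map;
  import java.util.concurrent.TimeUnit;

  // Sketch only, not the PR's final code: thresholds expressed via TimeUnit so the
  // units are explicit, and stored as Longs per the convention mentioned above.
  public static Map<Long, String> LATENCY_BUCKETS =
      Map.of(
          TimeUnit.MINUTES.toMillis(0), "0m",
          TimeUnit.MINUTES.toMillis(5), "5m",
          TimeUnit.MINUTES.toMillis(10), "10m",
          TimeUnit.MINUTES.toMillis(30), "30m",
          TimeUnit.MINUTES.toMillis(60), "60m");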
updated
It would be nice to make this configurable, but I think that's something we could come back to.
  private String replicationLatencySensorName(String topic) {
    return topic + "-" + "replication-latency";
  }

  private String histogramLatencySensorName(String topic, String bucket) {
    return topic + "-" + bucket + "-" + "histogram-latency";
I find the order strange here; we generally go from less to more specific in naming. How about topic + "-" + "histogram-latency" + "-" + bucket?
I think this name is aligned with the sensor-naming pattern in the S3 codebase, which includes topic and connector.
@YongGang OK, I guess the sensor name doesn't matter too much anyway - the sensor names and tags look good.
        .stream()
        .forEach(
            sensorEntry -> {
              if (millis > sensorEntry.getKey()) {
I think this should be >= so we are guaranteed to catch all records in the "0m" bucket.
Actually, I think you can simplify a bit here:
bucketSensors
    .forEach((edgeMillis, bucket) -> {
      if (millis >= edgeMillis) {
        bucket.record(1);
      }
    });
            30 * 60 * 1000,
            "30m",
            60 * 60 * 1000,
            "60m");
Before we commit to these buckets we should discuss. We definitely need 0, 5, and 10 minutes. Perhaps we should add one more for very old records, maybe 12 or 24 hours?
I added a test class, but it's a simple one. Due to the use of …
Some more questions and comments.
@@ -38,16 +56,25 @@
          "replication-latency-ms-avg", SOURCE_CONNECTOR_GROUP,
          "Average time it takes records to replicate from source to target cluster.", TOPIC_TAGS);

  protected static final MetricNameTemplate HISTOGRAM_LATENCY =
      new MetricNameTemplate(
          "histogram-bucket-latency",
Suggest: replication-latency-histogram, to tie it in with the other replication latency metrics (max and avg).
      new MetricNameTemplate(
          "histogram-bucket-latency",
          SOURCE_CONNECTOR_GROUP,
          "Metrics counting the number of records produced in each of a small set of latency buckets.",
You're using a rate metric with a time unit of SECONDS here, so this is reporting the number of records per second for each "bucket". And this is actually a kind of cumulative histogram, not a normal histogram, so I suggest: "Cumulative histogram counting records delivered per second with latency exceeding a set of fixed bucket thresholds."
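To make the cumulative behavior concrete, here is a small standalone illustration (not the PR's code; plain int counters stand in for the rate-based sensors): a 7-minute latency increments both the 0m and 5m buckets, because it exceeds both thresholds.

  import java.util.Map;
  import java.util.TreeMap;
  import java.util.concurrent.TimeUnit;

  public class CumulativeBucketIllustration {
    public static void main(String[] args) {
      // Bucket thresholds in millis, counts start at zero.
      Map<Long, Integer> counts = new TreeMap<>();
      for (long minutes : new long[] {0, 5, 10, 30, 60}) {
        counts.put(TimeUnit.MINUTES.toMillis(minutes), 0);
      }

      long latencyMillis = TimeUnit.MINUTES.toMillis(7);

      // Cumulative: every bucket whose threshold is <= the latency is incremented.
      counts.replaceAll((edgeMillis, count) -> latencyMillis >= edgeMillis ? count + 1 : count);

      // Prints {0=1, 300000=1, 600000=0, 1800000=0, 3600000=0}
      System.out.println(counts);
    }
  }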
@@ -112,6 +149,14 @@ public synchronized void recordMirrorLatency(String topic, long millis) {
    if (sensor != null) {
      sensor.record((double) millis);
    }

    Map<Long, Sensor> bucketSensors = histogramLatencySensors.get(topic);
While I like how we're explicitly reporting zeros for empty buckets for every topic, I'm concerned it might generate too many metrics and slow down our queries. Perhaps we should only report non-zero values? I'm not sure if that would be easy, or even possible, but something to consider. Perhaps we can filter out zeros in jmxtrans instead?
@pdavidson100 By non-zero values, do you mean we shouldn't report records in the 0m bucket? Since these metrics are only reported every second, I don't think they will generate that many metrics.
I mean that we report metrics with a count of zero in most buckets because latency is usually < 10 mins - that adds up to a lot of useless data sent to Argus (one metric per minute per topic per bucket per worker). Wondering if we can/should reduce that by not sending metrics for empty buckets.
My understanding is that we only record entries whose lag exceeds a bucket's boundary, so we won't report zero values.
We create sensors for each bucket, but some of them may never or seldom be used.
That's also what I see when testing: for example, the metrics graph shows no data for the 12h bucket.
https://github.com/salesforce/mirus/pull/89/files?diff=unified&w=1#diff-0c6c96fc358c3316aa5312ae1ce3c23580d68722f1db56874ac4366994bedd1eR156-R157
Strange - I made this comment after noticing a lot of zeros in Argus for the PRD test cluster.
Instead of populating all of the buckets that are smaller than the reported latency, what do you think of the following approach?
- store the bucketSensors in a SortedMap, in descending order
- loop through them like we do here
- break the loop right after we have recorded the entry once
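A minimal standalone sketch of that alternative, with a LongAdder standing in for the PR's per-bucket Sensor objects (class and method names here are illustrative only):

  import java.util.Comparator;
  import java.util.Map;
  import java.util.NavigableMap;
  import java.util.TreeMap;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.atomic.LongAdder;

  public class DescendingBucketSketch {
    // Buckets keyed by threshold millis, iterated from largest threshold to smallest.
    private final NavigableMap<Long, LongAdder> buckets =
        new TreeMap<>(Comparator.reverseOrder());

    public DescendingBucketSketch() {
      for (long minutes : new long[] {0, 5, 10, 30, 60}) {
        buckets.put(TimeUnit.MINUTES.toMillis(minutes), new LongAdder());
      }
    }

    public void recordLatency(long millis) {
      for (Map.Entry<Long, LongAdder> entry : buckets.entrySet()) {
        if (millis >= entry.getKey()) {
          entry.getValue().increment(); // record in the largest matching bucket only
          break;                        // stop after the first match
        }
      }
    }
  }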
(unless we need this to be a cumulative histogram)
After chatting with Paul I realized that there are advantages in using a cumulative histogram, specifically that adding new buckets in the future won't affect existing queries, so I see now why the current approach is preferable.
                    MirrorJmxReporter.HISTOGRAM_LATENCY.description(),
                    tags))
            .metricValue();
    Assert.assertNotNull(value);
What do you think of also checking that there is exactly one occurrence per bucket?
Regardless of whether we stick with the cumulative histogram implementation, shall we also check that the other buckets are empty?
    TopicPartition topicPartition = new TopicPartition(TEST_TOPIC, 1);
    mirrorJmxReporter.addTopics(List.of(topicPartition));

    mirrorJmxReporter.recordMirrorLatency(TEST_TOPIC, 500);
Have you considered adding a comment here, to declare in which buckets these entries are expected to land?
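For instance (hypothetical wording for such a comment; 500 ms exceeds only the 0m threshold, so with the cumulative histogram only the "0m" bucket should count this record):

  // 500 ms exceeds only the 0m threshold, so we expect one record in the
  // "0m" bucket and none in the 5m/10m/30m/60m buckets.
  mirrorJmxReporter.recordMirrorLatency(TEST_TOPIC, 500);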