
Add loadSuccessTime() and loadFailureTime() to CacheStats #409

Closed · john-karp opened this issue Apr 23, 2020 · 7 comments

@john-karp commented Apr 23, 2020

CacheStats tracks loadSuccessCount and loadFailureCount independently, but doesn't do a corresponding breakdown of load time.

I think it would be helpful to know whether the time is being spent on successes vs. failures.

Came up in micrometer-metrics/micrometer#2020

@ben-manes (Owner)

Can you explain the benefits from the perspective of the metrics themselves? I understand it makes metric reporting more consistent, but I do not know how you would operationally react to this data. I suspect that is why they were combined in Guava, where this was never asked for. More often you would be trying to understand why failures occurred at all, rather than the time spent on each.

Note that you can supply your own StatsCounter, which could track these independently.
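
For example, here is a minimal sketch of such a counter against the 2.x interface. The class name and the loadSuccessTime()/loadFailureTime() accessors are made up for illustration (they are what this issue asks for, not existing API), and in 3.x the CacheStats constructor is, I believe, replaced by a static factory:

```java
import java.util.concurrent.atomic.LongAdder;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.stats.CacheStats;
import com.github.benmanes.caffeine.cache.stats.StatsCounter;

// Illustrative only: splits totalLoadTime into per-outcome accumulators.
final class PartitionedStatsCounter implements StatsCounter {
  private final LongAdder hitCount = new LongAdder();
  private final LongAdder missCount = new LongAdder();
  private final LongAdder loadSuccessCount = new LongAdder();
  private final LongAdder loadFailureCount = new LongAdder();
  private final LongAdder loadSuccessTime = new LongAdder(); // nanoseconds
  private final LongAdder loadFailureTime = new LongAdder(); // nanoseconds
  private final LongAdder evictionCount = new LongAdder();
  private final LongAdder evictionWeight = new LongAdder();

  @Override public void recordHits(int count) { hitCount.add(count); }
  @Override public void recordMisses(int count) { missCount.add(count); }

  @Override public void recordLoadSuccess(long loadTime) {
    loadSuccessCount.increment();
    loadSuccessTime.add(loadTime);
  }

  @Override public void recordLoadFailure(long loadTime) {
    loadFailureCount.increment();
    loadFailureTime.add(loadTime);
  }

  @SuppressWarnings("deprecation")
  @Override public void recordEviction() { evictionCount.increment(); }

  @Override public void recordEviction(int weight) {
    evictionCount.increment();
    evictionWeight.add(weight);
  }

  // The breakdown this issue asks for (hypothetical accessors)
  long loadSuccessTime() { return loadSuccessTime.sum(); }
  long loadFailureTime() { return loadFailureTime.sum(); }

  @Override public CacheStats snapshot() {
    return new CacheStats(hitCount.sum(), missCount.sum(),
        loadSuccessCount.sum(), loadFailureCount.sum(),
        loadSuccessTime.sum() + loadFailureTime.sum(),
        evictionCount.sum(), evictionWeight.sum());
  }
}
```

It would be wired in at build time:

```java
PartitionedStatsCounter stats = new PartitionedStatsCounter();
Cache<String, String> cache = Caffeine.newBuilder()
    .recordStats(() -> stats)
    .build();
```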

@jkschneider commented May 18, 2020

@ben-manes One of the big benefits would be unifying both success and failure outcomes under one metric name, something like loads, which could be dimensionally drilled down to the different outcomes.

Many monitoring systems don't support shipping a timer loads{outcome=success} and a counter loads{outcome=failure}. They expect all metrics with the same name to have the same type. So we wind up stuffing what is typically a tag (outcome) into the metric name to differentiate them.

Having unified the metric names, we can plot total attempted loads by simply plotting the count statistic shipped by loads. Monitoring systems by default sum across distinct tags unless you drill down on one of them. This is more intuitive than having to add load.success + load.failure when building a dashboard.

Also, plotting the error ratio as a function of throughput (rather than the error rate, which isn't as useful) is more intuitive. It becomes loads{outcome=failure} / loads.
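
Concretely, with Micrometer that registration might look like this (cache.loads and the outcome tag are just illustrative names, registry an existing MeterRegistry):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

MeterRegistry registry = new SimpleMeterRegistry();

// Same name, same meter type, differing only by the outcome dimension, so a
// backend can sum across tags for total loads or drill into one outcome.
Timer successLoads = Timer.builder("cache.loads")
    .tag("outcome", "success")
    .register(registry);
Timer failureLoads = Timer.builder("cache.loads")
    .tag("outcome", "failure")
    .register(registry);
```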

More...

To really go to the next level, if there are different failure modes under which a load operation can fail, tag with that detail plus a coarse outcome tag. Something like loads{outcome=failure, reason=failureMode1}, loads{outcome=failure, reason=failureMode2}. I don't even know if this is identifiable. But in general, the idea is that different failure modes can have different latency characteristics: eager failures are faster than a successful load, while timeouts are slower.

More more...

The ideal case would be if Caffeine allowed Micrometer to record with a regular Timer. The most useful latency metrics are the max and high percentiles, but that detail is lost in the abstraction right now. The best we can do with a FunctionTimer is provide access to throughput and the average.
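
For reference, this is roughly what polling the aggregate through a FunctionTimer looks like (cache.loads is an illustrative name, registry an existing MeterRegistry):

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.FunctionTimer;

Cache<String, String> cache = Caffeine.newBuilder()
    .recordStats()
    .build();

// Only two scalars survive the CacheStats abstraction: the load count and
// the summed load time. The backend can derive throughput and average from
// these, but the max and percentile detail is already gone by this point.
FunctionTimer.builder("cache.loads", cache,
        c -> c.stats().loadCount(),
        c -> c.stats().totalLoadTime(),
        TimeUnit.NANOSECONDS)
    .register(registry);
```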

@ben-manes (Owner)

For more advanced cases, you could implement your own StatsCounter. See this example using Dropwizard Metrics. That pushes the metrics rather than having to poll the summation from CacheStats; a Micrometer analogue is sketched below.
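
A push-based Micrometer version might look like this sketch (meter names and tags are illustrative, not an established convention). Because each load sample goes into a real Timer, the max/percentile detail you mentioned is retained:

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.stats.CacheStats;
import com.github.benmanes.caffeine.cache.stats.StatsCounter;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Push-style sketch: each sample is forwarded to Micrometer as it occurs,
// rather than polling a summation from CacheStats later.
final class MicrometerStatsCounter implements StatsCounter {
  private final Counter hits, misses, evictions;
  private final Timer successLoads, failureLoads;

  MicrometerStatsCounter(MeterRegistry registry) {
    hits = registry.counter("cache.gets", "result", "hit");
    misses = registry.counter("cache.gets", "result", "miss");
    evictions = registry.counter("cache.evictions");
    successLoads = registry.timer("cache.loads", "outcome", "success");
    failureLoads = registry.timer("cache.loads", "outcome", "failure");
  }

  @Override public void recordHits(int count) { hits.increment(count); }
  @Override public void recordMisses(int count) { misses.increment(count); }

  @Override public void recordLoadSuccess(long loadTime) {
    successLoads.record(loadTime, TimeUnit.NANOSECONDS);
  }

  @Override public void recordLoadFailure(long loadTime) {
    failureLoads.record(loadTime, TimeUnit.NANOSECONDS);
  }

  @SuppressWarnings("deprecation")
  @Override public void recordEviction() { evictions.increment(); }

  // State lives in the meters, so there is nothing meaningful to snapshot.
  @Override public CacheStats snapshot() { return CacheStats.empty(); }
}
```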

The cache only knows of a failure when either an exception is thrown or the expected entry is absent. Both cases are reported identically and, most often, represent unexpected failures that cause application errors. The fine-grained causes would have to be instrumented in application code.

Given that the failures we can report are typically an application bug to be fixed, I'm still unsure of the operational value. Yes, for symmetry it is a reasonable ask, but I don't see how this metric would help developers.

It would seem that this particular ask is not very valuable, but that supplying a custom StatsCounter could provide a richer experience. I am open to making additions, but would like to understand whether there are more concrete benefits.

@ben-manes (Owner)

I am working on v3, where we could make this change. Unfortunately my concerns were not addressed in your reply, so I am still inclined not to provide this. I do not see how it would assist operational teams (performance, SREs), and therefore I lean against tracking metrics that don't carry their weight.

@john-karp (Author)

FYI, I implemented a StatsCounter for Micrometer: micrometer-metrics/micrometer#2163

So Micrometer users who want the full breakdown will be able to get it that way.
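
Usage should be along these lines once it ships. This is a sketch based on my reading of the PR; the exact constructor shape of CaffeineStatsCounter may differ in the released version:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.cache.CaffeineStatsCounter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

MeterRegistry registry = new SimpleMeterRegistry();

// Bind the counter at build time so every load sample is recorded with
// per-outcome detail. Constructor shape per the PR; may change on merge.
CaffeineStatsCounter stats = new CaffeineStatsCounter(registry, "books");
Cache<String, String> cache = Caffeine.newBuilder()
    .recordStats(() -> stats)
    .build();
```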

@ben-manes (Owner)

Thanks, I think that's a nice approach. Closing.

@jkschneider

@john-karp The StatsCounter has been merged in Micrometer. Thanks!
