
Add loadSuccessTime() and loadFailureTime() to CacheStats #409

Closed · john-karp opened this issue Apr 23, 2020 · 7 comments

@john-karp commented Apr 23, 2020

CacheStats tracks loadSuccessCount and loadFailureCount independently, but doesn't do a corresponding breakdown of load time.

I think it would be helpful to know whether the time is being spent on successes vs. failures.

Came up in micrometer-metrics/micrometer#2020

@ben-manes (Owner)

Can you explain the benefits from the perspective of the metrics themselves? I understand it makes metric reporting more consistent, but I do not know how you would operationally react to this data. I suspect that is why they were combined in Guava, where this was never asked for. More often you would be trying to understand why failures occurred at all, rather than the time spent on each.

Note that you can supply your own StatsCounter, which could track these independently.
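
For example, here is a minimal sketch of such a counter against the 2.x interface. The class name and the loadSuccessTime()/loadFailureTime() accessors are made up for illustration (they are what this issue asks for, not existing API), and in 3.x the CacheStats constructor is, I believe, replaced by a static factory:

```java
import java.util.concurrent.atomic.LongAdder;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.stats.CacheStats;
import com.github.benmanes.caffeine.cache.stats.StatsCounter;

// Illustrative only: splits totalLoadTime into per-outcome accumulators.
final class PartitionedStatsCounter implements StatsCounter {
  private final LongAdder hitCount = new LongAdder();
  private final LongAdder missCount = new LongAdder();
  private final LongAdder loadSuccessCount = new LongAdder();
  private final LongAdder loadFailureCount = new LongAdder();
  private final LongAdder loadSuccessTime = new LongAdder(); // nanoseconds
  private final LongAdder loadFailureTime = new LongAdder(); // nanoseconds
  private final LongAdder evictionCount = new LongAdder();
  private final LongAdder evictionWeight = new LongAdder();

  @Override public void recordHits(int count) { hitCount.add(count); }
  @Override public void recordMisses(int count) { missCount.add(count); }

  @Override public void recordLoadSuccess(long loadTime) {
    loadSuccessCount.increment();
    loadSuccessTime.add(loadTime);
  }

  @Override public void recordLoadFailure(long loadTime) {
    loadFailureCount.increment();
    loadFailureTime.add(loadTime);
  }

  @SuppressWarnings("deprecation")
  @Override public void recordEviction() { evictionCount.increment(); }

  @Override public void recordEviction(int weight) {
    evictionCount.increment();
    evictionWeight.add(weight);
  }

  // The breakdown this issue asks for (hypothetical accessors)
  long loadSuccessTime() { return loadSuccessTime.sum(); }
  long loadFailureTime() { return loadFailureTime.sum(); }

  @Override public CacheStats snapshot() {
    return new CacheStats(hitCount.sum(), missCount.sum(),
        loadSuccessCount.sum(), loadFailureCount.sum(),
        loadSuccessTime.sum() + loadFailureTime.sum(),
        evictionCount.sum(), evictionWeight.sum());
  }
}
```

It would be wired in at build time:

```java
PartitionedStatsCounter stats = new PartitionedStatsCounter();
Cache<String, String> cache = Caffeine.newBuilder()
    .recordStats(() -> stats)
    .build();
```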

@jkschneider commented May 18, 2020

@ben-manes One of the big benefits would be unifying both success and failure outcomes under one metric name, something like loads, which could be dimensionally drilled down to the different outcomes.

Many monitoring systems don't support shipping a timer loads{outcome=success} and a counter loads{outcome=failure}. They expect all metrics with the same name to have the same type. So we wind up stuffing what is typically a tag (outcome) into the metric name to differentiate them.

Having unified the metric names, we can plot total attempted loads by simply plotting the count statistic shipped by loads. Monitoring systems by default sum across distinct tags unless you drill down on one of them. This is more intuitive than having to add load.success + load.failure when building a dashboard.

Also, plotting the error ratio as a function of throughput (rather than the error rate, which isn't as useful) is more intuitive. It becomes loads{outcome=failure} / loads.
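
Concretely, with Micrometer that registration might look like this (cache.loads and the outcome tag are just illustrative names, registry an existing MeterRegistry):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

MeterRegistry registry = new SimpleMeterRegistry();

// Same name, same meter type, differing only by the outcome dimension, so a
// backend can sum across tags for total loads or drill into one outcome.
Timer successLoads = Timer.builder("cache.loads")
    .tag("outcome", "success")
    .register(registry);
Timer failureLoads = Timer.builder("cache.loads")
    .tag("outcome", "failure")
    .register(registry);
```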

More...

To really go to the next level, if there are different failure modes under which a load operation can fail, tag with that detail plus a coarse outcome tag. Something like loads{outcome=failure, reason=failureMode1}, loads{outcome=failure, reason=failureMode2}. I don't even know if this is identifiable. But in general, the idea is that different failure modes can have different latency characteristics: eager failures are faster than a successful load, while timeouts are slower.

More more...

The ideal case would be if Caffeine allowed Micrometer to record with a regular Timer. The most useful latency metrics are the max and high percentiles, but that detail is lost in the abstraction right now. The best we can do with a FunctionTimer is provide access to throughput and the average.
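
For reference, this is roughly what polling the aggregate through a FunctionTimer looks like (cache.loads is an illustrative name, registry an existing MeterRegistry):

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.FunctionTimer;

Cache<String, String> cache = Caffeine.newBuilder()
    .recordStats()
    .build();

// Only two scalars survive the CacheStats abstraction: the load count and
// the summed load time. The backend can derive throughput and average from
// these, but the max and percentile detail is already gone by this point.
FunctionTimer.builder("cache.loads", cache,
        c -> c.stats().loadCount(),
        c -> c.stats().totalLoadTime(),
        TimeUnit.NANOSECONDS)
    .register(registry);
```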

@ben-manes (Owner)

For more advanced cases, you could implement your own StatsCounter. See this example using Dropwizard Metrics. That pushes the metrics rather than having to poll the summation from CacheStats; a Micrometer analogue is sketched below.
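
A push-based Micrometer version might look like this sketch (meter names and tags are illustrative, not an established convention). Because each load sample goes into a real Timer, the max/percentile detail you mentioned is retained:

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.stats.CacheStats;
import com.github.benmanes.caffeine.cache.stats.StatsCounter;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Push-style sketch: each sample is forwarded to Micrometer as it occurs,
// rather than polling a summation from CacheStats later.
final class MicrometerStatsCounter implements StatsCounter {
  private final Counter hits, misses, evictions;
  private final Timer successLoads, failureLoads;

  MicrometerStatsCounter(MeterRegistry registry) {
    hits = registry.counter("cache.gets", "result", "hit");
    misses = registry.counter("cache.gets", "result", "miss");
    evictions = registry.counter("cache.evictions");
    successLoads = registry.timer("cache.loads", "outcome", "success");
    failureLoads = registry.timer("cache.loads", "outcome", "failure");
  }

  @Override public void recordHits(int count) { hits.increment(count); }
  @Override public void recordMisses(int count) { misses.increment(count); }

  @Override public void recordLoadSuccess(long loadTime) {
    successLoads.record(loadTime, TimeUnit.NANOSECONDS);
  }

  @Override public void recordLoadFailure(long loadTime) {
    failureLoads.record(loadTime, TimeUnit.NANOSECONDS);
  }

  @SuppressWarnings("deprecation")
  @Override public void recordEviction() { evictions.increment(); }

  // State lives in the meters, so there is nothing meaningful to snapshot.
  @Override public CacheStats snapshot() { return CacheStats.empty(); }
}
```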

The cache only knows of a failure when either an exception is thrown or the expected entry is absent. Both cases are reported identically and, most often, represent unexpected failures that cause application errors. The fine-grained causes would have to be instrumented in application code.

Given that the failures we can report are typically an application bug to be fixed, I'm still unsure of the operational value. Yes, for symmetry it is a reasonable ask, but I don't see how this metric would help developers.

It would seem that this particular ask is not very valuable, but that supplying a custom StatsCounter could provide a richer experience. I am open to making additions, but would like to understand whether there are more concrete benefits.

@ben-manes (Owner)

I am working on v3, where we could make this change. Unfortunately my concerns were not addressed in your reply, so I am still inclined not to provide this. I do not see how it would assist operational teams (performance, SREs), and therefore I lean against tracking metrics that don't carry their weight.

@john-karp (Author)

FYI, I implemented a StatsCounter for Micrometer: micrometer-metrics/micrometer#2163

So Micrometer users who want the full breakdown will be able to get it that way.
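
Usage should be along these lines once it ships. This is a sketch based on my reading of the PR; the exact constructor shape of CaffeineStatsCounter may differ in the released version:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.cache.CaffeineStatsCounter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

MeterRegistry registry = new SimpleMeterRegistry();

// Bind the counter at build time so every load sample is recorded with
// per-outcome detail. Constructor shape per the PR; may change on merge.
CaffeineStatsCounter stats = new CaffeineStatsCounter(registry, "books");
Cache<String, String> cache = Caffeine.newBuilder()
    .recordStats(() -> stats)
    .build();
```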

@ben-manes (Owner)

Thanks, I think that's a nice approach. Closing.

@jkschneider

@john-karp The StatsCounter has been merged in Micrometer. Thanks!
