Garbage Collection HealthCheck never triggers usefully #3291
Internal bug ID PAYARA-3453 scheduled for 191 release.
Hi @mulderbaba, thank you for taking the time to look into this! From your comment I'm not entirely sure if I'm missing something, so I made a table for how I think it goes. run# is the number of times the check has run since application start. I assume the check runs at a fixed interval, so total runtime would be run# times that interval; the table looks at the situation after 10 minutes, 1 hour, and 10 hours of runtime.
@bismuth thanks for the detailed explanation, really appreciate it. Let me go through your table. You are collecting 10 GC counts in a fixed interval, say 5 minutes. The only percentage calculation we could do here is to report the change relative to the previous cumulative count, so this explains the drop in the percentage as the cumulative count keeps increasing. Maybe we could do a dynamic calculation, but that won't make any sense since we print the percentages to the logs, and what gets printed is printed, no turning back, right? A sample, for example:
1st:
2nd:
3rd:
4th:
5th:
So having a maximum (call it a spike) at the end like this would reduce the previous counts. I don't like storing data like this in a server, which might be up and running for a good deal of time, so that's another concern. One other approach could be having count values defined for thresholds (or maybe execution times), so the user can define their own threshold values: 10 counts is good, 50 counts is ok, 100 counts is really bad. This would break some backward compatibility that we have, but I'll think about it before changing the implementation as well.
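As a rough illustration of that count-threshold idea (not Payara code; the class, field names, and threshold values below are made up for the sketch), such a check could compare the per-interval collection count against fixed thresholds using the standard `GarbageCollectorMXBean` API:

```java
// Hypothetical sketch of a count-threshold GC check; thresholds and names are illustrative.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class CountThresholdGcCheck {

    // User-configurable thresholds: collections per check interval.
    private final long warningThreshold = 50;
    private final long criticalThreshold = 100;

    private long lastCollectionCount = -1;

    public String check() {
        long total = 0;
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gcBean.getCollectionCount();
        }
        if (lastCollectionCount < 0) {
            lastCollectionCount = total; // first run: nothing to compare against yet
            return "GOOD";
        }
        long diffCount = total - lastCollectionCount;
        lastCollectionCount = total;
        if (diffCount >= criticalThreshold) {
            return "CRITICAL";
        }
        return diffCount >= warningThreshold ? "WARNING" : "GOOD";
    }
}
```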
@mulderbaba I don't have a good intuition for a reasonable number for gccount over a given interval, and I think the preference for this type of check would be to have a default that works reasonably well out of the box for most cases. I have two suggestions:
If you want, I can make a pull request out of my first suggestion.
@bismuth that was what I was thinking of using. My adaptive check seems more reasonable, but as I mentioned I don't like storing time-series data for a calculation like this. I may use an expirable list with a decent TTL value set; only then would that make sense.
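For illustration, a minimal sketch of that expirable-list idea, assuming only a bounded time window of samples is kept (the class name, TTL value, and structure are hypothetical, not the actual implementation):

```java
// Hypothetical "expirable list" of GC samples: entries older than a TTL are evicted,
// so memory use stays bounded no matter how long the server runs.
import java.util.ArrayDeque;
import java.util.Deque;

public class ExpiringGcSamples {

    private static final long TTL_MILLIS = 10 * 60 * 1000L; // keep roughly 10 minutes of samples

    private static final class Sample {
        final long timestamp;
        final long cumulativeGcCount;

        Sample(long timestamp, long cumulativeGcCount) {
            this.timestamp = timestamp;
            this.cumulativeGcCount = cumulativeGcCount;
        }
    }

    private final Deque<Sample> samples = new ArrayDeque<>();

    /** Record the current cumulative GC count and evict samples older than the TTL. */
    public void add(long cumulativeGcCount) {
        long now = System.currentTimeMillis();
        samples.addLast(new Sample(now, cumulativeGcCount));
        while (!samples.isEmpty() && now - samples.peekFirst().timestamp > TTL_MILLIS) {
            samples.removeFirst();
        }
    }

    /** Collections observed over the retained window (0 if there are fewer than two samples). */
    public long collectionsInWindow() {
        if (samples.size() < 2) {
            return 0;
        }
        return samples.peekLast().cumulativeGcCount - samples.peekFirst().cumulativeGcCount;
    }
}
```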
@mulderbaba How about using some library for statistical evaluations like Apache Commons? http://commons.apache.org/proper/commons-math/userguide/stat.html It lets you comfortably control the amount of statistical data collected.
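As a sketch of what that suggestion could look like (the window size and the abnormality rule below are assumptions, not proposed Payara code), commons-math's `DescriptiveStatistics` can be created with a fixed window so the amount of retained data stays bounded:

```java
// Rolling statistics over recent GC-count deltas using Apache Commons Math (commons-math3).
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class RollingGcStats {

    // Only the most recent 20 samples are kept; older values are discarded automatically.
    private final DescriptiveStatistics stats = new DescriptiveStatistics(20);

    public void record(long diffCount) {
        stats.addValue(diffCount);
    }

    /** Flag the latest interval if it is well above the rolling mean of recent intervals. */
    public boolean looksAbnormal(long latestDiffCount) {
        return stats.getN() >= 5 && latestDiffCount > 2 * stats.getMean();
    }
}
```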
@svendiedrichsen I was thinking about it as well but didn't want to introduce another artifact. But it would be the only feasible option to handle the time-series data. Thanks for the pointers.
I don't want to add Apache Commons as an artefact. It's commonly used in deployed applications so it will be problematic.
How can this be problematic if the applications have their own classloader? Shouldn't their version take preference for them if they are deployed with a different one?
@svendiedrichsen OSGi and classloaders working in harmony? :)
@mulderbaba At least HK2 is advertising this here: https://javaee.github.io/hk2/extensibility.html#classloading
GlassFish may have had some issues with classloading which could use some reworking, but generally it should be possible to shield the app from the server.
I just found the same issue while looking at
Will be fixed as part of #4390
Payara/nucleus/payara-modules/healthcheck-core/src/main/java/fish/payara/nucleus/healthcheck/preliminary/GarbageCollectorHealthCheck.java
Line 108 in 70e98fb
Expected outcome
Receive CRITICAL or WARNING or GOOD alerts from the HealthCheck service when Garbage Collection is doing really badly, badly, or fine.
Current outcome
A CRITICAL alert is almost always triggered the second time this specific HealthCheck runs; after that, any message at all (depending on the alert levels) becomes increasingly unlikely the longer the application runs.
Steps to reproduce:
GBGC:Health Check Result:[[status=CRITICAL, message='19 times Young GC (G1 Young Generation) after 5 seconds 187 milliseconds'']']
(just to be complete: you may also see the same messages for the Old Generation GC, but that is it)
Context:
I don't think this check (or the one for old gen GC on line 125) works as a reasonable health check for garbage collection. Take a simple program that quickly reaches some steady state in terms of garbage collection, where the healthcheck interval is much longer than the time to reach that steady state.
The first time the check is triggered, `youngLastCollectionCount` is 0, so nothing gets done. Then `youngLastCollectionCount` gets set to `gcBean.getCollectionCount()`; let's call this value x. Since we've been in steady state for some time already, we can expect to add about x to the number of collections each time the check runs, so `diffCount` will be about `x` on each run of the check (more precisely: `diffCount = n*x - (n-1)*x`, where `n` is the number of times the check has run since application start).

The second time the check runs we get `percentage = ((2x - x)/x) * 100 = 100`. So unless the good, warning and critical thresholds are all set at 100, we trigger a HealthCheck message of some kind.

The third time we get `percentage = ((3x - 2x)/2x) * 100 = (x/2x) * 100 = 50`.

The fourth time we get `percentage = ((4x - 3x)/3x) * 100 = (x/3x) * 100 = 33`.

The nth time we get `percentage = ((n*x - (n-1)*x) / ((n-1)*x)) * 100 = (1/(n-1)) * 100 = 100/(n-1)`, which tends to 0 as n grows.

Programs with a more variable memory usage pattern may trigger a message a bit later during runtime, but eventually the percentage calculation for them will also tend to 0.
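To make the decay concrete, here is a small standalone simulation of the percentage formula described above (the steady rate of 19 collections per interval is only borrowed from the example log line; any constant rate shows the same trend):

```java
// Simulates the existing percentage calculation under a steady rate of x young-GC
// collections per check interval, showing how the reported value decays toward 0.
public class PercentageDecayDemo {
    public static void main(String[] args) {
        long x = 19;            // collections per interval, taken from the example log line
        long previousCount = 0; // plays the role of youngLastCollectionCount
        for (int run = 1; run <= 10; run++) {
            long currentCount = run * x; // cumulative collection count in steady state
            if (previousCount != 0) {
                double percentage =
                        ((double) (currentCount - previousCount) / previousCount) * 100;
                System.out.printf("run %d: percentage = %.1f%n", run, percentage);
            }
            previousCount = currentCount;
        }
        // Prints 100.0, 50.0, 33.3, 25.0, ... -> tends to 0 as the run number grows.
    }
}
```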
I'd be happy to make a pull request to fix this. I assume the calculation is intended to be very lightweight. Are there other considerations?