Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage Collection HealthCheck never triggers usefully #3291

Closed
bismuth opened this issue Oct 18, 2018 · 14 comments · Fixed by #4390
Closed

Garbage Collection HealthCheck never triggers usefully #3291

bismuth opened this issue Oct 18, 2018 · 14 comments · Fixed by #4390
Assignees
Milestone

Comments

@bismuth
Copy link

bismuth commented Oct 18, 2018

Expected outcome

Receive CRITICAL or WARNING or GOOD alerts from the HealthCheck service when Garbage Collection is doing really badly, badly, or fine.

Current outcome

A CRITICAL alert is almost always triggered the second time this specific HealthCheck runs, after that any message at all (depending on the alert levels) becomes increasingly unlikely the longer the application runs.

Steps to reproduce:

  1. Enable HealthCheck, enable the GarbageCollection HealthCheck, set the good/warning/critical levels to 30/40/95, set the check interval to 5 minutes (this is what I tested, shorter intervals may not reliably trigger the alert, longer intervals just make you wait longer)
  2. Start payara, deploy something simple (but no so trivial as to never trigger a garbage collection)
  3. Wait 5 minutes (or whichever interval you configured)
  4. See a HealthCheck message in your log that looks like this: GBGC:Health Check Result:[[status=CRITICAL, message='19 times Young GC (G1 Young Generation) after 5 seconds 187 milliseconds'']']
  5. Optional: wait 5 more minutes, see a similar warning with "CRITICAL" replaced by WARNING
  6. Optional: wait 5 more minutes, see a similar warning with "WARNING" replaced by GOOD
  7. Optional: wait longer, never see a similar warning again.

(just to be complete: you may also see the same messages for the Old Generation GC, but that is it)

Context:

I don't think this check (or the one for old gen gc on line 125) works as a reasonable health check for garbage collection. If we take a simple program that quickly reaches some steady state in terms of garbage collection, where the healthcheck-interval is much longer than the time to reach that steady state.

The first time the check is triggered youngLastCollectionCount is 0, so nothing gets done. Then youngLastCollectionCount gets set to gcBean.getCollectionCount() lets call this value x, and since we've been in steady state for some time already we can expect to add about x to the number of collections each time the check runs, so diffCount will be about x on each run of the check (more precisely: diffCount = n*x -(n-1)*x where n is the number of times the check has run since application start).

The second time the check runs we get: percentage = ((2x - x)/x) *100 = 100. So unless the good, warning and critical thresholds are all set at 100 we trigger a HealthCheck message of some kind.

The third time we get: percentage = ((3x-2x)/2x) *100 = (x/2x)*100 = 50.

The fourth time we get: percentage = ((4x-3x)/3x) *100 = (x/3x)*100 = 33.

The nth time we get: percentage = ((n*x - (n-1)*x) / ((n-1) * x)) * 100 = (1/(n-1))*100 which approaches 100/n or 0

Programs with a more variable memory usage pattern may trigger a message a bit later during runtime, but eventually the percentage calculation for them will also tend to 0

I'd be happy to make a pull request to fix this. I assume the calculation is intended to be very lightweight. Are there other considerations?

@mulderbaba mulderbaba self-assigned this Oct 20, 2018
@smillidge
Copy link
Contributor

Internal bug ID PAYARA-3453 scheduled for 191 release.

@mulderbaba
Copy link
Contributor

@bismuth collectionCount and collectionTime should be the cumulative values since the start of JVM so gcBean.getCollectionCount() should be returning a cumulative value of x instead of just an x. Please elaborate your sample if I'm getting it wrong. If you have in a steady state and have say 9 gc counts collected, then the diffCount will always be 0 hence the percentage will be 0 as well.

@bismuth
Copy link
Author

bismuth commented Mar 28, 2019

Hi @mulderbaba, thank you for taking the time to look into this!
We get an x instead of the cumulative value in the calculation of percentage because we subtract the previous cumulative value (youngLastCollectionCount) from the current value of gcBean.getCollectionCount() in line 104 of GarbageCollectionCheck.java. If the check runs -say- every minute and we are in a steady state of 10 collections per minute right from the start:

After 10 minutes of runtime. youngLastCollectionCount will be about 100 (10 minutes * 10 gc's per minute). If we stay in the steady state then percentage for the next check will come to 10/100=10%.

After 1 hour of runtime. youngLastCollectionCount will be about 600 (1 hour * 60 minute per hour * 10 gc's per minute). If we stay in the steady state then percentage for the next check will come to 10/600=1.66%.

After 10 hours of runtime. youngLastCollectionCount will be about 6000 (10 hours * 60 minute per hour * 10 gc's per minute). If we stay in the steady state then percentage for the next check will be 10/6000=0.166%.

Conversely, to get percentage to 100% you'd need just 100 collections in the last minute after 10 minutes of runtime, you'd need 600 collections in a minute after one hour, and 6000 after 10 hours.

In any realistic situation percentage would never get about 0% after and reasonable runtime no matter how severe garbage collection counts rises.

From your comment I'm not entirely sure if I'm missing something, so I made a table for how I think it goes. run# is the number of times the check has run since application start. I assume the check runs at a fixed interval, so total runtime would be run# * interval:

run# getCollectionCount() youngLastCollectionCount diffCount percentage
1 10 0 10 not calculated
2 20 10 10 100%
3 30 20 10 50%
4 40 30 10 33%
5 50 40 10 25%
6 60 50 10 20%
...
10 100 90 10 10%
...
60 600 590 10 1.69%
...
100 1000 990 10 1.01%
...
600 6000 5990 10 0.17%
...
1000 10000 9990 10 0.01%

@mulderbaba
Copy link
Contributor

mulderbaba commented Mar 28, 2019

@bismuth thanks for the detailed explanation, really appreciate it.
I think calculating percentages on a counted value (which starts from 0) won't make sense how hard we try to calculate :) I am open to suggestions so all comments are welcome.

Let me go through your table. You are collecting 10 GC counts in a fixed interval, say 5 minutes. The only percentage calculation we could do here is to give the change in percentage comparing the change to the previous one. So this explains the drop in the percentage since cumulative count gets an increase.

Maybe we can do a dynamic calculation but that won't make any sense since we print out the percentages to the logs and what gets printed is printed, no turning back right? A sample for example:

1st:

run# getCollectionCount() youngLastCollectionCount diffCount percentage
1 5 0 5 %100

2nd:

run# getCollectionCount() youngLastCollectionCount diffCount percentage
1 5 0 5 %100
2 10 5 5 %100

3rd:

run# getCollectionCount() youngLastCollectionCount diffCount percentage
1 5 0 5 %33
2 10 5 5 %33
3 25 10 15 %100

4rd:

run# getCollectionCount() youngLastCollectionCount diffCount percentage
1 5 0 5 %20
2 10 5 5 %20
3 25 10 15 %60
4 50 25 25 %100

5th:

run# getCollectionCount() youngLastCollectionCount diffCount percentage
1 5 0 5 %20
2 10 5 5 %20
3 25 10 15 %60
4 50 25 25 %100
5 55 50 5 %20

So having a maximum (or call it a spike) at the end like this would reduce the previous counts. I don't like storing data like this in a server, which might be up and running for a good deal of time. So that's another concern.

One other approach could be having count values defined for thresholds (or maybe execution times). So user can define his own threshold values like 10 counts is good, 50 counts is ok, 100 count is real bad. This would break some backward compatibility that we have but I'll think about it before changing the implementation as well.

@bismuth
Copy link
Author

bismuth commented Mar 29, 2019

@mulderbaba I don't have a good intuition for a reasonable number for gccount over a given interval, and I think the preference for this type of check would be to have a default that works reasonably well out of the box for most cases. I have two suggestions:

  1. A check similar to the current check but using getCollectionTime(). This is also a cummulative measure, so you take the change over the last interval and check on that. I assume that in the absolute worst case all the time since the last run of this check would have been filled, so (0 <= gcTime <= now-lastCheckTime) (new member long variable of the class). If the returned time is something else (i.e. time per core or something) this doesn't work, but otherwise this would be about as usefull as the other HealthChecks provided:
long curCollectionTime = gcBean.getCollectionTime();
//time spent collecting garbage since the last check.
long diffTime = curCollectionTime - youngLastCollectionTime;
long now = System.currentTimeMillis();
//time elapsed since the last check.
long elapsedTime = now - youngLastCheckTime;

// on first run youngLastCheckTime == 0 and there is no usefull check. 
if (elapsedTime > 0 && lastCheckTime > 0) {
    // % of last period spent collecting garbage. I'm not sure what reasonable GOOD/WARNING/CRITICAL thresholds would be.
    percentage = ((diffTime) / (elapsedTime)) * 100;

    result.add(new HealthCheckResultEntry(decideOnStatusWithRatio(percentage),
                            prettyPrintDuration(diffTime) + " spent doing Young GC (" + gcBean.getName()  + 
                            ") during: " + prettyPrintDuration(elapsedTime)));
}

youngLastCollectionTime = curCollectionTime;
youngLastCheckTime = now;
  1. Something more complex and adaptive using the collection counts. I can think of a few options but without knowledge of how the number of gcCounts varies with time for different use-cases I don't know what would be reasonable in general or even if something generally applicable is possible.

If you want I can make a pull request out-of my first suggestion.

@mulderbaba
Copy link
Contributor

mulderbaba commented Mar 29, 2019

@bismuth that was what I was thinking, using time instead of the count in order to make a decent calculation as the check. Your calculation seems legit but if we set longer intervals like 5 minutes as you stated in your example, that will mean 300.000ms as elapsedTime. And if we ratio the last GC collection time (which would be between 100ms-150ms) to 300k ms that would probably result in something like %0,03. And that is for just 5 minutes, I am sure that there are others around doing checks on longer intervals than this.

My adaptive check seems more reasonable but as I mentioned I don't like storing time series data for a calculation like this, I may use an expirable list with a decent TTL value set, and that would only make sense then.

@svendiedrichsen
Copy link
Contributor

svendiedrichsen commented Mar 29, 2019

@mulderbaba How about using some library for statistical evaluations like Apache Commons?

http://commons.apache.org/proper/commons-math/userguide/stat.html

It lets you comfortably control the amount statistical data collected.

@mulderbaba
Copy link
Contributor

@svendiedrichsen I was thinking about it as well but didn't want to introduce another artifact. But it would be the only feasible option to handle the time-series data. thanks for the pointers.

@smillidge
Copy link
Contributor

I don't want to add Apache commons as an artefact. It's commonly used in deployed applications so will be problematic.

@svendiedrichsen
Copy link
Contributor

How can this be problematic if the applications have their own classloader? Shouldn't their version have preference for them if they are deployed with a different one?

@mulderbaba
Copy link
Contributor

@svendiedrichsen OSGI and classloaders working in harmony? :)

@svendiedrichsen
Copy link
Contributor

svendiedrichsen commented Apr 8, 2019

@mulderbaba At least HK2 is advertising this here: https://javaee.github.io/hk2/extensibility.html#classloading

... The mind boggles at all the ways HK2Loader can be implemented.

Glassfish maybe has had some issues with classloading which could use some reworking. But generally it should be possible to shield the app from the server.

@jbee
Copy link
Contributor

jbee commented Dec 13, 2019

I just found the same issue white looking at GarbageCollectorHealthCheck - I am going to change it as part of APPSERVER-11 so it is the relative time within the sliding window time-frame it was busy doing GC. So 10sec of GC in a 100sec window is 10%.

@rdebusscher rdebusscher added the Status: Accepted Confirmed defect or accepted improvement to implement, issue has been escalated to Platform Dev label Dec 17, 2019
@jbee
Copy link
Contributor

jbee commented Dec 17, 2019

Will be fixed as part of #4390

@rdebusscher rdebusscher assigned jbee and unassigned mulderbaba Jan 3, 2020
@rdebusscher rdebusscher added 3:DevInProgress and removed Status: Accepted Confirmed defect or accepted improvement to implement, issue has been escalated to Platform Dev labels Jan 3, 2020
@jbee jbee added this to the 5.201 milestone Jan 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants