Garbage Collection HealthCheck never triggers usefully #3291

bismuth · 2018-10-18T21:21:27Z

Payara/nucleus/payara-modules/healthcheck-core/src/main/java/fish/payara/nucleus/healthcheck/preliminary/GarbageCollectorHealthCheck.java

Line 108 in 70e98fb

percentage = ((diffCount) / (youngLastCollectionCount)) * 100;

Expected outcome

Receive CRITICAL or WARNING or GOOD alerts from the HealthCheck service when Garbage Collection is doing really badly, badly, or fine.

Current outcome

A CRITICAL alert is almost always triggered the second time this specific HealthCheck runs, after that any message at all (depending on the alert levels) becomes increasingly unlikely the longer the application runs.

Steps to reproduce:

Enable HealthCheck, enable the GarbageCollection HealthCheck, set the good/warning/critical levels to 30/40/95, set the check interval to 5 minutes (this is what I tested, shorter intervals may not reliably trigger the alert, longer intervals just make you wait longer)
Start payara, deploy something simple (but no so trivial as to never trigger a garbage collection)
Wait 5 minutes (or whichever interval you configured)
See a HealthCheck message in your log that looks like this: GBGC:Health Check Result:[[status=CRITICAL, message='19 times Young GC (G1 Young Generation) after 5 seconds 187 milliseconds'']']
Optional: wait 5 more minutes, see a similar warning with "CRITICAL" replaced by WARNING
Optional: wait 5 more minutes, see a similar warning with "WARNING" replaced by GOOD
Optional: wait longer, never see a similar warning again.

(just to be complete: you may also see the same messages for the Old Generation GC, but that is it)

Context:

I don't think this check (or the one for old gen gc on line 125) works as a reasonable health check for garbage collection. If we take a simple program that quickly reaches some steady state in terms of garbage collection, where the healthcheck-interval is much longer than the time to reach that steady state.

The first time the check is triggered youngLastCollectionCount is 0, so nothing gets done. Then youngLastCollectionCount gets set to gcBean.getCollectionCount() lets call this value x, and since we've been in steady state for some time already we can expect to add about x to the number of collections each time the check runs, so diffCount will be about x on each run of the check (more precisely: diffCount = n*x -(n-1)*x where n is the number of times the check has run since application start).

The second time the check runs we get: percentage = ((2x - x)/x) *100 = 100. So unless the good, warning and critical thresholds are all set at 100 we trigger a HealthCheck message of some kind.

The third time we get: percentage = ((3x-2x)/2x) *100 = (x/2x)*100 = 50.

The fourth time we get: percentage = ((4x-3x)/3x) *100 = (x/3x)*100 = 33.

The nth time we get: percentage = ((n*x - (n-1)*x) / ((n-1) * x)) * 100 = (1/(n-1))*100 which approaches 100/n or 0

Programs with a more variable memory usage pattern may trigger a message a bit later during runtime, but eventually the percentage calculation for them will also tend to 0

I'd be happy to make a pull request to fix this. I assume the calculation is intended to be very lightweight. Are there other considerations?

The text was updated successfully, but these errors were encountered:

smillidge · 2019-01-29T22:01:47Z

Internal bug ID PAYARA-3453 scheduled for 191 release.

mulderbaba · 2019-03-28T07:44:09Z

@bismuth collectionCount and collectionTime should be the cumulative values since the start of JVM so gcBean.getCollectionCount() should be returning a cumulative value of x instead of just an x. Please elaborate your sample if I'm getting it wrong. If you have in a steady state and have say 9 gc counts collected, then the diffCount will always be 0 hence the percentage will be 0 as well.

bismuth · 2019-03-28T08:31:34Z

Hi @mulderbaba, thank you for taking the time to look into this!
We get an x instead of the cumulative value in the calculation of percentage because we subtract the previous cumulative value (youngLastCollectionCount) from the current value of gcBean.getCollectionCount() in line 104 of GarbageCollectionCheck.java. If the check runs -say- every minute and we are in a steady state of 10 collections per minute right from the start:

After 10 minutes of runtime. youngLastCollectionCount will be about 100 (10 minutes * 10 gc's per minute). If we stay in the steady state then percentage for the next check will come to 10/100=10%.

After 1 hour of runtime. youngLastCollectionCount will be about 600 (1 hour * 60 minute per hour * 10 gc's per minute). If we stay in the steady state then percentage for the next check will come to 10/600=1.66%.

After 10 hours of runtime. youngLastCollectionCount will be about 6000 (10 hours * 60 minute per hour * 10 gc's per minute). If we stay in the steady state then percentage for the next check will be 10/6000=0.166%.

Conversely, to get percentage to 100% you'd need just 100 collections in the last minute after 10 minutes of runtime, you'd need 600 collections in a minute after one hour, and 6000 after 10 hours.

In any realistic situation percentage would never get about 0% after and reasonable runtime no matter how severe garbage collection counts rises.

From your comment I'm not entirely sure if I'm missing something, so I made a table for how I think it goes. run# is the number of times the check has run since application start. I assume the check runs at a fixed interval, so total runtime would be run# * interval:

run#	`getCollectionCount()`	`youngLastCollectionCount`	`diffCount`	`percentage`
1	10	0	10	not calculated
2	20	10	10	100%
3	30	20	10	50%
4	40	30	10	33%
5	50	40	10	25%
6	60	50	10	20%
...
10	100	90	10	10%
...
60	600	590	10	1.69%
...
100	1000	990	10	1.01%
...
600	6000	5990	10	0.17%
...
1000	10000	9990	10	0.01%

mulderbaba · 2019-03-28T19:46:41Z

@bismuth thanks for the detailed explanation, really appreciate it.
I think calculating percentages on a counted value (which starts from 0) won't make sense how hard we try to calculate :) I am open to suggestions so all comments are welcome.

Let me go through your table. You are collecting 10 GC counts in a fixed interval, say 5 minutes. The only percentage calculation we could do here is to give the change in percentage comparing the change to the previous one. So this explains the drop in the percentage since cumulative count gets an increase.

Maybe we can do a dynamic calculation but that won't make any sense since we print out the percentages to the logs and what gets printed is printed, no turning back right? A sample for example:

1st:

`run#`	`getCollectionCount()`	`youngLastCollectionCount`	`diffCount`	`percentage`
1	5	0	5	%100

2nd:

`run#`	`getCollectionCount()`	`youngLastCollectionCount`	`diffCount`	`percentage`
1	5	0	5	%100
2	10	5	5	%100

3rd:

`run#`	`getCollectionCount()`	`youngLastCollectionCount`	`diffCount`	`percentage`
1	5	0	5	%33
2	10	5	5	%33
3	25	10	15	%100

4rd:

`run#`	`getCollectionCount()`	`youngLastCollectionCount`	`diffCount`	`percentage`
1	5	0	5	%20
2	10	5	5	%20
3	25	10	15	%60
4	50	25	25	%100

5th:

`run#`	`getCollectionCount()`	`youngLastCollectionCount`	`diffCount`	`percentage`
1	5	0	5	%20
2	10	5	5	%20
3	25	10	15	%60
4	50	25	25	%100
5	55	50	5	%20

So having a maximum (or call it a spike) at the end like this would reduce the previous counts. I don't like storing data like this in a server, which might be up and running for a good deal of time. So that's another concern.

One other approach could be having count values defined for thresholds (or maybe execution times). So user can define his own threshold values like 10 counts is good, 50 counts is ok, 100 count is real bad. This would break some backward compatibility that we have but I'll think about it before changing the implementation as well.

bismuth · 2019-03-29T11:24:25Z

@mulderbaba I don't have a good intuition for a reasonable number for gccount over a given interval, and I think the preference for this type of check would be to have a default that works reasonably well out of the box for most cases. I have two suggestions:

A check similar to the current check but using getCollectionTime(). This is also a cummulative measure, so you take the change over the last interval and check on that. I assume that in the absolute worst case all the time since the last run of this check would have been filled, so (0 <= gcTime <= now-lastCheckTime) (new member long variable of the class). If the returned time is something else (i.e. time per core or something) this doesn't work, but otherwise this would be about as usefull as the other HealthChecks provided:

long curCollectionTime = gcBean.getCollectionTime();
//time spent collecting garbage since the last check.
long diffTime = curCollectionTime - youngLastCollectionTime;
long now = System.currentTimeMillis();
//time elapsed since the last check.
long elapsedTime = now - youngLastCheckTime;

// on first run youngLastCheckTime == 0 and there is no usefull check. 
if (elapsedTime > 0 && lastCheckTime > 0) {
    // % of last period spent collecting garbage. I'm not sure what reasonable GOOD/WARNING/CRITICAL thresholds would be.
    percentage = ((diffTime) / (elapsedTime)) * 100;

    result.add(new HealthCheckResultEntry(decideOnStatusWithRatio(percentage),
                            prettyPrintDuration(diffTime) + " spent doing Young GC (" + gcBean.getName()  + 
                            ") during: " + prettyPrintDuration(elapsedTime)));
}

youngLastCollectionTime = curCollectionTime;
youngLastCheckTime = now;

Something more complex and adaptive using the collection counts. I can think of a few options but without knowledge of how the number of gcCounts varies with time for different use-cases I don't know what would be reasonable in general or even if something generally applicable is possible.

If you want I can make a pull request out-of my first suggestion.

mulderbaba · 2019-03-29T17:08:28Z

@bismuth that was what I was thinking, using time instead of the count in order to make a decent calculation as the check. Your calculation seems legit but if we set longer intervals like 5 minutes as you stated in your example, that will mean 300.000ms as elapsedTime. And if we ratio the last GC collection time (which would be between 100ms-150ms) to 300k ms that would probably result in something like %0,03. And that is for just 5 minutes, I am sure that there are others around doing checks on longer intervals than this.

My adaptive check seems more reasonable but as I mentioned I don't like storing time series data for a calculation like this, I may use an expirable list with a decent TTL value set, and that would only make sense then.

svendiedrichsen · 2019-03-29T17:30:07Z

@mulderbaba How about using some library for statistical evaluations like Apache Commons?

http://commons.apache.org/proper/commons-math/userguide/stat.html

It lets you comfortably control the amount statistical data collected.

mulderbaba · 2019-03-30T10:02:53Z

@svendiedrichsen I was thinking about it as well but didn't want to introduce another artifact. But it would be the only feasible option to handle the time-series data. thanks for the pointers.

smillidge · 2019-03-30T17:31:55Z

I don't want to add Apache commons as an artefact. It's commonly used in deployed applications so will be problematic.

svendiedrichsen · 2019-03-31T14:40:35Z

How can this be problematic if the applications have their own classloader? Shouldn't their version have preference for them if they are deployed with a different one?

mulderbaba · 2019-04-07T12:31:56Z

@svendiedrichsen OSGI and classloaders working in harmony? :)

svendiedrichsen · 2019-04-08T09:01:22Z

@mulderbaba At least HK2 is advertising this here: https://javaee.github.io/hk2/extensibility.html#classloading

... The mind boggles at all the ways HK2Loader can be implemented.

Glassfish maybe has had some issues with classloading which could use some reworking. But generally it should be possible to shield the app from the server.

jbee · 2019-12-13T13:39:14Z

I just found the same issue white looking at GarbageCollectorHealthCheck - I am going to change it as part of APPSERVER-11 so it is the relative time within the sliding window time-frame it was busy doing GC. So 10sec of GC in a 100sec window is 10%.

jbee · 2019-12-17T15:36:49Z

Will be fixed as part of #4390

mulderbaba self-assigned this Oct 20, 2018

rdebusscher added the Status: Accepted Confirmed defect or accepted improvement to implement, issue has been escalated to Platform Dev label Dec 17, 2019

jbee mentioned this issue Dec 17, 2019

APPSERV-11 Adds Health Check Alerts to Monitoring Console #4390

Merged

rdebusscher assigned jbee and unassigned mulderbaba Jan 3, 2020

rdebusscher added 3:DevInProgress and removed Status: Accepted Confirmed defect or accepted improvement to implement, issue has been escalated to Platform Dev labels Jan 3, 2020

jbee closed this as completed in #4390 Jan 17, 2020

jbee removed the 3:DevInProgress label Jan 23, 2020

jbee added this to the 5.201 milestone Jan 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Garbage Collection HealthCheck never triggers usefully #3291

Garbage Collection HealthCheck never triggers usefully #3291

bismuth commented Oct 18, 2018 •

edited

Loading

smillidge commented Jan 29, 2019

mulderbaba commented Mar 28, 2019

bismuth commented Mar 28, 2019

mulderbaba commented Mar 28, 2019 •

edited

Loading

bismuth commented Mar 29, 2019

mulderbaba commented Mar 29, 2019 •

edited

Loading

svendiedrichsen commented Mar 29, 2019 •

edited

Loading

mulderbaba commented Mar 30, 2019

smillidge commented Mar 30, 2019

svendiedrichsen commented Mar 31, 2019

mulderbaba commented Apr 7, 2019

svendiedrichsen commented Apr 8, 2019 •

edited

Loading

jbee commented Dec 13, 2019

jbee commented Dec 17, 2019

Garbage Collection HealthCheck never triggers usefully #3291

Garbage Collection HealthCheck never triggers usefully #3291

Comments

bismuth commented Oct 18, 2018 • edited Loading

Expected outcome

Current outcome

Steps to reproduce:

Context:

smillidge commented Jan 29, 2019

mulderbaba commented Mar 28, 2019

bismuth commented Mar 28, 2019

mulderbaba commented Mar 28, 2019 • edited Loading

bismuth commented Mar 29, 2019

mulderbaba commented Mar 29, 2019 • edited Loading

svendiedrichsen commented Mar 29, 2019 • edited Loading

mulderbaba commented Mar 30, 2019

smillidge commented Mar 30, 2019

svendiedrichsen commented Mar 31, 2019

mulderbaba commented Apr 7, 2019

svendiedrichsen commented Apr 8, 2019 • edited Loading

jbee commented Dec 13, 2019

jbee commented Dec 17, 2019

bismuth commented Oct 18, 2018 •

edited

Loading

mulderbaba commented Mar 28, 2019 •

edited

Loading

mulderbaba commented Mar 29, 2019 •

edited

Loading

svendiedrichsen commented Mar 29, 2019 •

edited

Loading

svendiedrichsen commented Apr 8, 2019 •

edited

Loading