Memory leak in CKMSQuantiles.java? #422
Hmm, that's not good. Did you wait at least 20m before taking the first snapshot? Does the growth continue similarly, at about 3 MB per hour?
I waited one hour before taking the first snapshot, and yes, it does seem to grow continuously at a consistent rate. I will double-check this now as I am starting up some new nodes and will give an update tomorrow (it also doesn't appear to be entirely deterministic, so I want to clarify this).
@brian-brazil I am doing another run now; unfortunately it didn't work last night because K8s terminated our pod halfway through. By the way, the leak is more than 3 MB per hour: 3 MB is just the CKMSQuantiles$Item class, but the `int`s are also increasing (since they are contained within the class). In the above screenshot this is roughly 19 MB, for a total of ~22 MB per hour. 22 MB × 24 hours ≈ 528 MB per day, and our pods have 1.5 GB of memory allocated, which roughly coincides with how often they seem to be shutting down due to OOM.
Okay, so last update: we don't think it's due to client_java. There were issues with our JVM pods when it came to GC (within the context of Docker) that were causing the JVM to go OOM. I will close the issue and reopen it if I find out it's actually due to client_java.
@mdedetrich can you describe the issues with your JVM pods? I've been running into the same problem.
See https://dzone.com/articles/why-my-java-application-is-oomkilled for more info |
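The article's core point is that, before container-aware defaults (flags such as `-XX:+UseContainerSupport` and `-XX:MaxRAMPercentage` in later JDKs), the JVM sized its default heap from the host's memory rather than the container's cgroup limit, so the container could be OOM-killed even though the Java heap itself looked healthy. A quick way to check what the JVM thinks it is allowed to use is a minimal diagnostic sketch like the one below (my own illustration, not taken from the article):

```java
public class HeapCheck {
    public static void main(String[] args) {
        // What the JVM will allow the heap to grow to. Inside a container on an
        // older JDK this is derived from the host's RAM, not the pod's memory
        // limit, so it can exceed the limit and get the container OOM-killed.
        long maxHeap = Runtime.getRuntime().maxMemory();
        long committed = Runtime.getRuntime().totalMemory();
        System.out.printf("max heap: %d MiB, currently committed: %d MiB%n",
                maxHeap / (1024 * 1024), committed / (1024 * 1024));
    }
}
```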
CKMSQuantiles is copied from a 2012 implementation, which states that a ‘HACK’ was applied, admitting a space leak. This leak has been noticed several times (#422, #550, #654). By correctly applying the algorithm from the paper, we fix the leak. I have added unit tests to show that the behaviour is correct. I have also added a benchmark in the benchmark module showing the difference between the old and new implementations. According to my benchmarks, in the new implementation a `get` on a quantile that has ‘seen’ 1 million elements is 440 times faster, and inserting 1 million elements is 3.5 times faster. While going through the CKMS paper and the Java implementation, I have added remarks and snippets from the paper to clarify why certain choices are made.

Signed-off-by: Jens <jenskat@gmail.com>

Fix assertion in HistogramTest: median of 1..11 = 6.

Address PR remarks.
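As a rough illustration of the numbers above, a hand-rolled timing sketch might look like the following. This is not the JMH benchmark from the benchmark module; `CKMSQuantiles` is package-private, so the snippet assumes it lives alongside the class in `io.prometheus.client`, and the `insert`/`get` signatures are those of the existing implementation:

```java
package io.prometheus.client;

import java.util.Random;

// Rough timing sketch: insert one million samples, then query the p99.
public class CKMSQuantilesSketch {
    public static void main(String[] args) {
        CKMSQuantiles ckms = new CKMSQuantiles(new CKMSQuantiles.Quantile[]{
                new CKMSQuantiles.Quantile(0.99, 0.001)
        });
        Random rnd = new Random(42); // fixed seed for repeatable runs
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            ckms.insert(rnd.nextDouble());
        }
        long t1 = System.nanoTime();
        double p99 = ckms.get(0.99);
        long t2 = System.nanoTime();
        System.out.printf("insert: %d ms, get: %d us, p99 ~ %.3f%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000, p99);
    }
}
```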
At work we have noticed that one of our applications has started leaking memory. After diagnosing with VisualVM over a period of time, I have noticed that the heap usage of `io.prometheus.client.CKMSQuantiles$Item` is gradually increasing over time without ever decreasing. Using VisualVM, here is a screenshot of the first snapshot we took:


And here is the second, around an hour later
As you can see, the number of `int`s has also increased, which seems to correspond to the fields of the `CKMSQuantiles$Item` class. Although it is expected that the number of items in the quantile increases, in our case the memory is never freed, and our applications actually restart due to OOM roughly once a day. Our bucket size is 10 minutes by default, so the memory should be freed shortly after that point.
By the way, we are using https://developer.lightbend.com/docs/cinnamon/current/plugins/prometheus/prometheus.html, which uses client_java under the hood; the default configuration settings are documented there.
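For context, the rotation behaviour described above is configured through client_java's public `Summary` builder, which tracks each configured quantile with CKMSQuantiles instances behind a rotating time window. A minimal sketch (the metric name is made up; the 600-second window and 5 age buckets are client_java's documented defaults, matching the 10-minute bucket size mentioned above):

```java
import io.prometheus.client.Summary;

public class LatencyMetrics {
    // Hypothetical summary for illustration. As an age bucket rotates out
    // (here every 600 s / 5 = 120 s), its samples should be dropped, which
    // is why memory was expected to be freed after roughly 10 minutes.
    static final Summary requestLatency = Summary.build()
            .name("request_latency_seconds")
            .help("Request latency in seconds.")
            .quantile(0.5, 0.05)    // median with 5% tolerated error
            .quantile(0.99, 0.001)  // p99 with 0.1% tolerated error
            .maxAgeSeconds(600)     // 10-minute rolling window (the default)
            .ageBuckets(5)          // rotated across 5 buckets (the default)
            .register();
}
```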