Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in CKMSQuantiles.java? #422

Closed
mdedetrich opened this issue Sep 10, 2018 · 6 comments
Closed

Memory leak in CKMSQuantiles.java? #422

mdedetrich opened this issue Sep 10, 2018 · 6 comments

Comments

@mdedetrich
Copy link

At work we have noticed that one of our applications have started leaking memory. After diagnosing with VisualVM over a period of time, I have noticed that the heap size of io.prometheus.client.CKMSQuantiles$Item is gradually increasing over time without it ever lowering.

Using VisualVM, here is a screenshot of the first snaphot that we did
screen shot 2018-09-10 at 16 38 05
And here is the second, around an hour later
screen shot 2018-09-10 at 16 39 28
As you can see, the amounts of int's have also increased which seem to correspond to the members of the Quantiles Item class.

Although its expected that the number of items in the quantile increase, in our case the memory is never freed and our applications actually restart due to OOM roughly once a day. Our bucket size by default is 10 minutes, so it should be freeing up the memory shortly after that point.

We are btw using https://developer.lightbend.com/docs/cinnamon/current/plugins/prometheus/prometheus.html which uses client_java under the hood, you can also see the default configuration settings.

@brian-brazil
Copy link
Contributor

Hmm, that's not good. Did you wait at least 20m before taking the first snapshot? Does the growth continue similarly at about 3MB per hour?

@mdedetrich
Copy link
Author

mdedetrich commented Sep 10, 2018

I waited one hour before doing the first snapshot, and yes it does seem to continuously grow at a consistent rate.

I will double check this now as I am starting up some new nodes, will give an update tomorrow (it also doesn't appear to be entirely deterministic so I am clarifying this)

@mdedetrich
Copy link
Author

@brian-brazil I am doing another run now, unfortunately it didn't work last night because K8s terminated our pod halfway through.

BTW, the leak is more than 3mb per hour. 3mb is just the CKMSQuantiles$Item class, but the int's are also increasing (since they are contained within the class), in the above screenshot this is roughly 19 megs (for a total of ~22 megs per hour). 22 * 24 = 528 and our pods have 1.5 gigs of memory allocated which does roughly coincide with how often they seem to be shutting down due to OOM

@mdedetrich
Copy link
Author

Okay so last update, we don't think its due to client_java. There were issues with our JVM pods when it came to GC (within the context of Docker) that was causing the JVM to go OOM. Will close the issue and reopen if I find out its actually due to client_java

@kakaroto2
Copy link

Okay so last update, we don't think its due to client_java. There were issues with our JVM pods when it came to GC (within the context of Docker) that was causing the JVM to go OOM. Will close the issue and reopen if I find out its actually due to client_java

@mdedetrich can u describle the issues with your JVM pods? I've been meeting the same problem.

@mdedetrich
Copy link
Author

DieBauer added a commit to DieBauer/client_java that referenced this issue Jan 22, 2022
CKMSQuantiles is copied from an implementation of 2012, where it states that a ‘HACK’ was done, admitting a space leak.
This leak has been noticed several times (prometheus#422, prometheus#550, prometheus#654).
By correctly applying the algorithm from the paper we fix the leak.

I have added unit-tests to show that the behaviour is correct.
I have also added a Benchmark in the benchmark module showing the difference with the old and current
implementation.

According to my benchmarks, is in the new implementation a `get` of a quantile that has ‘seen’ 1 million elements 440 times faster.
Inserting 1 million elements is 3.5 times faster.

While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.
DieBauer added a commit to DieBauer/client_java that referenced this issue Jan 22, 2022
CKMSQuantiles is copied from an implementation of 2012, where it states that a ‘HACK’ was done, admitting a space leak.
This leak has been noticed several times (prometheus#422, prometheus#550, prometheus#654).
By correctly applying the algorithm from the paper we fix the leak.

I have added unit-tests to show that the behaviour is correct.
I have also added a Benchmark in the benchmark module showing the difference with the old and current
implementation.

According to my benchmarks, is in the new implementation a `get` of a quantile that has ‘seen’ 1 million elements 440 times faster.
Inserting 1 million elements is 3.5 times faster.

While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.

Signed-off-by: Jens <jenskat@gmail.com>
fstab pushed a commit that referenced this issue Jan 30, 2022
CKMSQuantiles is copied from an implementation of 2012, where it states that a ‘HACK’ was done, admitting a space leak.
This leak has been noticed several times (#422, #550, #654).
By correctly applying the algorithm from the paper we fix the leak.

I have added unit-tests to show that the behaviour is correct.
I have also added a Benchmark in the benchmark module showing the difference with the old and current
implementation.

According to my benchmarks, is in the new implementation a `get` of a quantile that has ‘seen’ 1 million elements 440 times faster.
Inserting 1 million elements is 3.5 times faster.

While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.

Signed-off-by: Jens <jenskat@gmail.com>
fstab pushed a commit that referenced this issue Jan 30, 2022
CKMSQuantiles is copied from an implementation of 2012, where it states that a ‘HACK’ was done, admitting a space leak.
This leak has been noticed several times (#422, #550, #654).
By correctly applying the algorithm from the paper we fix the leak.

I have added unit-tests to show that the behaviour is correct.
I have also added a Benchmark in the benchmark module showing the difference with the old and current
implementation.

According to my benchmarks, is in the new implementation a `get` of a quantile that has ‘seen’ 1 million elements 440 times faster.
Inserting 1 million elements is 3.5 times faster.

While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.

Signed-off-by: Jens <jenskat@gmail.com>

Fix assertion in HistogramTest

Median of 1..11 = 6

Signed-off-by: Jens <jenskat@gmail.com>

Address PR remarks

Signed-off-by: Jens <jenskat@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants