Improve CKMSQuantiles and address memory leak #755
Conversation
CKMSQuantiles is copied from a 2012 implementation, where it states that a 'HACK' was done, admitting a space leak. This leak has been noticed several times (prometheus#422, prometheus#550, prometheus#654). By correctly applying the algorithm from the paper we fix the leak.

I have added unit tests to show that the behaviour is correct. I have also added a benchmark in the benchmark module showing the difference between the old and the current implementation. According to my benchmarks, in the new implementation a `get` on a quantile that has 'seen' 1 million elements is 440 times faster. Inserting 1 million elements is 3.5 times faster.

While going through the CKMS paper and the Java implementation I have added remarks and snippets from the paper, to clarify why certain choices are made.

Signed-off-by: Jens <jenskat@gmail.com>
Median of 1..11 = 6 Signed-off-by: Jens <jenskat@gmail.com>
Thanks a lot for putting in the effort and reading through the paper. The PR contains some refactoring and some changes in functionality. I'm trying to understand the functional changes.

The "hack" you mention in the original implementation is this:

```java
// NOTE: according to CKMS, this should be count, not size, but this
// leads
// to error larger than the error bounds. Leaving it like this is
// essentially a HACK, and blows up memory, but does "work".
// int size = count;
int size = sample.size();
double minError = size + 1;
```

Your PR changes it to this:

```java
int n = count;
double minError = count;
```

So basically you enabled the commented-out line `int size = count;`. The other functional changes seem to be in …

I understand that you aligned the implementation more closely with the pseudo-code in the paper, and wrote tests to prove that this fixes the "too large errors" issue. I'm wondering: can you pinpoint what exactly is wrong in the original implementation? Maybe it's not possible to say, but it would be awesome to know what the error was and how your change fixes it.
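For reference, here is a minimal sketch of the allowable-error function f(r, n) from the paper with that change applied. The names (`allowableError`, `Quantile.u`, `Quantile.v`) mirror the 2012 implementation quoted above, the rest is an assumption; the essential point is only that `n` is the fixed number of observed values (`count`), not the shrinking `sample.size()`:

```java
/**
 * Sketch of f(r, n) for targeted quantiles (CKMS, Definition 5):
 * the maximum rank uncertainty (g + delta) allowed for a sample whose
 * true rank is r, out of n observed values. Per tracked quantile phi,
 * u = 2 * epsilon / (1 - phi) and v = 2 * epsilon / phi are precomputed.
 */
double allowableError(double rank, int n) {
    double minError = n + 1; // upper bound; any matching f_j below wins
    for (Quantile q : quantiles) {
        double error;
        if (rank <= q.quantile * n) {
            error = q.u * (n - rank); // below the target rank phi * n
        } else {
            error = q.v * rank;       // above the target rank phi * n
        }
        if (error < minError) {
            minError = error;
        }
    }
    return minError;
}
```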
Yes, I think the PR can be split into 3 parts. Then for the actual issue of using the mutable `sample.size()`: the compress method iterates over the sample list. TL;DR: do not depend on mutable global state.

Example: given that we have a uniform distribution of [1, 500] and that we are interested in the p99 within 1%. We start compressing elements by looping over the sample.listIterator(). We know that all elements in sample are ordered on value. I don't have this written down in math, but you can convince yourself that having this moving size is the problem: we jump right into the merging, have rank 1, and calculate a value based on the length of the iterator (0.99 * 999). At some point it is just not possible to match a given item with an error bound on this 'moving' sample.size(), while for example it should be merged given the original sample size. (I have checked that the size does indeed become better (not as good as it is now) when you pass a 'size' to the allowableErrors method that does not change during the iteration.) (Note that there is also a …)

All of this is fixed if you take into account not the rank in the sample but the rank that the sample represents, and not the sample size but the fixed 'observed values' size.
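To make the "fixed size, true rank" point concrete, here is a simplified sketch of a compress pass along those lines. The traversal direction and the exact rank passed to the error function were refined later in this thread, so treat this purely as an illustration of the bookkeeping; `allowableError` is the sketch above and `count` is the total number of observed values:

```java
void compress() {
    if (samples.size() < 2) {
        return;
    }
    ListIterator<Item> it = samples.listIterator();
    Item prev;
    Item next = it.next();
    int r = 0; // running sum of g: the lowest possible rank before 'prev'
    while (it.hasNext()) {
        prev = next;
        next = it.next();
        // The bound uses prev's true rank in the observed stream and the
        // fixed 'count', neither of which changes while we merge; this is
        // exactly what the 'moving' sample.size() version got wrong.
        if (prev.g + next.g + next.delta <= allowableError(r + prev.g, count)) {
            next.g += prev.g; // fold prev's weight into next; ranks are preserved
            it.previous();    // step back over next ...
            it.previous();    // ... and over prev
            it.remove();      // drop prev
            it.next();        // re-position after next
        } else {
            r += prev.g;      // prev survives; its weight advances the rank
        }
    }
}
```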
Thanks a lot again for the fix and the explanation. I took the time to read the paper, and I think your changes are correct.
I have a couple of minor code review comments, but none of them is related to the algorithm itself.
Please also remove `CKMSQuantilesOld`, no need to keep that.
```diff
  */
- class CKMSQuantiles {
+ public final class CKMSQuantiles { /* public for benchmark-module */
```
Let's keep it package private; as the benchmark is in the same package, that should work.
Unfortunately, the benchmark is in a subpackage of `client` (`io.prometheus.client.benchmark`), so it is not accessible when not public.
For now I'll move the benchmark up. Let me know how you want to proceed.
```java
this.quantiles = quantiles;
// hard-coded epsilon of 0.1% to determine the batch size, and default epsilon in case of empty quantiles
double pointOnePercent = 0.001;
if (quantiles.length == 0) { // we need at least one for this algorithm to work
```
I feel we should rather throw an `IllegalArgumentException` in that case than use default quantiles. If this constructor is called with an empty array, that is almost certainly a bug and not on purpose.
Agreed. I mainly added this because the comments in Summary state:

```java
final List<Quantile> quantiles; // Can be empty, but can never be null.
```

I'm not sure what changes when this restriction is applied to users of Summaries that don't have specific quantiles applied to the summary.
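For illustration, a minimal sketch of the suggested guard, assuming `Summary` then simply skips constructing a `CKMSQuantiles` when no quantiles are configured:

```java
CKMSQuantiles(Quantile[] quantiles) {
    // Fail fast: an empty quantile array is almost certainly a caller bug,
    // and the algorithm needs at least one target quantile for its bounds.
    if (quantiles == null || quantiles.length == 0) {
        throw new IllegalArgumentException("quantiles must not be empty");
    }
    this.quantiles = quantiles;
}
```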
```java
 * g_i is the difference between the lowest
 * possible rank of item i and the lowest possible rank of item
 * i − 1
 */
public int g;
```
Again not your code, but `g` should be `final` as well.
This `g` is written to in the compress method: `next.g += prev.g;`. I will remove the `public` from the members in `Quantile` and `Item`, though.
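A sketch of what that leaves for the `Item` fields (names as in the snippets above; the comments paraphrase the paper):

```java
static class Item {
    final double value; // the observed value this sample represents
    int g;              // gap between this item's lowest possible rank and
                        // the previous item's; mutated by compress() via
                        // next.g += prev.g, so it cannot be final
    final int delta;    // uncertainty in this item's rank

    Item(double value, int delta) {
        this.value = value;
        this.g = 1; // a freshly observed value covers exactly one rank
        this.delta = delta;
    }
}
```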
Thanks for taking the time and effort to review this! I'll address the comments. Let me know what the git commit strategy is (squash/rebase, …).
I like rebase and squash, so that we have a linear history with a single commit per PR.
Signed-off-by: Jens <jenskat@gmail.com>
Thank you!
FYI, I found a little bug in the million entries test after I merged the PR:

```java
Collections.shuffle(Arrays.asList(shuffle), rand);
```

With an actual random array it turns out that the corner case … I fixed the test and changed the constructor of …
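Assuming the test's `shuffle` variable was a primitive array (not stated explicitly above, but it would explain the silent no-op), the pitfall is that `Arrays.asList(long[])` produces a single-element `List<long[]>`; only a boxed array actually gets permuted:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

public class ShuffleDemo {
    public static void main(String[] args) {
        Random rand = new Random(42);

        long[] primitive = {1, 2, 3, 4, 5};
        // Arrays.asList(long[]) infers T = long[], producing a List<long[]>
        // with a single element, so shuffling it is a silent no-op:
        Collections.shuffle(Arrays.asList(primitive), rand);
        System.out.println(Arrays.toString(primitive)); // still [1, 2, 3, 4, 5]

        Long[] boxed = {1L, 2L, 3L, 4L, 5L};
        // Arrays.asList(Long[]) produces a List<Long> backed by the array,
        // so shuffling it really permutes the values:
        Collections.shuffle(Arrays.asList(boxed), rand);
        System.out.println(Arrays.toString(boxed)); // a permutation of 1..5
    }
}
```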
I found another bug, so I reverted the PR on … The bug is: if you take …

I'm not sure at the moment what causes this.
If you find what's going wrong, feel free to create a PR for a fix against the …
I have looked into this. In the case of only 1 quantile, the …

I have run the same unit-test suite against the 'old' implementation and multiple tests fail, including the 'edge' case you mentioned.

I have a fix for the quantiles of 1.0 and 0.0. These should be allowed. The compress function was too eager and should not have compressed the minimum observed value.
I benchmarked the length of the buffer (1 / (2ε)) for a given epsilon. Adding a million elements gives this: …

So with an epsilon of 1 percent the buffer size is 50, while at 0.01% it is 5000. That is where the time it takes to sort the buffer kicks in. The Arrays.sort benchmark looks like this: …

Sorting n elements with Arrays.sort costs more than 10 times sorting n/10 elements. Now in the insert benchmark there is also the hidden cost of a smaller epsilon: samples.size() grows faster to gain the required precision. I think it's best to make the buffer size dependent on the epsilon, to cater for the case where an epsilon is way bigger than the hard-coded one, and cap it at 500, to avoid too much performance degradation.
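A minimal sketch of that sizing rule; the cap of 500 is the number suggested above, the rest is an assumption:

```java
// Derive the insert-batch buffer size from epsilon, per the paper's
// 1 / (2 * epsilon), but cap it at 500 so that sorting the buffer on
// each flush stays cheap even for very small epsilons.
static int bufferSize(double epsilon) {
    int size = (int) (1.0 / (2.0 * epsilon));
    return Math.min(Math.max(size, 1), 500);
}
// epsilon = 0.01   -> 50
// epsilon = 0.001  -> 500
// epsilon = 0.0001 -> 5000, capped to 500
```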
Shouldn't the algorithm be correct independent of the …?
I'll try to see if I can get a better answer. The insertThreshold is indeed for compressing; when to compress or not depends on the number of observed values.
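To make that cadence concrete, a sketch of how the buffer, the insert threshold, and compression typically interact in this kind of implementation (field and method names are assumptions, not the PR's actual code):

```java
private final double[] buffer = new double[bufferSize(epsilon)];
private int bufferIdx = 0;

void insert(double value) {
    buffer[bufferIdx++] = value;
    if (bufferIdx == buffer.length) {
        // Flush: sorting the batch lets insertBatch() merge it into the
        // ordered sample list in one pass; compress() then prunes whatever
        // the error bounds (which depend on the observed count) allow.
        Arrays.sort(buffer, 0, bufferIdx);
        insertBatch(buffer, bufferIdx);
        bufferIdx = 0;
        compress();
    }
}
```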
I need to look into this statement in section 6.2 (Space Usage for Targeted Quantiles): …
Found it. It was indeed a bug in the implementation, a classic off-by-one error ;). Thanks for the critical look!

The error was in how a new item is taken into account in the insertBatch() method. Near the end we do:

```java
Item newItem = new Item(v, delta);
it.add(newItem);
count++;
item = newItem;
```

The newly created item is added in the correct position in the list, with … Section 4.1 (Proof) mentions: …

Since g is 1 by default we can add the newly created item's g, like this: …

I'll update the PR.
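The actual snippet is elided above; a hypothetical illustration of the shape such a fix would take (the running `rank` variable is assumed, the rest mirrors the quoted code):

```java
Item newItem = new Item(v, delta);
it.add(newItem);
count++;
item = newItem;
// Hypothetical fix: the new item's weight (g == 1 by default) must advance
// the running rank, otherwise every later insertion in this batch lands
// one rank too low, i.e. the off-by-one described above.
rank += newItem.g;
```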
Hi, looks like we were both working on it at the same time. I found this as well, and a couple of other issues (subtle rounding errors in the error function, which sounds small, but once a sample is at a wrong position this affects future inserts; doing compress from right to left instead of left to right; etc.).

I also added much more fine-grained tests that go through the list of samples after each insert() and compress() and make sure that the invariants defined in the paper are satisfied. That way I found some issues that are hard to see if you just test the end result.

I also realized that if you read that code, you need to have the paper next to you. So I removed comments that are just paragraphs from the paper, because people will read them in the paper anyway and not in the Javadoc. I also renamed a few things to be aligned with the names in the paper, like …

I pushed a commit where my tests are green, but I guess it would be good to have another look. And I did not run any benchmarks.

BTW regarding the 0.0 and 1.0 quantiles: I did not look at this, but the question is what happens if a user just specifies a 0.0 quantile to track the minimum or a 1.0 quantile to track the maximum, and no other quantiles?
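A sketch of the kind of per-step invariant check described, run after every insert() and compress() in a test; method and field names reuse the hypothetical sketches earlier in this thread:

```java
// Verify the paper's invariant for every retained sample:
// g_i + delta_i <= f(r_i, n), where r_i is the running sum of g
// (the lowest possible rank of item i) and n is the observation count.
static void assertInvariants(List<Item> samples, int count, CKMSQuantiles q) {
    int rank = 0;
    for (Item item : samples) {
        rank += item.g;
        if (item.g + item.delta > q.allowableError(rank, count)) {
            throw new AssertionError("invariant violated at rank " + rank);
        }
    }
}
```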
Alright, great to hear! I can imagine the amount of work that has to go into testing this properly, which I skipped because of the 'ballpark' figures I got from the high-level tests. Let me know where I can have a look at your changes!

Regarding only defining a quantile of 0 or 1: yes, these are indeed the edge cases of min and max in the value stream.
I thought about the …
I merged it once more, thanks a lot for your contribution!
Nice. I did benchmarking on the current implementation: insertion is constant (within error bounds), while in the old implementation it was linear. Lookups are 1 or 2 μs for p95 with 1 million elements. This is because the sample size also increases with the number of elements, so it is expected that looking up the same percentile in a larger list takes more time.

One small improvement I see is in the … This change is worthwhile since … Here's a benchmark result when removing the Math.floor call: …

And the f method itself, with count = 10000 and r = 9500 (and 4 quantiles to loop over): …

I'll create a PR for that.
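For context, a sketch of that micro-optimization, assuming the Math.floor call sat in the rank comparison of f. Since rank only ever holds integral values, `rank <= Math.floor(x)` is equivalent to `rank <= x`, so the call can simply be dropped:

```java
// Before: one floor() call per tracked quantile, on every invocation of f.
if (rank <= Math.floor(q.quantile * n)) {
    error = q.u * (n - rank);
} else {
    error = q.v * rank;
}

// After: for an integral rank, rank <= floor(x) iff rank <= x.
if (rank <= q.quantile * n) {
    error = q.u * (n - rank);
} else {
    error = q.v * rank;
}
```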
It's released 🎉. Thanks again for this awesome PR!
CKMSQuantiles is copied from a 2012 implementation, where it states that a 'HACK' was done, admitting a space leak. This was observed in umbrant/QuantileEstimation#2, but never addressed.
This leak has been noticed several times in Prometheus context (#422, #550, #654).
By correctly applying the algorithm from the paper we fix the leak.
I have added unit-tests to show that the class's behaviour is correct.
I have also added a benchmark in the benchmark module showing the difference between the old implementation (moved to the benchmark module) and the current one.
According to my benchmarks, in the new implementation a `get` of a quantile that has 'seen' 1 million elements is 440 times faster. Inserting 1 million elements is 3.5 times faster.
The number of samples needed to keep the required accuracy (within error bounds) is 80 times smaller than in the previous implementation and, according to manual testing, more or less constant, while in the old implementation it grows with the number of observed items (hence the space-leak comment).
While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.
edit: added benchmark results.
Benchmark results