Hyperloglog++ bias correction appears to be flawed #508
Comments
Thanks @stevedrip for the thorough bug report!
I can confirm that the bias correction we use is subtly different from the one in the HLL++ appendix. (Which is strange; I thought I had compared them.)
@WireBaron do you have time to open a PR migrating to the appendix's correction data? I think it should wait until after PR #509 is merged, as that should have a much larger impact, but I'm curious about the results.
509: Fix typos and possibly a bug? r=WireBaron a=BenSandeen
Fix a typo in comments, fix some whitespace, and possibly fix a bug? I haven't been able to get the tests running, so I'm sorry if this fails tests. This is an attempt to resolve this issue: #508
Co-authored-by: BenSandeen <12025856+BenSandeen@users.noreply.github.com>
Relevant system information:
Describe the bug
At the inflection point between sparse and dense representations for each bucket size, error rates can vary by 50% or more.
To Reproduce
At 2^12:
At 2^14:
With detailed logs:
which produces:
Additional context
Original slack thread:
https://timescaledb.slack.com/archives/C02LXMKJMMJ/p1661285484474549
Possibly related to:
#469
Evidence that suggests timescaledb supports hyperloglog++:
timescaledb-toolkit/crates/hyperloglog/src/lib.rs
Lines 54 to 60 in f15910a
A resource that explains how hyperloglog++ should perform compared to hyperloglog after introducing a bias estimate:
https://blog.acolyer.org/2016/03/17/hyperloglog-in-practice-algorithmic-engineering-of-a-state-of-the-art-cardinality-estimation-algorithm/
Inflection points:
timescaledb-toolkit/crates/hyperloglogplusplus/src/hyperloglog_data.rs
Lines 7 to 23 in f15910a
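For context, here is a minimal sketch of how the HLL++ paper picks its final estimate around these thresholds, as I understand it. The function names are made up for illustration and are not the toolkit's API, and `threshold` and `bias` are passed in as plain numbers rather than read from the data tables.

```rust
/// Linear counting estimate: m * ln(m / V), where V is the number of
/// registers that are still zero.
pub fn linear_counting(m: f64, zero_registers: f64) -> f64 {
    m * (m / zero_registers).ln()
}

/// Hypothetical helper: choose the final estimate following the HLL++ paper.
/// `raw` is the raw HyperLogLog estimate, `bias` the estimated bias for `raw`,
/// `m` the number of registers, and `threshold` the per-precision cutoff.
pub fn select_estimate(raw: f64, bias: f64, m: u32, zero_registers: u32, threshold: f64) -> f64 {
    let m_f = f64::from(m);
    // Bias correction is only applied for raw estimates up to 5m.
    let corrected = if raw <= 5.0 * m_f { raw - bias } else { raw };
    // Linear counting is only defined while some registers are still empty.
    let h = if zero_registers != 0 {
        linear_counting(m_f, f64::from(zero_registers))
    } else {
        corrected
    };
    // At or below the threshold, trust linear counting; above it, use the
    // (possibly bias-corrected) raw estimate.
    if h <= threshold { h } else { corrected }
}
```

If the threshold or the bias is off, the estimate jumps where the function switches from `h` to `corrected`, which would be consistent with the error spike described above.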
Bias estimation values (it is possible we are choosing the wrong values from this vector to correct biases):
timescaledb-toolkit/crates/hyperloglogplusplus/src/hyperloglog_data.rs
Line 2393 in f15910a
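To make the "wrong values" suspicion concrete: as far as I can tell, the HLL++ paper estimates the bias by averaging the bias values of the k nearest raw-estimate sample points (k = 6 in the paper). Below is a rough sketch of that lookup. `estimate_bias` and its parameters are hypothetical names, not the toolkit's code, and the slices stand in for the per-precision raw-estimate and bias tables.

```rust
/// Hypothetical sketch of a k-nearest-neighbour bias lookup.
/// `raw_estimates[i]` and `biases[i]` must describe the same sample point.
pub fn estimate_bias(raw: f64, raw_estimates: &[f64], biases: &[f64], k: usize) -> f64 {
    assert_eq!(raw_estimates.len(), biases.len());
    assert!(k > 0 && !raw_estimates.is_empty());

    // Pair each bias with its sample point's distance from the raw estimate.
    let mut by_distance: Vec<(f64, f64)> = raw_estimates
        .iter()
        .zip(biases)
        .map(|(&e, &b)| ((e - raw).abs(), b))
        .collect();
    by_distance.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Average the biases of the k closest sample points (fewer if the table is short).
    let k = k.min(by_distance.len());
    by_distance[..k].iter().map(|&(_, b)| b).sum::<f64>() / k as f64
}
```

An off-by-one in which neighbours get picked, or a bias table that does not line up index-for-index with the raw-estimate table, would produce the kind of error seen near the thresholds.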
It appears that the bug may actually exist in this repo, from which this code was copied:
https://github.com/crepererum/pdatastructs.rs/blob/main/src/hyperloglog.rs