Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hyperloglog++ bias correction appears to be flawed #508

Closed
stevedrip opened this issue Aug 23, 2022 · 3 comments · Fixed by #531
Closed

Hyperloglog++ bias correction appears to be flawed #508

stevedrip opened this issue Aug 23, 2022 · 3 comments · Fixed by #531
Labels
bug Something isn't working

Comments

@stevedrip
Copy link

Relevant system information:

  • OS: Ubuntu 20.04/timescaledb-ha:pg14-latest
  • PostgreSQL version: 13.7/14.4
  • TimescaleDB Toolkit version: 1.7.0
  • Installation method: apt/docker

Describe the bug
At the inflection point between sparse and dense representations for each bucket size, error rates can vary by 50% or more.

To Reproduce
At 2^12:

> SELECT distinct_count(hyperloglog((2 ^ 12)::int, v)) logs FROM generate_series(1, 10000) v;
 logs
-------
 10124
(1 row)
> SELECT distinct_count(hyperloglog((2 ^ 12)::int, v)) logs FROM generate_series(1, 5000) v;
 logs
------
 6063
(1 row)

At 2^14:

> SELECT distinct_count(hyperloglog((2 ^ 14)::int, v)) logs FROM generate_series(1, 15000) v;
 logs
-------
 20463
(1 row)
> SELECT distinct_count(hyperloglog((2 ^ 14)::int, v)) logs FROM generate_series(1, 10000) v;
 logs
------
 9970

With detailed logs:

do
$$
declare
  result integer;
  stderr float;
begin
  for f in
    11562..11566
  loop
    select stderror(hyperloglog((2 ^ 14)::int, s)), distinct_count(hyperloglog((2 ^ 14)::int, s)) logs from generate_series(1, f * 1) as s into stderr, result;
    raise notice 'f: % r: %, e: % %%: %', f * 1, result, stderr, result::float / (f * 1);
  end loop;
end;
$$;

which produces:

NOTICE:  f: 11562 r: 11497, e: 0.008125 %: 0.9943781352707144
NOTICE:  f: 11563 r: 11499, e: 0.008125 %: 0.9944651042117097
NOTICE:  f: 11564 r: 18258, e: 0.008125 %: 1.5788654444828778
NOTICE:  f: 11565 r: 18258, e: 0.008125 %: 1.5787289234760051
NOTICE:  f: 11566 r: 18259, e: 0.008125 %: 1.5786788863911465

Additional context
Original slack thread:
https://timescaledb.slack.com/archives/C02LXMKJMMJ/p1661285484474549

Possibly related to:
#469

Evidence that suggests timescaledb supports hyperloglog++:

/// # Implementation
/// - The registers always allocate 8 bits and are not compressed.
/// - No sparse representation is used at any point.
/// - A 64 bit hash function is used (like in HyperLogLog++ paper) instead of the 32 bit hash
/// function (like in the original HyperLogLog paper).
/// - Bias correction is applied and the data is currently just taken from the HyperLogLog++ paper
/// appendix.

A resource that explains how hyperloglog++ should perform compared to hyperloglog after introducing a bias estimate:
https://blog.acolyer.org/2016/03/17/hyperloglog-in-practice-algorithmic-engineering-of-a-state-of-the-art-cardinality-estimation-algorithm/

Inflection points:

pub(crate) const THRESHOLD_DATA_VEC: &[usize] = &[
10, // b = 4
20, // b = 5
40, // b = 6
80, // b = 7
220, // b = 8
400, // b = 9
900, // b = 10
1800, // b = 11
3100, // b = 12
6500, // b = 13
11500, // b = 14
20000, // b = 15
50000, // b = 16
120000, // b = 17
350000, // b = 18
];

Bias estimation values (it is possible we are choosing the wrong values from this vector to correct biases):

It appears that the bug may actually exist in this repo, from which this code was copied:
https://github.com/crepererum/pdatastructs.rs/blob/main/src/hyperloglog.rs

@stevedrip stevedrip added the bug Something isn't working label Aug 23, 2022
@jerryxwu
Copy link
Contributor

Thanks @stevedrip for the thorough bug report!

@JLockerman
Copy link
Contributor

JLockerman commented Aug 25, 2022

I can confirm that the bias correction we use is subtly different than the ones in HLL++ appendix. (Which is strange, I thought I compared them)

@JLockerman
Copy link
Contributor

@WireBaron do you have the time to open a PR migrating to the appendix's correction data? I think it should wait until after PR #509 is merged as that should have a much larger impact, but I'm curious about the results.

bors bot added a commit that referenced this issue Sep 6, 2022
509: Fix typos and possibly a bug? r=WireBaron a=BenSandeen

Fix a typo in comments, fix some whitespace, and possibly fix a bug??

I haven't been able to get the tests running, so I'm sorry if this fails tests

This is an attempt to resolve this issue: #508

Co-authored-by: BenSandeen <12025856+BenSandeen@users.noreply.github.com>
@bors bors bot closed this as completed in 3515244 Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants