hyperloglog produces unintuitively high error rate when the number of buckets is similar to the true cardinality #469

jerryxwu · 2022-07-13T21:24:16Z

Relevant system information:

Timescale Cloud
PostgreSQL version (output of SELECT version();): 14.4
TimescaleDB Toolkit version (output of \dx timescaledb_toolkit in psql): 1.7.0

Describe the bug
Running this query with different buckets and true_cardinality combinations

SELECT distinct_count(hyperloglog(<buckets>, cardinality)) 
FROM generate_series(1, <true_cardinality>) AS test_cardinality (cardinality);

produces these results:

To Reproduce
See above.

Expected behavior
For a fixed precision (say 8,192 (2^13) buckets, which is something you can expect in a continuous aggregation definition), you would expect hyperloglog to produce comparable error rates as the true cardinality of the dataset varies.

Actual behavior
For a fixed precision, again using 8,192 (2^13) as an example, the error rate skyrockets to over 42% as the true cardinality of the dataset approaches 7,000. It comes back down to an acceptable level only slowly, once the true cardinality approaches 2x the number of buckets used. Since the true cardinality of a dataset is almost always unknown ahead of time, the current hyperloglog implementation probably has very limited use cases.
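To put the reported 42% in perspective, here is a rough sketch that compares an observed relative error against the commonly cited asymptotic standard error of a HyperLogLog sketch, about 1.04 / sqrt(m) for m buckets. The function names are hypothetical helpers, not part of the Toolkit, and the 1.04 / sqrt(m) figure is the textbook approximation for the basic estimator, so treat it as a ballpark rather than a guarantee for this implementation.

```python
import math


def hll_standard_error(buckets: int) -> float:
    """Textbook asymptotic relative standard error of HyperLogLog: ~1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(buckets)


def relative_error(estimate: float, true_cardinality: int) -> float:
    """Relative error of an estimate against the known true cardinality."""
    return abs(estimate - true_cardinality) / true_cardinality


buckets = 8192  # 2^13, the precision used in this report
expected = hll_standard_error(buckets)
observed = 0.42  # error rate reported above near a true cardinality of 7,000

print(f"expected standard error: {expected:.2%}")
print(f"observed error is roughly {observed / expected:.0f}x the expected standard error")
```

With `relative_error`, the value returned by `distinct_count` for each `<true_cardinality>` can be turned into an error rate and compared against the same baseline.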

BTW, error bounds like this would make a lot of sense.

jerryxwu added the bug label Jul 13, 2022

stevedrip mentioned this issue Aug 23, 2022

Hyperloglog++ bias correction appears to be flawed #508

Closed

WireBaron mentioned this issue Sep 9, 2022

Fix errors in hyperloglog implementation #531

Merged

bors bot closed this as completed in 3515244 Sep 12, 2022
