Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hyperloglog produces unintuitively high error rate when the number of buckets is similar to the true cardinality #469

Closed
jerryxwu opened this issue Jul 13, 2022 · 0 comments · Fixed by #531
Labels
bug Something isn't working

Comments

@jerryxwu
Copy link
Contributor

jerryxwu commented Jul 13, 2022

Relevant system information:

  • Timescale Cloud
  • PostgreSQL version (output of SELECT version();): 14.4
  • TimescaleDB Toolkit version (output of \dx timescaledb_toolkit in psql): 1.7.0

Describe the bug
Running this query with different buckets and true_cardinality combinations

SELECT distinct_count(hyperloglog(<buckets>, cardinality)) 
FROM generate_series(1, <true_cardinality>) AS test_cardinality (cardinality);

produces these results:

Screen Shot 2022-07-13 at 12 28 47 PM

To Reproduce
See above.

Expected behavior
For a fixed given precision (say 8,192 (2^13) which is something that you can expect in a continuous aggregation definition, when the true cardinality in the dataset varies, you would expect hyperloglog produce comparable error rates.

Actual behavior
For a fixed precision, using 8,192 (2^13)) as an example again, when the true cardinality of the data set approaches 7,000, the error rate skyrockets to over 42%. The error rate slowly comes down to an acceptably level only after the true cardinality approaches 2x of the number of buckets used. Since the true cardinality of a dataset is almost always unknown ahead of time, the current hyperloglog implementation probably has very limited use cases.

BTW, error bounds like this would make a lot of sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant