Relevant system information:
PostgreSQL version (output of SELECT version();): 14.4
TimescaleDB Toolkit version (output of \dx timescaledb_toolkit in psql): 1.7.0
Describe the bug
Running this query with different combinations of <buckets> and <true_cardinality>
SELECT distinct_count(hyperloglog(<buckets>, cardinality))
FROM generate_series(1, <true_cardinality>) AS test_cardinality (cardinality);
produces these results:
To Reproduce
See above.
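As a rough sketch, the whole sweep can be run in one statement for a fixed bucket count. The VALUES list and the error_pct column are illustrative additions, not the exact combinations behind the results above; only hyperloglog, distinct_count and generate_series come from the original query.

```sql
-- Sketch: fix the bucket count at 8192 and vary the true cardinality.
-- The cardinalities in the VALUES list are hypothetical sample points.
SELECT
    t.n AS true_cardinality,
    est.estimate,
    -- relative error of the estimate, as a percentage of the true value
    round(100.0 * abs(est.estimate - t.n) / t.n, 2) AS error_pct
FROM (VALUES (1000), (5000), (7000), (8192), (16384), (100000)) AS t (n)
CROSS JOIN LATERAL (
    -- same aggregate as in the query above, evaluated once per cardinality
    SELECT distinct_count(hyperloglog(8192, cardinality)) AS estimate
    FROM generate_series(1, t.n) AS s (cardinality)
) AS est
ORDER BY t.n;
```

Changing the 8192 literal to another bucket count gives the other half of the comparison.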
Expected behavior
For a fixed precision (say 8,192 (2^13) buckets, a value you might reasonably see in a continuous aggregate definition), you would expect hyperloglog to produce comparable error rates as the true cardinality of the dataset varies.
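For reference, the usual textbook figure for HyperLogLog's relative standard error with m buckets is about 1.04/sqrt(m). Whether the Toolkit implementation targets exactly this figure is an assumption on my part, but it gives a sense of the error rate one would expect at this precision:

```latex
\[
\sigma \approx \frac{1.04}{\sqrt{m}},
\qquad
m = 2^{13} = 8192
\;\Rightarrow\;
\sigma \approx \frac{1.04}{90.51} \approx 1.15\%
\]
```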
Actual behavior
For a fixed precision, again using 8,192 (2^13) buckets as an example, the error rate skyrockets to over 42% as the true cardinality of the dataset approaches 7,000. It slowly comes back down to an acceptable level only once the true cardinality approaches 2x the number of buckets used. Since the true cardinality of a dataset is almost always unknown ahead of time, the current hyperloglog implementation probably has very limited use cases.
BTW, error bounds like this would make a lot of sense.