Feeding the bloom filter an estimated NDV instead of the row count should make its bitset smaller.
ClickHouse also has a bloom filter index that seems to work like Databend's; I haven't checked its bitset size yet.
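To get a rough feel for how much the bitset could shrink, here is a minimal sketch using the textbook Bloom filter sizing formulas. It is not taken from Databend's `bloom_filter.rs`, which may size its filter differently; the function names `optimal_bits` and `optimal_hashes` and the 1% false-positive target are assumptions made for illustration.

```rust
/// Textbook optimal bit count for a Bloom filter that holds `n` distinct
/// items at target false-positive probability `p`:
///     m = -n * ln(p) / (ln 2)^2
/// (hypothetical helper, not Databend's API)
fn optimal_bits(n: u64, p: f64) -> u64 {
    assert!(p > 0.0 && p < 1.0, "p must be in (0, 1)");
    let ln2 = std::f64::consts::LN_2;
    (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64
}

/// Textbook optimal number of hash functions: k = (m / n) * ln 2,
/// rounded to the nearest integer and never less than 1.
fn optimal_hashes(m: u64, n: u64) -> u32 {
    if n == 0 {
        return 1;
    }
    let k = (m as f64 / n as f64) * std::f64::consts::LN_2;
    k.round().max(1.0) as u32
}
```

Under these formulas the bit count is linear in `n`: calling `optimal_bits(1_000_000, 0.01)` for a block sized by its row count gives roughly 9.6 million bits (about 1.2 MB), while `optimal_bits(1_000, 0.01)` for the same block sized by an NDV of 1,000 gives roughly 9.6 thousand bits (about 1.2 KB).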
cc @drmingdrmer
Summary
Currently, the "number of distinct values" (NDV) we feed to the bloom filter is the number of rows. That is too conservative (naive): the row count is only an upper bound on the NDV.
We should use something like HyperLogLog to estimate the NDV of a given column and feed that estimate to the bloom filter. For columns with low cardinality, this will shrink the bloom filter index significantly.
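As a rough illustration of the estimation step, the following is a minimal HyperLogLog sketch built only on the Rust standard library. It is not Databend code; the names `Hll` and `estimate_ndv`, the choice of 4096 registers, and the use of `DefaultHasher` are all assumptions made for this example, and a production version would more likely use an existing, tuned HyperLogLog implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const P: u32 = 12; // register index bits (assumed): 2^12 = 4096 registers
const M: usize = 1 << P; // standard error is about 1.04 / sqrt(M), i.e. ~1.6%

/// Minimal HyperLogLog sketch (illustrative, not tuned).
struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    fn new() -> Self {
        Hll { registers: vec![0; M] }
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let h = hasher.finish();
        // The top P bits select a register.
        let idx = (h >> (64 - P)) as usize;
        // The remaining 64 - P bits, left-aligned, give the rank:
        // the 1-based position of the first set bit.
        let rest = h << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Estimate the number of distinct values in one column's values.
fn estimate_ndv<T: Hash>(values: &[T]) -> f64 {
    let mut hll = Hll::new();
    for v in values {
        hll.insert(v);
    }
    hll.estimate()
}
```

The result of `estimate_ndv` for a column would then replace the row count as the item count passed to the bloom filter. Because the estimate can undershoot the true NDV by a few percent, adding a small safety margin before sizing the filter seems prudent.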
https://github.com/datafuselabs/databend/blob/f59efc05bc080e762846fbe8bac24c806e432a75/src/query/storages/index/src/bloom_filter.rs#L151-L162
may also be related to #7314