BUG: feed estimated number of distinct values to bloom filter #7825

dantengsky · 2022-09-23T05:07:18Z

Summary

Currently, the "number of distinct values" we feed to the bloom filter is simply the number of rows, which is too conservative (naive).
We should use something like HyperLogLog to estimate the NDV (number of distinct values) of a given column and feed that estimate to the bloom filter instead. For columns that have low cardinality, this will "shrink" the bloom filter index significantly.

https://github.com/datafuselabs/databend/blob/f59efc05bc080e762846fbe8bac24c806e432a75/src/query/storages/index/src/bloom_filter.rs#L151-L162

may also be related to #7314
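
To make the proposal concrete, here is a minimal, standalone sketch of the estimate-then-size idea. It is not databend's code: `Hll`, `estimate_ndv` and `bloom_bits` are hypothetical names, the precision of 12 bits is an arbitrary choice, and the sizing uses the textbook formula for a classic bloom filter, m = -n * ln(p) / (ln 2)^2, which may differ from how the linked code sizes its filter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical precision: 2^12 = 4096 one-byte registers.
const P: u32 = 12;
const M: usize = 1 << P;

/// Minimal HyperLogLog sketch (illustrative, not a production implementation).
struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    fn new() -> Self {
        Hll { registers: vec![0; M] }
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let x = hasher.finish();
        // The top P bits select a register.
        let idx = (x >> (64 - P)) as usize;
        // Rank = position of the first 1-bit in the remaining 64 - P bits.
        let rest = x << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        // Small-range correction: fall back to linear counting while
        // some registers are still empty.
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Estimate the number of distinct values in a column.
fn estimate_ndv<T: Hash>(column: impl IntoIterator<Item = T>) -> f64 {
    let mut hll = Hll::new();
    for value in column {
        hll.insert(&value);
    }
    hll.estimate()
}

/// Textbook bit count for a classic bloom filter holding `n` distinct
/// items at false-positive probability `p`.
fn bloom_bits(n: u64, p: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    (-(n.max(1) as f64) * p.ln() / (ln2 * ln2)).ceil() as u64
}
```

Under these assumptions, a column of 1,000,000 rows with only 100 distinct values would be sized at roughly 9.6 million bits when the row count is used (at p = 0.01), versus under 1,000 bits when the estimated NDV is used.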

BohuTANG · 2022-09-23T05:40:00Z

This may make the bloom filter bitset smaller.
ClickHouse also has a bloom filter index, and it seems to work like Databend's. I haven't checked its bitset size yet.
cc @drmingdrmer

dantengsky · 2022-09-27T02:28:19Z

The vanilla bloom filter was replaced by an xor filter in PR #7870.

dantengsky added the C-feature Category: feature label Sep 23, 2022

BohuTANG mentioned this issue Sep 23, 2022 in "Tracking: Large dataset insert and read #7823" (closed, 50 tasks)

dantengsky closed this as completed Sep 27, 2022
