
Tunable error rate #19

Open

nathanhack opened this issue Sep 10, 2020 · 8 comments

Comments
@nathanhack commented Sep 10, 2020

It would be great to have a way to change/tune the false negative rate.

@thomasmueller

I assume you mean the false positive rate, right? I know some libraries provide functions to, e.g., calculate the bits per entry for a target false positive rate, the false positive rate for a given number of bits per entry, or the size of the filter (memory used) from the rate and the entry count, and so on. Yes, I agree it would be a good addition!
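For illustration only, a minimal Go sketch of such helpers (the function names are made up and not part of this repo; the Bloom formulas are the standard optimal-filter ones, and the xor estimate assumes roughly 2^-b false positives and about 1.23·b bits per key for b-bit fingerprints):

package sizing

import "math"

// BloomBitsPerEntry returns the bits per entry an optimal Bloom filter
// needs to reach the target false positive rate fpp: -log2(fpp)/ln(2).
func BloomBitsPerEntry(fpp float64) float64 {
	return -math.Log2(fpp) / math.Ln2
}

// BloomFalsePositiveRate is the inverse: the rate obtained for a given
// number of bits per entry, again assuming an optimal Bloom filter.
func BloomFalsePositiveRate(bitsPerEntry float64) float64 {
	return math.Pow(2, -bitsPerEntry*math.Ln2)
}

// XorFilterSizeBytes estimates the memory used by a xor filter with
// b-bit fingerprints over n keys (capacity factor ~1.23, as in this repo).
func XorFilterSizeBytes(n uint64, fingerprintBits uint64) uint64 {
	slots := uint64(math.Ceil(1.23*float64(n))) + 32
	return slots * fingerprintBits / 8
}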

@nathanhack (Author)

Sorry, yes, I meant false positive rate.

@nathanhack (Author)

In particular, I'm interested in the case of setting a fixed size and letting the false positive rate fluctuate as the keys are inserted.
@thomasmueller do you have any suggestions on how the repo might be changed to include such a feature? As time permits, I'll work on it, and if/when I get it done I can put in a PR, if that is something the repo owners would like.

@thomasmueller

setting a fixed size and letting the false positive rate fluctuate as the keys are inserted.

A xor filter, unlike a Bloom filter, is built once all keys are present. A Bloom filter allows adding entries afterwards, but a xor filter does not. So what you are looking for might not be possible, I'm afraid...
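To illustrate the difference, a minimal sketch assuming this repo's Populate/Contains API (the exact names matter less than the fact that construction takes the full key slice up front):

package main

import (
	"fmt"

	"github.com/FastFilter/xorfilter"
)

func main() {
	// A xor filter is built in one shot from the complete set of (hashed) keys.
	keys := []uint64{101, 202, 303, 404, 505}
	filter, err := xorfilter.Populate(keys)
	if err != nil {
		panic(err)
	}
	fmt.Println(filter.Contains(202)) // true
	// There is no incremental Add: inserting a new key later means
	// rebuilding the filter from the extended key slice.
}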

@nathanhack (Author)

Yeah, to clarify, I indeed meant all at once, such that you set the size and let the error rate suffer.
After briefly looking at the paper, I would hazard a guess that the main issue would be how to choose the hash functions. If I understand it correctly, we basically keep trying new hash functions until we get everything in, but with a fixed size this would always fail. If true, then the simplest approach would be to keep track of the best set of hashes and, at the end of MaxIterations, just use the best hashes. That way, if the user wishes to continue trying their luck, they can just increase MaxIterations. It seems like there should be some way to calculate the expected error rate based on the fixed size, but I don't recall seeing that explicitly written down in the paper (maybe it was, or maybe it could be derived, but I honestly don't know; it would require a thorough reading).

@thomasmueller

I indeed meant all at once

Ah ok! Yes that works then.

you set the size and let the error rate suffer.

The implementation here currently only supports 8-bit fingerprints (uint8); it doesn't support smaller or larger fingerprints. So I'm afraid this doesn't currently work... It's possible to add more implementations, but we believe it's not currently worth it... Of course, feel free to provide a patch!

increase MaxIterations

No, I'm afraid that's not a solution... What you could change a tiny bit is this line:

capacity := 32 + uint32(math.Ceil(1.23*float64(size)))

and, for example, try 1.229 instead of 1.23. It would work sometimes, but the probability of finding a seed drops drastically, and very quickly... It's not something I would try. With a regular Bloom filter, yes, you can just make it smaller and it will still "work" (just with a bad false positive rate), but not with a xor filter.
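For reference, the knob that actually trades space for accuracy in a xor filter is the fingerprint width b, not the 1.23 capacity factor: with b-bit fingerprints the false positive rate is roughly 2^-b at roughly 1.23·b bits per key. A small illustrative sketch of that trade-off:

package main

import (
	"fmt"
	"math"
)

func main() {
	// With b-bit fingerprints, a xor filter has ~2^-b false positive rate
	// and uses ~1.23*b bits per key; 8-bit fingerprints give ~0.39%.
	for _, b := range []int{8, 16, 32} {
		fpp := math.Pow(2, -float64(b))
		fmt.Printf("xor%-2d: ~%.6g%% false positives, ~%.1f bits/key\n",
			b, 100*fpp, 1.23*float64(b))
	}
}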

@brianolson

I implemented xor16, and I think I also got it generalized to xorN (for 9..32 bits):
https://github.com/brianolson/xorfilter/tree/xor16
There's some other stuff in there about maintaining a fork in case this work doesn't make it back upstream, but that shouldn't be hard to un-diff.

@lemire (Member) commented Jun 18, 2021

@brianolson Your generic implementation seems to set the memory usage to 32 bits per entry. I understand that you can write only the least significant N bits of each word to disk or to the network, and then upsample when deserializing... but it comes at a cost.
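A minimal sketch of the serialization idea being described, not code from either repository: keep one fingerprint per uint32 in memory, but pack only the low N bits on write and widen back to uint32 on read (packLowBits/unpackLowBits are made-up names for illustration):

package main

import "fmt"

// packLowBits writes the least significant n bits of each word into a
// compact byte slice (LSB-first within each byte).
func packLowBits(words []uint32, n int) []byte {
	out := make([]byte, (len(words)*n+7)/8)
	bit := 0
	for _, w := range words {
		for i := 0; i < n; i++ {
			if w&(1<<uint(i)) != 0 {
				out[bit/8] |= 1 << uint(bit%8)
			}
			bit++
		}
	}
	return out
}

// unpackLowBits reverses packLowBits, widening each n-bit value back to
// a uint32 ("upsampling" on deserialization).
func unpackLowBits(data []byte, n, count int) []uint32 {
	words := make([]uint32, count)
	bit := 0
	for j := 0; j < count; j++ {
		var w uint32
		for i := 0; i < n; i++ {
			if data[bit/8]&(1<<uint(bit%8)) != 0 {
				w |= 1 << uint(i)
			}
			bit++
		}
		words[j] = w
	}
	return words
}

func main() {
	fingerprints := []uint32{0x1FF, 0x0AB, 0x155} // 9-bit fingerprints held in uint32
	packed := packLowBits(fingerprints, 9)
	fmt.Println(len(packed), unpackLowBits(packed, 9, len(fingerprints)))
	// 4 bytes on the wire instead of 12 in memory, at the cost of the pack/unpack step.
}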
