BloomFilter broken when really big #1119
Comments
Original comment posted by kurt.kluever on 2012-08-24 at 06:11 PM (No comment entered for this change.) Owner: andreou@google.com |
Original comment posted by Maaartinus on 2012-09-19 at 09:08 PM I wouldn't say that an arbitrarily bad FPP and "/ by zero" are an RFE. It happens for big sizes only, but nobody said those aren't allowed. |
Original comment posted by andreou@google.com on 2012-09-19 at 09:40 PM Good suggestions, thanks for reading the code so deeply! This hasn't gotten nearly enough focused eyeballs yet, unfortunately. Hmm, there was a version where I was extracting two longs instead of two ints from the hash; I can't remember why I went with ints after all. I suspect I didn't see any fpp improvement from the longs and went for simpler arithmetic, ignoring the case of really large BFs where longs are preferable anyway. And yes, the modulo operator is unnecessarily performed too many times, well spotted! The other bugs related to very large BFs are straightforward as well. Guava team, can someone pick this up? The patch pretty much sums it up, though probably without the HashCode.asSecondLong(), and the for-loops are written in a slightly weird way. But since the serial form is designed for changes like this, it's easy. It's a bit unlikely that anyone is affected by any of these cases, so it's not quite severe, but it's still nice to get this into better shape in anticipation of such cases. Maartin, could you file a separate request for the "figure out the memory size of a BF"? I wish we had that as well, especially since there is no constructor that is configured by that. |
Original comment posted by Maaartinus on 2012-10-02 at 10:19 AM Agreed that this bug is unlikely to hit anyone soon. The separate RFE is issue 1156. |
Original comment posted by andreou@google.com on 2012-10-15 at 07:33 PM Kurt (Louis?), could someone take care of this? Inevitably more people are going to hit these bugs in very big BloomFilters. Owner: kak@google.com |
Original comment posted by kak@google.com on 2012-10-23 at 04:47 PM (No comment entered for this change.) Status: |
Original comment posted by spie...@addthis.com on 2014-03-06 at 11:50 PM Can we revisit this issue? Here is a version of the patch that is updated against the latest git master branch. The majority of the implementation is faithful to the original patch. One small bugfix: in BitArray#get() and BitArray#put() the right shift operations should be unsigned. I also cleaned up the looping constructs to make them easier to understand. I introduced HashCode#asLeastSignificantLong() and HashCode#asMostSignificantLong() methods. The names of the methods are intended to be similar to the names of the methods in java.util.UUID. With the introduction of a new enum in BloomFilterStrategies these patches are backwards compatible. This fixes #1656 |
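The updated patch itself is not reproduced above. A rough sketch of what the two accessors might look like, assuming the 128-bit hash code's bytes are in little-endian order (as Murmur3_128 produces); the standalone helpers below are hypothetical and not the patch's actual code:

```java
// Hypothetical helpers mirroring the java.util.UUID naming; bytes[0] is assumed
// to be the least significant byte of the 128-bit hash.
static long asLeastSignificantLong(byte[] bytes) {
  long result = 0;
  for (int i = 7; i >= 0; i--) {
    result = (result << 8) | (bytes[i] & 0xFF); // lower eight bytes
  }
  return result;
}

static long asMostSignificantLong(byte[] bytes) {
  long result = 0;
  for (int i = 15; i >= 8; i--) {
    result = (result << 8) | (bytes[i] & 0xFF); // upper eight bytes
  }
  return result;
}
```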
Original comment posted by kevinb@google.com on 2014-03-12 at 02:19 PM (No comment entered for this change.) Labels: |
Original comment posted by kak@google.com on 2014-03-27 at 08:50 PM OK, I think we've got a version we're going to check in. Can both spiegel@ and Maaartinus patch and test out this new BF strategy (patch against head, not Guava 16)? Also if you see anything obvious, please point that out as well. MURMUR128_MITZ_64() {
Please let me know. Thanks, Status: |
Original comment posted by kak@google.com on 2014-03-27 at 09:38 PM Also we're having a bit of internal debate over the methodology of making the combined hash non-negative before reducing it: flipping its bits when it's negative, versus masking off the sign bit. They both accomplish similar goals, but should one be preferable to the other? The branchless approach is more attractive runtime-wise, but it's not flipping the bits (just making the result positive). Any thoughts? |
Original comment posted by spie...@addthis.com on 2014-03-27 at 09:43 PM Yes, I don't think it matters what happens to the bits as long as the transformation doesn't mess up the distribution of values and the result is always positive. Another thought I had was applying index >>>= 1. It removes the sign bit and drops the least significant bit. All three proposals are OK with me. I kind of prefer the two solutions that do not have a branch, but for me that's not a big concern. |
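For reference, a minimal runnable sketch of the three candidates being discussed (the hash value and table size below are placeholders, not anything from the patch):

```java
public class NonNegativeIndexOptions {
  public static void main(String[] args) {
    long bitSize = 1L << 33;                    // placeholder table size
    long combinedHash = -1234567890123456789L;  // e.g. hash1 + i * hash2, may be negative

    long flipped = combinedHash < 0 ? ~combinedHash : combinedHash; // branchy: flip the bits if negative
    long masked  = combinedHash & Long.MAX_VALUE;                   // branchless: clear the sign bit
    long shifted = combinedHash >>> 1;                              // branchless: non-negative, drops the lowest bit

    System.out.println(flipped % bitSize);
    System.out.println(masked % bitSize);
    System.out.println(shifted % bitSize);
  }
}
```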
Original comment posted by Maaartinus on 2014-03-28 at 05:04 AM Running spiegel@'s BloomFilterTest says I see no point in Concerning the speed, there's something wrong: normal: https://microbenchmarks.appspot.com/runs/7f9093a6-52c7-463d-8438-19f3295b4723 It may be caused by the 64-bit division or by |
Original comment posted by Maaartinus on 2014-03-28 at 05:31 AM There are some errors in my attached BloomFilterStrategies; in particular, many shifts must be unsigned. However, fixing them doesn't change the speed. |
Original comment posted by cgdecker@google.com on 2014-03-28 at 04:27 PM FYI, in the current iteration of this we're avoiding the asBytes() call with a package-private backdoor on HashCode that (for a byte-array based HashCode such as this one) returns the internal array directly without cloning. |
Original comment posted by Maaartinus on 2014-03-28 at 05:03 PM Nice. Still, my 64-bit version is 30-50% faster than the original MURMUR128_MITZ_32 (which needs I found a nice modulus-eliminating hack. I think that the original BloomFilterTest was (slightly) wrong; with my updated version the false positive rates are fine for the 64-bit version. |
Original comment posted by kak@google.com on 2014-03-28 at 06:51 PM First, congrats on using nearly 30 minutes walltime of my 6-Core Xeon workstation with the GuavaBloomTest ;-) Second, here are my current results: MURMUR128_MITZ_32 MURMUR128_MITZ_64 These numbers seem wrong. Is there something wrong with the tool? |
Original comment posted by Maaartinus on 2014-03-28 at 07:23 PM Oh sorry.... I fixed the formula but blew the inputs. Change it to something sane (even 1e-10 is fine, but 1e-75 was only an exercise with maximum memory). I believe that the numbers are correct: MURMUR128_MITZ_32 is completely broken because of overflow. It sets only a few bits, so it reports an unbelievably small expected fpp and ends up with a terrible real fpp. For MURMUR128_MITZ_64 the expected fpp was 1e-75 and there was not a single false positive. That's fine. With proper figures you'll see proper data. Sorry again. |
Original comment posted by kak@google.com on 2014-03-28 at 07:37 PM OK, I re-ran the tests with 1e-10. Glad to see we have a reproducible failure case for MITZ_32. MURMUR128_MITZ_32 (what's currently committed in Guava): MURMUR128_MITZ_64 (my proposed fix, source is attached below): MURMUR128_MITZ_64() {
}; I realize yours may contain a few more small optimizations, but I'd prefer clarity and correctness over speed. If you can confirm that this fix is correct, we can discuss the tradeoffs between clarity and speed. |
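The attached source is not reproduced above. A minimal sketch of what such a 64-bit strategy looks like (only the put() side; get() is symmetric), assuming a BitArray with a long-based set()/bitSize() and the hypothetical long-extraction helpers sketched earlier; it illustrates the approach under discussion, not necessarily the exact code that was committed:

```java
// Two longs from the 128-bit murmur hash, combined as g_i(x) = h1(x) + i * h2(x).
// Fragment: this would live inside a BloomFilter.Strategy implementation.
<T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, BitArray bits) {
  long bitSize = bits.bitSize();
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
  long hash1 = asLeastSignificantLong(bytes); // hypothetical helper, lower 8 bytes
  long hash2 = asMostSignificantLong(bytes);  // hypothetical helper, upper 8 bytes

  boolean bitsChanged = false;
  long combinedHash = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    // Mask the sign bit so the index is non-negative, then reduce into the table.
    bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
    combinedHash += hash2; // advance to the next g_i
  }
  return bitsChanged;
}
```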
Original comment posted by spie...@addthis.com on 2014-03-28 at 07:54 PM +1 this looks good. I'll rerun the tests and I expect them to pass. I'm going to run the latest version (#18) in a profiler and look for any bottlenecks. |
Original comment posted by Maaartinus on 2014-03-28 at 08:33 PM Looks good, maybe except for The problem with later optimizations is serialization. When people start saving their filters, you can't really change the implementation anymore. You can create a new strategy, but then you have to maintain them all forever. No nice implementation-independent SerialForm works here. Unfortunately, there's no single big optimization here, just a couple of 10% speedup for 10 minutes of headache trade-offs. The good news is that the new version is faster than the broken one: https://microbenchmarks.appspot.com/runs/5d98efb8-9013-460b-9c0f-a576e6d158ba |
Original comment posted by kak@google.com on 2014-03-28 at 08:41 PM I think I'm going to leave the extra addition in. The paper that's referenced in the Javadoc, "Less Hashing, Same Performance: Building a Better Bloom Filter" (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf), uses this formula: gi(x) = h1(x) + i*h2(x) which is what we've got now. I'd rather leave it as is to match the paper (for clarity's sake) than tinker with it more. Thanks for all the help Martin + Spiegel. Submitting shortly! |
Original comment posted by Maaartinus on 2014-03-28 at 09:26 PM Right, but they also write "In our context i will range from 0 up to some number k − 1 to give k hash functions, and ...", so their very first hash function (the i = 0 case) is just h1(x), without the extra addition. |
Original comment posted by kak@google.com on 2014-03-28 at 09:42 PM Ugh, not sure how I missed that...made the appropriate change and re-ran the GuavaBloomTest. Results: Desired false positive rate: 1.0E-10 Status: |
Original issue created by Maaartinus on 2012-08-24 at 05:40 PM
**BUG 1:**
When the number of bits in a BloomFilter gets high, its FPP is much worse than expected. The culprit is the modular arithmetic in BloomFilterStrategies.MURMUR128_MITZ_32.
You compute `x % N`, where `N = bits.size()` and `x` is uniformly distributed in the range `0..Integer.MAX_VALUE`. For big `N`, `x % N` is far from uniform in the range `0..N-1`. For example, with `N = 3<<29`, values below `1<<29` are twice as probable as the others.
This non-uniformity leads to results like this:
desiredFpp 0.000001000000
expectedFpp 0.000000610089
realFpp 0.000003000000
Here, `desiredFpp` is the value used in `BloomFilter.create`, and `expectedFpp` was reported after exactly `expectedInsertions` were done. Obviously, many fewer bits than expected were set. If this happened once, it might be good luck, but here it's a sign of this bug, as the `realFpp` shows.
This problem is well reproducible; it's no glitch caused by bad luck with the selected values. AFAIK it concerns all versions since the switch away from power-of-two table sizes.
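A small runnable illustration of the bias claim (not from the original report; the parameters are chosen to match the `N = 3<<29` example):

```java
import java.util.Random;

// Counts how often x % N falls below 1<<29 versus into an equally sized window
// just above it, for uniformly random non-negative ints. The lower window is
// hit roughly twice as often, confirming the non-uniformity described above.
public class ModuloBiasDemo {
  public static void main(String[] args) {
    final int N = 3 << 29;       // 1_610_612_736
    final int window = 1 << 29;  // 536_870_912
    Random random = new Random(42);
    long low = 0, mid = 0;
    for (int i = 0; i < 10_000_000; i++) {
      int x = random.nextInt(Integer.MAX_VALUE); // uniform in 0..Integer.MAX_VALUE-1
      int r = x % N;
      if (r < window) low++;
      else if (r < 2 * window) mid++;
    }
    // Expect low/mid to be close to 2.0.
    System.out.printf("low=%d mid=%d ratio=%.2f%n", low, mid, (double) low / mid);
  }
}
```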
**BUG 2:**
With "31149b4 The number of bits can reach Integer.MAX_VALUE now, rather than Integer.MAX_VALUE/64" another bug was introduced. The commit message is obviously wrong, as up to `Integer.MAX_VALUE` longs can be allocated, allowing nearly `2**37` bits. However, the arithmetic is still int-based and allows addressing only `2**31` bits, so most of the allocated memory gets wasted.
Even worse, `bits.size()` may overflow, leading to all kinds of disaster, like "/ by zero" (e.g. for expectedInsertions=244412641 and desiredFpp=1e-11) or using only 64 bits.
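A minimal sketch of the int overflow (hypothetical numbers; it assumes the bit count is computed as an int product, as described above):

```java
public class BitSizeOverflowDemo {
  public static void main(String[] args) {
    // Suppose the BitArray holds this many longs (legal, since an array length is an int).
    int numLongs = 33_554_433;              // just over Integer.MAX_VALUE / 64
    int intBits  = numLongs * 64;           // int arithmetic overflows: prints -2147483584
    long longBits = (long) numLongs * 64;   // long arithmetic gives the real bit count: 2147483712

    System.out.println(intBits);
    System.out.println(longBits);

    // At 67_108_864 longs the int product is exactly 0, so any "x % size"
    // in the strategy throws "/ by zero".
    System.out.println(67_108_864 * 64);    // prints 0
  }
}
```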
**INEFFICIENCY:**
In `MURMUR128_MITZ_32` there is one modulus operation and one unpredictable branch per hash function. This is quite wasteful, as it's enough to compute the modulus for the two basic hashes and then use conditional subtraction, as sketched below.
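A rough sketch of the conditional-subtraction idea (placeholder names; it assumes a BitArray with a long-based set()/bitSize(), like the one sketched near the end of this issue):

```java
// Sets numHashFunctions bits using only two '%' operations in total; the loop
// replaces a per-iteration modulus with a well-predicted compare-and-subtract.
static void setBits(BitArray bits, long hash1, long hash2, int numHashFunctions) {
  long bitSize = bits.bitSize();
  long index = (hash1 & Long.MAX_VALUE) % bitSize;
  long step  = (hash2 & Long.MAX_VALUE) % bitSize;
  for (int i = 0; i < numHashFunctions; i++) {
    bits.set(index);
    index += step;
    if (index >= bitSize) {  // conditional subtraction keeps index in [0, bitSize)
      index -= bitSize;
    }
  }
}
```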
**ENHANCEMENT 1:**
As the filter may take up to 16 GB, there should be a method to find out its memory consumption.
**ENHANCEMENT 2:**
Possibly there could be a strategy using a power of two table, which may be faster. In case the speed up is non-negligible, such a strategy makes a lot of sense, as the additional memory (assuming rounding up) is not wasted at all -- you get better FPP.
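A small illustration of the power-of-two reduction (hypothetical values; this is not an existing Guava strategy):

```java
public class PowerOfTwoIndexDemo {
  public static void main(String[] args) {
    long combinedHash = 0x9E3779B97F4A7C15L;  // any 64-bit hash value
    long bitSize = 1L << 34;                  // power-of-two table size (rounded up)
    long mask = bitSize - 1;

    long maskedIndex = combinedHash & mask;                        // single AND, no division
    long moduloIndex = (combinedHash & Long.MAX_VALUE) % bitSize;  // the general-case reduction

    System.out.println(maskedIndex);
    System.out.println(moduloIndex);
  }
}
```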
**QUESTION:**
I see no reason for limiting `numHashFunctions` to 255. In the `SerialForm`, there's an `int`, so why?
**PROPOSED SOLUTION:**
Because of serialized-form compatibility, I'd suggest leaving MURMUR128_MITZ_32 alone and creating MURMUR128_MITZ_64, which uses long arithmetic throughout (extracting two longs from the hash instead of two ints).
The `BitArray` must use long indexes, and a long `bitCount()` and `size()`, too. This works with both strategies and the `SerialForm` needs no change.
For small filters (up to a few million bits, before the non-uniformity starts to cause problems) it's fine to use the old strategy; for larger ones the new one must be used. I'd suggest using the new one for all new filters.
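A minimal sketch of such a long-indexed BitArray, assuming the layout discussed in this thread (a bit per position in a long[], with unsigned shifts for the word index, as the patch comments point out); the class name and details are illustrative, not the committed Guava class:

```java
// Bit array addressed by long indexes; backed by a long[] so it can hold up to
// Integer.MAX_VALUE * 64 (~2^37) bits.
final class LongIndexedBitArray {
  private final long[] data;
  private long bitCount; // number of set bits, kept as a long

  LongIndexedBitArray(long bits) {
    this.data = new long[(int) ((bits + 63) >>> 6)]; // ceil(bits / 64) words
  }

  /** Sets the bit and returns true if it was previously clear. */
  boolean set(long index) {
    if (get(index)) {
      return false;
    }
    data[(int) (index >>> 6)] |= (1L << index); // unsigned shift picks the word
    bitCount++;
    return true;
  }

  boolean get(long index) {
    return (data[(int) (index >>> 6)] & (1L << index)) != 0;
  }

  /** Total number of bits; a long, so it can exceed Integer.MAX_VALUE. */
  long bitSize() {
    return (long) data.length * 64;
  }

  long bitCount() {
    return bitCount;
  }
}
```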
In order to get maximum speed, the following comes to mind:
- `HashCode.asSecondLong()`
- `numHashFunctions > 1`
The attached patch solves it all.