BloomFilter broken when really big #1119
Comments
Original comment posted by kurt.kluever on 2012-08-24 at 06:11 PM (No comment entered for this change.) Owner: andreou@google.com |
Original comment posted by Maaartinus on 2012-09-19 at 09:08 PM I wouldn't say that an arbitrarily bad FPP and "/ by zero" are an RFE. It happens for big sizes only, but nobody said those aren't allowed. |
Original comment posted by andreou@google.com on 2012-09-19 at 09:40 PM Good suggestions, thanks for reading the code so deeply! This hasn't gotten nearly enough focused eyeballs yet, unfortunately. Hmm, there was a version where I was extracting two longs instead of two ints from the hash; I can't remember why I went with ints after all. I suspect I didn't see any fpp improvement from the longs and went for simpler arithmetic, ignoring the case of really large BFs where longs are preferable anyway. And yes, the modulo operator is unnecessarily performed too many times, well spotted! The other bugs related to very large BFs are straightforward as well. Guava team, can someone pick this up? The patch pretty much sums it up, though probably without the HashCode.asSecondLong(), and the for-loops are written in a slightly weird way. But since the serial form is designed for changes like this, it's easy. It's a bit unlikely that anyone is affected by any of these cases, so it's not quite severe, but it's still nice to get this into better shape in anticipation of such cases. Maartin, could you file a separate request for the "figure out the memory size of a BF"? I wish we had that as well, especially since there is no constructor that is configured by that. |
Original comment posted by Maaartinus on 2012-10-02 at 10:19 AM Agreed that this bug is unlikely to hit anyone soon. The separate RFE is issue 1156. |
Original comment posted by andreou@google.com on 2012-10-15 at 07:33 PM Kurt (Louis?), could someone take care of this? Inevitably more people are going to hit these bugs in very big BloomFilters. Owner: kak@google.com |
Original comment posted by kak@google.com on 2012-10-23 at 04:47 PM (No comment entered for this change.) Status: |
Original comment posted by spie...@addthis.com on 2014-03-06 at 11:50 PM Can we revisit this issue? Here is a version of the patch that is updated against the latest git master branch. The majority of the implementation is faithful to the original patch. One small bugfix: in BitArray#get() and BitArray#put() the right shift operations should be unsigned. I also cleaned up the looping constructs to make them easier to understand. I introduced HashCode#asLeastSignificantLong() and HashCode#asMostSignificantLong() methods. The names of the methods are intended to be similar to the names of the methods in java.util.UUID. With the introduction of a new enum in BloomFilterStrategies these patches are backwards compatible. This fixes #1656 |
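The updated patch itself is not reproduced above. A rough sketch of what the two accessors might look like, assuming the 128-bit hash code's bytes are in little-endian order (as Murmur3_128 produces); the standalone helpers below are hypothetical and not the patch's actual code:

```java
// Hypothetical helpers mirroring the java.util.UUID naming; bytes[0] is assumed
// to be the least significant byte of the 128-bit hash.
static long asLeastSignificantLong(byte[] bytes) {
  long result = 0;
  for (int i = 7; i >= 0; i--) {
    result = (result << 8) | (bytes[i] & 0xFF); // lower eight bytes
  }
  return result;
}

static long asMostSignificantLong(byte[] bytes) {
  long result = 0;
  for (int i = 15; i >= 8; i--) {
    result = (result << 8) | (bytes[i] & 0xFF); // upper eight bytes
  }
  return result;
}
```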
Original comment posted by kevinb@google.com on 2014-03-12 at 02:19 PM (No comment entered for this change.) Labels: |
Original comment posted by kak@google.com on 2014-03-27 at 08:50 PM OK, I think we've got a version we're going to check in. Can both spiegel@ and Maaartinus patch and test out this new BF strategy (patch against head, not Guava 16)? Also if you see anything obvious, please point that out as well. MURMUR128_MITZ_64() {
Please let me know. Thanks, Status: |
Original comment posted by kak@google.com on 2014-03-27 at 09:38 PM Also we're having a bit of internal debate over the methodology of making the combined hash non-negative before reducing it: flipping its bits when it's negative, versus masking off the sign bit. They both accomplish similar goals, but should one be preferable to the other? The branchless approach is more attractive runtime-wise, but it's not flipping the bits (just making the result positive). Any thoughts? |
Original comment posted by spie...@addthis.com on 2014-03-27 at 09:43 PM Yes, I don't think it matters what happens to the bits as long as the transformation doesn't mess up the distribution of values and the result is always positive. Another thought I had was applying index >>>= 1. It removes the sign bit and drops the least significant bit. All three proposals are OK with me. I kind of prefer the two solutions that do not have a branch, but for me that's not a big concern. |
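For reference, a minimal runnable sketch of the three candidates being discussed (the hash value and table size below are placeholders, not anything from the patch):

```java
public class NonNegativeIndexOptions {
  public static void main(String[] args) {
    long bitSize = 1L << 33;                    // placeholder table size
    long combinedHash = -1234567890123456789L;  // e.g. hash1 + i * hash2, may be negative

    long flipped = combinedHash < 0 ? ~combinedHash : combinedHash; // branchy: flip the bits if negative
    long masked  = combinedHash & Long.MAX_VALUE;                   // branchless: clear the sign bit
    long shifted = combinedHash >>> 1;                              // branchless: non-negative, drops the lowest bit

    System.out.println(flipped % bitSize);
    System.out.println(masked % bitSize);
    System.out.println(shifted % bitSize);
  }
}
```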
Original comment posted by Maaartinus on 2014-03-28 at 05:04 AM Running spiegel@'s BloomFilterTest says I see no point in Concerning the speed, there's something wrong: normal: https://microbenchmarks.appspot.com/runs/7f9093a6-52c7-463d-8438-19f3295b4723 It may be caused by the 64-bit division or by |
Original comment posted by Maaartinus on 2014-03-28 at 05:31 AM There are some errors in my attached BloomFilterStrategies; in particular, many shifts must be unsigned. However, fixing them doesn't change the speed. |
Original comment posted by cgdecker@google.com on 2014-03-28 at 04:27 PM FYI, in the current iteration of this we're avoiding the asBytes() call with a package-private backdoor on HashCode that (for a byte-array based HashCode such as this one) returns the internal array directly without cloning. |
Original comment posted by Maaartinus on 2014-03-28 at 05:03 PM Nice. Still, my 64-bit version is 30-50% faster than the original MURMUR128_MITZ_32 (which needs I found a nice modulus-eliminating hack. I think that the original BloomFilterTest was (slightly) wrong; with my updated version the false positive rates are fine for the 64-bit version. |
Original comment posted by kak@google.com on 2014-03-28 at 06:51 PM First, congrats on using nearly 30 minutes walltime of my 6-Core Xeon workstation with the GuavaBloomTest ;-) Second, here are my current results: MURMUR128_MITZ_32 MURMUR128_MITZ_64 These numbers seem wrong. Is there something wrong with the tool? |
Original comment posted by Maaartinus on 2014-03-28 at 07:23 PM Oh sorry.... I fixed the formula but blew the inputs. Change it to something sane (even 1e-10 is fine, but 1e-75 was only an exercise with maximum memory). I believe that the numbers are correct: MURMUR128_MITZ_32 is completely broken because of overflow. It sets only a few bits, so it reports an unbelievably small expected fpp and ends up with a terrible real fpp. For MURMUR128_MITZ_64 the expected fpp was 1e-75 and there was not a single false positive. That's fine. With proper figures you'll see proper data. Sorry again. |
Original comment posted by kak@google.com on 2014-03-28 at 07:37 PM OK, I re-ran the tests with 1e-10. Glad to see we have a reproducible failure case for MITZ_32. MURMUR128_MITZ_32 (what's currently committed in Guava): MURMUR128_MITZ_64 (my proposed fix, source is attached below): MURMUR128_MITZ_64() {
}; I realize yours may contain a few more small optimizations, but I'd prefer clarity and correctness over speed. If you can confirm that this fix is correct, we can discuss the tradeoffs between clarity and speed. |
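The attached source is not reproduced above. A minimal sketch of what such a 64-bit strategy looks like (only the put() side; get() is symmetric), assuming a BitArray with a long-based set()/bitSize() and the hypothetical long-extraction helpers sketched earlier; it illustrates the approach under discussion, not necessarily the exact code that was committed:

```java
// Two longs from the 128-bit murmur hash, combined as g_i(x) = h1(x) + i * h2(x).
// Fragment: this would live inside a BloomFilter.Strategy implementation.
<T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, BitArray bits) {
  long bitSize = bits.bitSize();
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
  long hash1 = asLeastSignificantLong(bytes); // hypothetical helper, lower 8 bytes
  long hash2 = asMostSignificantLong(bytes);  // hypothetical helper, upper 8 bytes

  boolean bitsChanged = false;
  long combinedHash = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    // Mask the sign bit so the index is non-negative, then reduce into the table.
    bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
    combinedHash += hash2; // advance to the next g_i
  }
  return bitsChanged;
}
```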
Original comment posted by spie...@addthis.com on 2014-03-28 at 07:54 PM +1 this looks good. I'll rerun the tests and I expect them to pass. I'm going to run the latest version (#18) in a profiler and look for any bottlenecks. |
Original comment posted by Maaartinus on 2014-03-28 at 08:33 PM Looks good, maybe except for The problem with later optimizations is serialization. When people start saving their filters, you can't really change the implementation anymore. You can create a new strategy, but then you have to maintain them all forever. No nice implementation-independent SerialForm works here. Unfortunately, there's no single big optimization here, just a couple of 10% speedup for 10 minutes of headache trade-offs. The good news is that the new version is faster than the broken one: https://microbenchmarks.appspot.com/runs/5d98efb8-9013-460b-9c0f-a576e6d158ba |
Original comment posted by kak@google.com on 2014-03-28 at 08:41 PM I think I'm going to leave the extra addition in. The paper that's referenced in the Javadoc, "Less Hashing, Same Performance: Building a Better Bloom Filter" (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf), uses this formula: gi(x) = h1(x) + i*h2(x) which is what we've got now. I'd rather leave it as is to match the paper (for clarity's sake) than tinker with it more. Thanks for all the help Martin + Spiegel. Submitting shortly! |
Original comment posted by Maaartinus on 2014-03-28 at 09:26 PM Right, but they also write "In our context i will range from 0 up to some number k − 1 to give k hash functions, and ...", so their very first hash function (the i = 0 case) is just h1(x), without the extra addition. |
Original comment posted by kak@google.com on 2014-03-28 at 09:42 PM Ugh, not sure how I missed that...made the appropriate change and re-ran the GuavaBloomTest. Results: Desired false positive rate: 1.0E-10 Status: |
Original issue created by Maaartinus on 2012-08-24 at 05:40 PM
**BUG 1:**
When the number of bits in a BloomFilter gets high, its FPP is much worse than expected. The culprit is the modular arithmetic in BloomFilterStrategies.MURMUR128_MITZ_32.
You compute `x % N`, where `N = bits.size()` and `x` is uniformly distributed in the range `0..Integer.MAX_VALUE`. For big `N`, `x % N` is far from uniform in the range `0..N-1`. For example, with `N = 3<<29`, values below `1<<29` are twice as probable as the others.
This non-uniformity leads to results like this:
desiredFpp 0.000001000000
expectedFpp 0.000000610089
realFpp 0.000003000000
Here, `desiredFpp` is the value used in `BloomFilter.create`, and `expectedFpp` was reported after exactly `expectedInsertions` were done. Obviously, many fewer bits than expected were set. If this happened once, it might be good luck, but here it's a sign of this bug, as the `realFpp` shows.
This problem is well reproducible; it's no glitch caused by bad luck with the selected values. AFAIK it concerns all versions since the switch away from power-of-two table sizes.
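A small runnable illustration of the bias claim (not from the original report; the parameters are chosen to match the `N = 3<<29` example):

```java
import java.util.Random;

// Counts how often x % N falls below 1<<29 versus into an equally sized window
// just above it, for uniformly random non-negative ints. The lower window is
// hit roughly twice as often, confirming the non-uniformity described above.
public class ModuloBiasDemo {
  public static void main(String[] args) {
    final int N = 3 << 29;       // 1_610_612_736
    final int window = 1 << 29;  // 536_870_912
    Random random = new Random(42);
    long low = 0, mid = 0;
    for (int i = 0; i < 10_000_000; i++) {
      int x = random.nextInt(Integer.MAX_VALUE); // uniform in 0..Integer.MAX_VALUE-1
      int r = x % N;
      if (r < window) low++;
      else if (r < 2 * window) mid++;
    }
    // Expect low/mid to be close to 2.0.
    System.out.printf("low=%d mid=%d ratio=%.2f%n", low, mid, (double) low / mid);
  }
}
```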
**BUG 2:**
With "31149b4 The number of bits can reach Integer.MAX_VALUE now, rather than Integer.MAX_VALUE/64" another bug was introduced. The commit message is obviously wrong, as up to `Integer.MAX_VALUE` longs can be allocated, allowing nearly `2**37` bits. However, the arithmetic is still int-based and allows addressing only `2**31` bits, so most of the allocated memory gets wasted.
Even worse, `bits.size()` may overflow, leading to all kinds of disaster, like "/ by zero" (e.g. for expectedInsertions=244412641 and desiredFpp=1e-11) or using only 64 bits.
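A minimal sketch of the int overflow (hypothetical numbers; it assumes the bit count is computed as an int product, as described above):

```java
public class BitSizeOverflowDemo {
  public static void main(String[] args) {
    // Suppose the BitArray holds this many longs (legal, since an array length is an int).
    int numLongs = 33_554_433;              // just over Integer.MAX_VALUE / 64
    int intBits  = numLongs * 64;           // int arithmetic overflows: prints -2147483584
    long longBits = (long) numLongs * 64;   // long arithmetic gives the real bit count: 2147483712

    System.out.println(intBits);
    System.out.println(longBits);

    // At 67_108_864 longs the int product is exactly 0, so any "x % size"
    // in the strategy throws "/ by zero".
    System.out.println(67_108_864 * 64);    // prints 0
  }
}
```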
**INEFFICIENCY:**
In `MURMUR128_MITZ_32` there is one modulus operation and one unpredictable branch per hash function. This is quite wasteful, as it's enough to compute the modulus for the two basic hashes and then use conditional subtraction, as sketched below.
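A rough sketch of the conditional-subtraction idea (placeholder names; it assumes a BitArray with a long-based set()/bitSize(), like the one sketched near the end of this issue):

```java
// Sets numHashFunctions bits using only two '%' operations in total; the loop
// replaces a per-iteration modulus with a well-predicted compare-and-subtract.
static void setBits(BitArray bits, long hash1, long hash2, int numHashFunctions) {
  long bitSize = bits.bitSize();
  long index = (hash1 & Long.MAX_VALUE) % bitSize;
  long step  = (hash2 & Long.MAX_VALUE) % bitSize;
  for (int i = 0; i < numHashFunctions; i++) {
    bits.set(index);
    index += step;
    if (index >= bitSize) {  // conditional subtraction keeps index in [0, bitSize)
      index -= bitSize;
    }
  }
}
```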
**ENHANCEMENT 1:**
As the filter may take up to 16 GB, there should be a method to find out its memory consumption.
**ENHANCEMENT 2:**
Possibly there could be a strategy using a power of two table, which may be faster. In case the speed up is non-negligible, such a strategy makes a lot of sense, as the additional memory (assuming rounding up) is not wasted at all -- you get better FPP.
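A small illustration of the power-of-two reduction (hypothetical values; this is not an existing Guava strategy):

```java
public class PowerOfTwoIndexDemo {
  public static void main(String[] args) {
    long combinedHash = 0x9E3779B97F4A7C15L;  // any 64-bit hash value
    long bitSize = 1L << 34;                  // power-of-two table size (rounded up)
    long mask = bitSize - 1;

    long maskedIndex = combinedHash & mask;                        // single AND, no division
    long moduloIndex = (combinedHash & Long.MAX_VALUE) % bitSize;  // the general-case reduction

    System.out.println(maskedIndex);
    System.out.println(moduloIndex);
  }
}
```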
**QUESTION:**
I see no reason for limiting `numHashFunctions` to 255. In the `SerialForm`, there's an `int`, so why?
**PROPOSED SOLUTION:**
Because of serialized-form compatibility, I'd suggest leaving MURMUR128_MITZ_32 alone and creating MURMUR128_MITZ_64, which uses long arithmetic throughout (extracting two longs from the hash instead of two ints).
The `BitArray` must use long indexes, and a long `bitCount()` and `size()`, too. This works with both strategies and the `SerialForm` needs no change.
For small filters (up to a few million bits, before the non-uniformity starts to cause problems) it's fine to use the old strategy; for larger ones the new one must be used. I'd suggest using the new one for all new filters.
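A minimal sketch of such a long-indexed BitArray, assuming the layout discussed in this thread (a bit per position in a long[], with unsigned shifts for the word index, as the patch comments point out); the class name and details are illustrative, not the committed Guava class:

```java
// Bit array addressed by long indexes; backed by a long[] so it can hold up to
// Integer.MAX_VALUE * 64 (~2^37) bits.
final class LongIndexedBitArray {
  private final long[] data;
  private long bitCount; // number of set bits, kept as a long

  LongIndexedBitArray(long bits) {
    this.data = new long[(int) ((bits + 63) >>> 6)]; // ceil(bits / 64) words
  }

  /** Sets the bit and returns true if it was previously clear. */
  boolean set(long index) {
    if (get(index)) {
      return false;
    }
    data[(int) (index >>> 6)] |= (1L << index); // unsigned shift picks the word
    bitCount++;
    return true;
  }

  boolean get(long index) {
    return (data[(int) (index >>> 6)] & (1L << index)) != 0;
  }

  /** Total number of bits; a long, so it can exceed Integer.MAX_VALUE. */
  long bitSize() {
    return (long) data.length * 64;
  }

  long bitCount() {
    return bitCount;
  }
}
```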
In order to get maximum speed, the following comes to mind:
- `HashCode.asSecondLong()`
- `numHashFunctions > 1`
The attached patch solves it all.