Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmSketch not benefitting from four rows #108

Closed
FrankReh opened this issue Dec 4, 2019 · 19 comments
Closed

cmSketch not benefitting from four rows #108

FrankReh opened this issue Dec 4, 2019 · 19 comments
Labels
optimization priority/P1 Serious issue that requires eventual attention (can wait a bit) status/accepted We accept to work on it.

Comments

@FrankReh
Copy link

FrankReh commented Dec 4, 2019

I believe the commit f823dc4 broke the purpose of the cm sketch to have four rows. The bitwise-xor and mask are not creating the mix of indexes expected by the design. The same Increment/Estimate results, the same values, are achieved from a single row with or without the bitwise-xor. The earlier implementation seems to have been a good and very fast approximation to a distinct hash for each row except when the high 32 bits were all zero. One solution to fixing the earlier version could be to bitwise-or a single bit into the top 32 half. I can provide my testing and benchmarking on this offline if interested.

For small values of row length as used in the current unit tests, this doesn't matter. As the row length gets larger, the gap between the earlier algorithm and the current one widens and I believe becomes significant.

@ben-manes
Copy link

ben-manes commented Dec 5, 2019

Caffeine uses the 32-bit spread function derived by computed analysis, as described on StackOverflow. While Java only offers a 32-bit Object.hashCode(), the quality of this function and the explicitly chosen seeds has been very good. I suspect that the usage of random seeds here would degrade the distribution quality in comparison.

A similar analysis offering other recommended spreading functions is suggested by hash-prospector.

@FrankReh
Copy link
Author

FrankReh commented Dec 5, 2019

@ben-manes Thanks for your comment and your interest in this feature. Your work on Caffeine and with W-TinyLFU is remarkable and I look forward to adding an LFU window in front of my own version of this package soon.

This issue I tried to raise yesterday I think I did a bad job of explaining. The current code doesn't work. The fact that a 64 bit variable named hash is an input and that there are variables with names seed[0] through seed[3] and that XOR is being used easily obfuscates the problem - and why it's been in the code base two months now and why it got past a code review.

There's no bit swapping - there is no independence between the four row indexes.

The number of columns in the cm-sketch defines the number of bits used in
the bit mask. The same bit mask is used for all four rows. For convenience, the mask is of the low order bits so a simple bitwise-and makes the pseudo-random looking value fit into a power-of-2 index for the row.

In case further elaboration is helpful: the number of columns tends to be small compared to 64 bits in the input space. Say it is chosen as 16. The sketch has 16 columns. The current code uses the same 16 bits from the input variable each of the four times - the other 48 bit values are totally ignored. There is nothing that fiddles with bit positions. The fact that a randomly chosen value for each row is used to bitwise-xor those 16 bits doesn't make the second, third or fourth row more useful. The tiny counters they end up holding are copies of the first row, just swizzled around.

In terms of the original cm-sketch paper, there is no independence between the four indexes, they are not pairwise independent.

The earlier code that indexed the counters of each row made an attempt to use all 64 bits, a different way for each row, so at least an approximation of pairwise independence.

One scheme, if one knew the cm-sketch structure was going to be 16x4, would be to use a different 16 bits of the input for each row's index. But if you can't rely on the input being well distributed (and the current default function in ristretto does not distribute well for keys that are recognized by the compiler as bytes or 32bit or 64 bit variables, then this would cause many tiny counters to be maxed out while not using most of the rest. So a true hashing scheme should be used, even when the key is fewer than 8 bytes. The default function does use a true hashing function for golang strings and byte slices so at least for those two forms of key, the 64 bits given to the cm-sketch will be well distributed. (I would suggest the golang types that are not strings or byte slices be copied, byte for byte, into a byte slice, and then using the byte slice hashing function provided by golang or by xxhash).

Lots of counters and other 64 bit variables use just one or two bytes of non zero data while most of the high order bytes are zero, so not putting those through a real hash also breaks a requirement of the cm-sketch algorithm - distributing the input well over all the tiny counters (just 4 bits wide in Caffeine's and Ristretto's case). But this is starting to digress too far from the first problem - the fact that the four rows are being indexed with the same small set of bits from the input. It's interesting that nobody noticed, even during the code review process and I believe it has to do with the variable names and the fact that the original code also used bitwise-XOR - just not well enough for the case where the high 32 bits were all zero.

To prove this to myself - because I wouldn't want to be wrong with a forum like this, I wrote unit tests with a evenly distributed pseudo-random input set and found the cm-sketch can be reduced to one row and the Estimate values it returns were identical to the four row version. Again, because higher order bits of the input are not used - the same small bit set are used for each index calculation and the latter three rows are reduced to being redundant.

I still hesitate pasting the code here. It is easily seen by following the GitHub provided link to the commit mentioned above.

@ben-manes
Copy link

Thanks @FrankReh.

I cannot speak to the Golang implementation, as it has many differences to the Java library that I wrote. Similar in concepts, yet different in all implementation choices. From glancing at its code, I can see why there would be concerns with the quality of its result.

Can you try evaluating with either (a) the changes that I suggested or (b) against the Java version? I believe that the Java version has a better distribution of the bits. It first scrambles the input hash using a verified good hash spreader. From this we use the lower bits to choose the four rows (out of 16) to operate against. It then uses a fixed set of seeds that are known to cause good, distinct spreads to apply against the scrambled hash. The top and bottom portion of the result is added together before modulating for the counter index. When running against various traces this resulted in cache hits rates very similar to if a 64-bit off-the-shelf CMS library was used instead.

The Java version would still have the problems of not being pairwise independent, but hopefully it has a better approximation that is good enough for our needs. But if it too has this problem then I'd like to incorporate your test case and apply a fix.

static final long[] SEED = { // A mixture of seeds from FNV-1a, CityHash, and Murmur3
    0xc3a5c85c97cb3127L, 0xb492b66fbe98f273L, 0x9ae16a3b2f90404fL, 0xcbf29ce484222325L};

/**
 * Returns the table index for the counter at the specified depth.
 *
 * @param item the element's hash
 * @param i the counter depth
 * @return the table index
 */
int indexOf(int item, int i) {
  long hash = SEED[i] * item;
  hash += hash >>> 32;
  return ((int) hash) & tableMask;
}

/**
 * Applies a supplemental hash function to a given hashCode, which defends against poor quality
 * hash functions.
 */
int spread(int x) {
  x = ((x >>> 16) ^ x) * 0x45d9f3b;
  x = ((x >>> 16) ^ x) * 0x45d9f3b;
  return (x >>> 16) ^ x;
}

@FrankReh
Copy link
Author

FrankReh commented Dec 5, 2019

@ben-manes
Thanks for the help. The 64 bit indexOf function works very well for four rows and 16 columns. Even better if I applied the spread to the 64 bits first. But what's really cool is that hash code generator you pointed us to written up at https://nullprogram.com/blog/2018/07/31/

His description of a low biased and perfectly invertible hash that his JIT program builds and evaluates is really something. I've been using it on a single core for the last couple hours and it has already spit out a version that works the best in all my tests so far. Far better than the original hash used in this package and better than any I had cobbled together. I have two test cases I'm using for comparisons, one that records when the first false estimate occurs just before a new item is added to the sketch, and one that records once .1% of the new items added to the sketch experienced false estimates.

For one test case, your indexOf(spread(x)) came out a little ahead. For the other, the JIT generated algorithm came out a little ahead. I'm going to let the JIT keep chugging to see if it finds something that by its definition is better. The score it gives the current one is a 21, which is high by its standards. It means the hash is too biased, favoring some bit flips over others, and that's just the kind of thing we want to smooth out when dealing with 4 bit counters that are easily maxed out.

But I've been limiting my tests today to cm-sketch structures that have exactly 64 counters in them, so 4 rows of 16 columns, or 8by8 or even 16by4, so I get a nice one bit per counter mapping. When I expand the tests to again use wider rows, I'll need more than 64 bits total to index them and your IndexOf function provides that nicely (256 bits). So unless the JIT finds something extra special or I figure out how to modify it your function looks like the right choice.

The other idea I'm mulling about is looking at the JIT code and see if I can tweak it to rule out poorly biased guesses more quickly so it can look at 256 to 256 bit hashes. Probably beyond the time I have but these are interesting tools and the ability to generate code to find a better fit for something is very useful.

And to save on run time costs, your approach of using a few bits after the spread to select which four rows to work on sounds very nice. I may settle on 8 rows, despite the slight added overhead because tests yesterday showed a marked benefit if I halved the columns but doubled the rows. But that was around an inflection point because doubling the rows and halving the columns one more time degraded the results - I think because it became too easy to max out a 4 bit counter. But your approach of having only a subset of rows in play for a given hash is nice. Yet another level of indirection that gives a nice time for space tradeoff.

But alas, another factor will have to be what these things do to the actual L1 and L2 cache. Too many bytes allocated to the cm-sketch and the read and write accesses to those counters will make the hardware caching thrash. But first things first - a hashing scheme that uses the 4 bit counters effectively over 4 or 8 rows. Thanks!

Already long winded so ...

Taken from the hash composer document, an important pair of goals of these kinds of hash are

  • High avalanche effect. When I flip one input bit, the output bits should each flip with a 50% chance.
  • Low bias. Ideally there is no correlation between which output bits flip for a particular flipped input bit.

I've written something that let's me see the avalanche and bias for each algorithm, one input bit at a time so there are only 64 rows per scheme to look at. A lot of the schemes we've been using or looking at are highly biased. Especially the ones that don't come with magic 64 bit seed values to create a good mix right off the bat.

@ben-manes
Copy link

ben-manes commented Dec 5, 2019

Thanks for the detailed analysis, @FrankReh. I'm no expert in hashing but do enjoy reading about other's findings and @thomasmueller is to thank for the scheme used in Caffeine. His description and the program he wrote to discover the constants are described in the StackOverflow post that I linked to above. For whatever the end result is, I'd be interested in seeing your code and porting any test cases to our frequency sketch to help ensure it retains a high quality scheme.

@thomasmueller
Copy link

thomasmueller commented Dec 11, 2019

Hi - I agree with both @FrankReh and @ben-manes. The blob post https://nullprogram.com/blog/2018/07/31/ is very good.

If possible, I think it makes sense to use 64-bit spread / mix functions now. Back when I was testing this, the 32-bit spread function (in Java) was still a bit faster, but now I think it no longer is. I suggest to use the Murmur 3 64-bit finalizer:

h ^= h >> 33;
h *= 0xff51afd7ed558ccd;
h ^= h >> 33;
h *= 0xc4ceb9fe1a85ec53;
h ^= h >> 33;

That is C / C++ code, Java code would be for example:

x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
x = x ^ (x >>> 33);

If you want to add a seed to that, then it is sufficient to add that to x, for example (Java):

public static long hash64(long x, long seed) {
    x += seed;
    x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
    x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
    x = x ^ (x >>> 33);
    return x;
}

@karlmcguire
Copy link
Contributor

karlmcguire commented Dec 13, 2019

Thank you for the detailed writeup @FrankReh and input from everyone else.

The broader issue mentioned is that, because we are accepting interface{} keys, the default z.KeyToHash function only guarantees a well distributed output for underlying string and []byte inputs. I've been considering making the public API require []byte keys for other reasons (#107), so this issue has pretty much made up my mind. Doing so will make the solution to this issue infinitely easier, in my opinion.

Now that we can assume a well distributed uint64 hash, @FrankReh's idea of using independent 16-bit segments for finding the counter index on each row is appealing to me because of its simplicity. I've written up a PoC here and it appears to be benefitting from the rows.

func (s *Sketch) Increment(hash uint64) {
    for i := range s.blocks {
        // fibonacci hashing 16-bit segment of the hash to get a different
        // block index on each row, reduce collisions
        mixed := fib.Hash(hash << (i * 16) >> 48)
        // right shift based on the size of the row
        block := &s.blocks[i][mixed>>s.offset]
        // shift determines whether we use the left or right half of the block
        shift := (mixed & 1) * 4
        if (*block>>shift)&0x0f < 15 {
            *block += 1 << shift
        }
    }
}

I'm using Fibonacci Hashing which is fast and provides a pretty good distribution for how simple it is. I'm not remotely close to an expert on hashing but this seems like a good solution.

I'm going to read some of the things suggested here and write some unit tests for Ristretto, then fix this issue.

@thomasmueller
Copy link

I see, however I think you can do even better (faster and more secure). "Fibonacci hashing is not a really good hash function" (this is a quote from the article you linked to), so it won't protect you from bad hash codes, and for good hash codes, it's not needed: For the case you mentioned, it's fast to generate a good 64-bit hash value first, then split it into 2 halves, and then in the loop just add. Same as done in the Google Guava Bloom filter, and here:
https://github.com/FastFilter/fastfilter_cpp/blob/master/src/bloom/bloom.h#L120

// good hash function (e.g. Murmur 3 64-bit finalizer)
int64_t hash = hasher(key);
// split in 2 halves
uint32_t a = (uint32_t)(hash >> 32);
uint32_t b = (uint32_t)hash;
for (int j = 0; j < k; j++) {
  int index = reduce(a, this->arrayLength);
  ...
  a += b;
}

The "reduce" part you can skip (it allows to use non-power-of-2 sizes). But the main loop is:

for i := range s.blocks {
    mixed := mixed + b

@FrankReh
Copy link
Author

@karlmcguire @thomasmueller Is there a concern the hash isn't good enough at distributing the entropy across all the row counters somewhat evenly? Or is the extra fibonacci hash or idea of adding the upper bits to the lower bits to solve the problem of needing fewer than 64 bits for the total block indexing?

@karlmcguire
Copy link
Contributor

Is there a concern the hash isn't good enough at distributing the entropy across all the row counters somewhat evenly?

Yes, this was my concern. If we're allowing people to provide their own KeyToHash I'd be more comfortable if there was an extra mixing function with low overhead (doesn't necessarily need to be a good hash function, just good enough at mixing a hash somewhat randomly).

@thomasmueller I agree that Fibonacci Hashing is not a good standalone hash function, but it does some mixing with just one multiplication operation. I like the solution you provided. I will try that out.

@ben-manes
Copy link

ben-manes commented Dec 19, 2019

Using @thomasmueller's idea of adding the seed, I adapted my sketch indexing function slightly. This variation ensures that a zero, a multitive constant, doesn't result in the same array indexes across all rows. The balance of cost versus quality of the mixing function followed by per-row seeded indexing has been pretty great in my experience.

int indexOf(int item, int i) {
  long hash = (item + SEED[i]) * SEED[i];
  hash += (hash >>> 32);
  return ((int) hash) & tableMask;
}

@karlmcguire
Copy link
Contributor

This solution proposed by @thomasmueller is about 30% faster without using Fibonacci Hashing. I think it's probably worth it just to make it known that KeyToHash needs to be a strong hashing function as we're not going to mix it.

func (s *Sketch) Increment(hash uint64) {
        // left shifting a because we'll later right shift it with s.offset
	a, b := hash<<32, hash
	for i := range s.blocks {
		// right shift based on the size of the row
		block := &s.blocks[i][a>>s.offset]
		// shift determines whether we use the left or right half of the block
		shift := (a & 1) * 4
		if (*block>>shift)&0x0f < 15 {
			*block += 1 << shift
		}
		a += b
	}
}

@karlmcguire
Copy link
Contributor

Hit ratio testing results:

  • 4 rows: 41% using S3 trace with 400,000 capacity
  • 1 row: 38% using S3 trace with 400,000 capacity
  • 4 rows: 43% using DS1 trace with 4,000,000 capacity
  • 1 row: 41% using DS1 trace with 4,000,000 capacity
  • 4 rows: 59% using OLTP trace with 10,000 capacity
  • 1 row: 59% using OLTP trace with 10,000 capacity

It appears multiple rows are benefitting the hit ratios now, will make a PR.

@FrankReh
Copy link
Author

Okay. Thanks for the clarification. I wasn't sure this level of code needed protecting from a user but makes sense if you are opening up the API. The addition seemed to loose a carry bit once in a while but I realize that is noise compared to what's going on with the TinyLFU anyway.

I don't have the tests in front of me right now but I thought I saw an order of magnitude improvement when going from one row to four. Could something be wrong with that hit ratio testing or maybe its not testing what I thought.

I had tried to create two levels of test criteria. Using a prng, how many random numbers could be inserted before the next random number falsely was reported as being in the cm-sketch already, and beyond that, how many could be inserted before .1% of those already added were being reported as already in before they were actually inserted. Not sure those are useful measurements but it showed the true value of the cm-sketch I thought and the numbers shot up as one would expect when going from one row to two, and then from two to four.

Anyway, I'm happy to wait to see what gets pushed.

@thomasmueller
Copy link

If we're allowing people to provide their own KeyToHash I'd be more comfortable if there was an extra mixing function with low overhead (doesn't necessarily need to be a good hash function, just good enough at mixing a hash somewhat randomly).

My main concern with Fibonacci hashing was that you did it inside the loop. I share the concern that if you allow people to provide their own KeyToHash (so their own hash function), it is problematic. Maybe you want to do at least some amount of mixing within the Increment method... For example, Java HashMap uses this: https://www.connect2java.com/tutorials/collections/how-hashmap-works-internally-in-java/

By the way if you want to speed this up some more, maybe this part could be made faster by using branchless code:

if (*block>>shift)&0x0f < 15 {
    *block += 1 << shift
}

Reason: wherever you have a branch, the CPU will try to predict the outcome. Here, it's probably quite hard, so you will have many branch misses. There are ways around that, by using branchless operations: https://stackoverflow.com/questions/25615440/clamped-increment-integer

For this case, I assume it would be something like this:

// get the old value
x = (*block >> shift) & 0xf
// convert 0xf -> 0, and everything else to a value smaller than 0
x = x - 0xf
// convert smaller than 0 to 1, and 0 to 0
x = x >> 31
// now x is only 1 if we need to add 1, and 0 otherwise
*block += x << shift

@karlmcguire
Copy link
Contributor

@FrankReh I will try to recreate the testing methods you used, although I'm pretty confident because I would never expect the (overall cache) hit ratio to change by orders of magnitude.

@thomasmueller I still need to think about mixing, if anything we'll probably go with the per-row seeds that @ben-manes is using.

As for the branchless operations, interestingly enough the code proposed by you/StackOverflow is about 60% slower than the "branching" method. Here's the two functions I benchmarked:

Branching: 94.6M ops/sec

func (s *Sketch) Increment(hash uint64) {
    a, b := hash<<32, hash
    for i := range s.blocks {
        block := &s.blocks[i][a>>s.offset]
        if shift := (a & 1) * 4; (*block>>shift)&0x0f < 15 {
            *block += 1 << shift
        }
        a += b
    }
}

Branchless: 52.3M ops/sec

func (s *Sketch) Increment(hash uint64) {
    a, b := hash<<32, hash
    for i := range s.blocks {
        block := &s.blocks[i][a>>s.offset]
        shift := (a & 1) * 4
        n := (*block >> shift) & 0x0f
        n = n - 0xf
        n = n >> 7
        *block += n << shift
        a += b
    }
}

Benchmark

func BenchmarkIncrement(b *testing.B) {
    s := NewSketch(1e5)
    h := fnv.New64a()
    hashes := make([]uint64, 1e6)
    for i := range hashes {
        h.Write([]byte(fmt.Sprintf("%d", i)))
        hashes[i] = h.Sum64()
    }
    b.SetBytes(1)
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        s.Increment(hashes[n&(1e6-1)])
    }
}

I'm not super knowledgable about Go's compiler but I assume that it's doing something weird and that's what's causing this. But I do find it odd...

@ben-manes
Copy link

It was also slower in Java, but the branchless had a large error bound that I think it is negligible. The JIT is pretty good and I didn't look at the assembly to determine how it might have optimized it.

branch: 160,374,302.050 ± 20,625,286.268  ops/s
branch-free: 102,463,137.106 ± 67,214,204.145  ops/s

As much as I like the idea of this optimization, the sketch hasn't been a bottleneck in my experience since its operated on async to the user-facing calls (during cache maintenance rather than on the read/write paths).

@andersfylling
Copy link
Contributor

What version of Go are you guys using in these tests, 1.12.0?

@martinmr martinmr added optimization status/accepted We accept to work on it. priority/P1 Serious issue that requires eventual attention (can wait a bit) labels Feb 21, 2020
@minhaj-shakeel
Copy link
Contributor

Github issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.

drawing

FrankReh added a commit to FrankReh/ristretto that referenced this issue Dec 23, 2022
One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the
counters of each row to ensure that each row isn't just essentially a
copy of another row, where the only difference is which position the
counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
to a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
    other ideas for incrementing the counters were considered
    by previous authors as well.
FrankReh added a commit to FrankReh/ristretto that referenced this issue Dec 23, 2022
One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the
counters of each row to ensure that each row isn't just essentially a
copy of another row, where the only difference is which position the
counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
to a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
    other ideas for incrementing the counters were considered
    by previous authors as well.

Fix existing cmSketch test and add improved cmSketch test

(Marked as draft because this introduces a test that fails for now. I can
commit a fix to the cmSketch increment method to get the new test to
pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108.
I tried a few years ago.)

One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem with
    the actual increment method. A new Sketch test is added that goes
    further and sorts the counters of each row to ensure that each row
    isn't just essentially a copy of another row, where the only
    difference is which position the counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
with a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
    other ideas for incrementing the counters were considered by
    previous authors of the original Java code as well.
FrankReh added a commit to FrankReh/ristretto that referenced this issue Dec 23, 2022
Fixes dgraph-io#108.

Actually there is one fix and one improvement.

Fix:

The fix addresses the initial concern raised in dgraph-io#108, namely that there
had been a prior commit that removed the influence of the row index `i`
on the index used to access the counter field. So while there were four
rows, the results were like there was just one row. The benefit of three
extra rows in the sketch was lost. This commit fixes that by using part
of the spread key differently for each row again. Now the minimum
that is computed by the `Estimate` function really has four distinct
values to work with, rather than four copies of the same counter value.

Improvement:

A few years ago I had the discussion in dgraph-io#108 and ended up getting xor
and shift operations from the Caffeine author and the
[hash-prospector](https://github.com/skeeto/hash-prospector) tool to be
able to index into the cmSketch rows much more uniformly, even when the
keys being given had low entropy.

In further testing, this spread the indexes being used much more
evenly, meaning there would be fewer times the counter value had to
be clipped at the max value, and this should logically make the value
returned by `Estimate` more accurate.

I used the hash-propector tool for 64 bit hashes to come up with the
order of xor and shift operations and the constants to use but given the
size of the search space, of course could not get an exhaustive answer.
And in the years since, the tool has improved. I see a lot of progress
was made for 32 bit hashes. Someone is welcome to look for new constants
for an even better result, or even if they just wanted to document where
the constants come from better than I can now.
@FrankReh FrankReh mentioned this issue Dec 23, 2022
mangalaman93 pushed a commit to FrankReh/ristretto that referenced this issue Oct 14, 2024
One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the
counters of each row to ensure that each row isn't just essentially a
copy of another row, where the only difference is which position the
counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
to a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
    other ideas for incrementing the counters were considered
    by previous authors as well.

Fix existing cmSketch test and add improved cmSketch test

(Marked as draft because this introduces a test that fails for now. I can
commit a fix to the cmSketch increment method to get the new test to
pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108.
I tried a few years ago.)

One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem with
    the actual increment method. A new Sketch test is added that goes
    further and sorts the counters of each row to ensure that each row
    isn't just essentially a copy of another row, where the only
    difference is which position the counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
with a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
    other ideas for incrementing the counters were considered by
    previous authors of the original Java code as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization priority/P1 Serious issue that requires eventual attention (can wait a bit) status/accepted We accept to work on it.
Development

Successfully merging a pull request may close this issue.

7 participants