Implement sharding based on Rendezvous Hashing #238

Open
wants to merge 1 commit into master

Conversation

meroton-benjamin

As part of buildbarn/bb-adrs#5 we are attempting to improve the resharding experience. In this pull request we have an implementation of sharding based on Rendezvous Hashing where shards are described by a map instead of an array.

@meroton-benjamin
Author

The newest implementation uses integer math with fixed-point calculations. Log2 is calculated by taking the integer part from the number of bits, combined with a lookup table and linear interpolation for the fractional part.
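Restated for clarity (my paraphrase, not text from the PR): writing $x = 2^m \cdot f$ with $m = \lfloor \log_2 x \rfloor$ and mantissa $f \in [1, 2)$, we have

$\log_2(x) = m + \log_2(f)$

where $m$ comes directly from the bit length of $x$ and $\log_2(f)$ is approximated from a small lookup table with linear interpolation, entirely in fixed-point integer arithmetic.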

I have evaluated the error by running a billion entries over 5 shards of weight 1, 2, 4, 7, 1 respectively. I have also experimented with skipping the linear interpolation and/or the lookup table.

The current solution takes about 45 seconds on my laptop to calculate a billion entries. It has a distribution error of <5e-3 using a lookup table of 130 bytes.

It is possible to get higher precision: using a lookup table of 1 kilobyte, the error goes down to 5e-4, which is similar to what I got with the math.Log2 solution.

It is also possible to speed up the implementation by using a larger lookup table and foregoing the interpolation: with a lookup table of 260 kilobytes it also reaches a relative error of 5e-4, but can instead do the billion entries in about 30 seconds.

Skipping the lookup table and doing straight interpolation is the fastest, but gives significant errors (~15%).

There are also some tests added to verify that the distribution is unchanged, to prevent accidental changes. They verify that the exact distribution of the first 10 million entries is unchanged and that the exact first 20 values are unchanged. They take about half a second to run on my machine.

The lookup table is calculated at package initialization. This could potentially give different answers in case the lookup table is calculated on a platform which produces a different value when rounding log2 to 20 bits of precision. To guard against this I have added a checksum for verifying the lookup table at creation of a ShardedBlobAccess. Another, more stable, solution would be to calculate it outside of the process and have Bazel inject the lookup table.
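As an illustration only (the function name, the table name lut, the checksum algorithm, and the placeholder constant below are hypothetical, not the PR's actual code; assumes "encoding/binary", "fmt", and "hash/fnv" are imported and lut is a package-level slice or array of uint32), such a guard could look roughly like this:

// verifyLookupTable recomputes a checksum over the lookup table built
// in init() and compares it against a value recorded on the reference
// platform. A platform that rounds log2 differently fails loudly here
// instead of silently producing a different shard distribution.
func verifyLookupTable() error {
	h := fnv.New64a()
	var buf [4]byte
	for _, v := range lut {
		binary.LittleEndian.PutUint32(buf[:], v)
		h.Write(buf[:])
	}
	// Placeholder: the real value would be captured once on the
	// reference platform and checked into the repository.
	const expectedChecksum = 0x0
	if got := h.Sum64(); got != expectedChecksum {
		return fmt.Errorf("lookup table checksum mismatch: got %#x, want %#x", got, expectedChecksum)
	}
	return nil
}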

Since changing the constants gives small perturbations to the distribution, we should probably be pretty set on the parameters we want to use for interpolation.

@EdSchouten
Member

EdSchouten commented Feb 16, 2025

I think the math we do can be simplified a bit further. What if we created a LUT of $-1/\log(x)$, using an algorithm like the one below?

// Create a lookup table for f(x) = -1/log(x) between 0<=x<=1,
// normalized to the range of uint16, which can be used to
// perform linear interpolation.
const inBits = 6
const outBits = 16 + 1
const maxOut = math.MaxUint16
var lut [1<<inBits + 1]uint16

// Scale the lookup table so that the final entry to be used as
// a base (i.e. the penultimate entry of the lookup table) is
// set to 2^16-1.
scale := maxOut * math.Log(float64(1<<inBits-1)/float64(1<<inBits))
for i := 0; i < 1<<inBits; i++ {
        lut[i] = uint16(scale / math.Log(float64(i)/float64(1<<inBits)))
}

// -1/log(x) converges to infinity at x=1. Work around this by
// ensuring the delta between the two final entries is 2^16-1.
lut[1<<inBits] = lut[1<<inBits-1] + maxOut

That will yield the following LUT:

var lut = [...]uint16{
        0x0000, 0x00f8, 0x0129, 0x0151, 0x0174, 0x0194, 0x01b4, 0x01d2,
        0x01f0, 0x020e, 0x022b, 0x024a, 0x0268, 0x0287, 0x02a7, 0x02c7,
        0x02e8, 0x030a, 0x032d, 0x0351, 0x0377, 0x039e, 0x03c6, 0x03f0,
        0x041c, 0x0449, 0x0479, 0x04ab, 0x04e0, 0x0517, 0x0552, 0x058f,
        0x05d0, 0x0616, 0x065f, 0x06ae, 0x0701, 0x075b, 0x07bb, 0x0823,
        0x0893, 0x090d, 0x0992, 0x0a23, 0x0ac2, 0x0b72, 0x0c35, 0x0d0e,
        0x0e03, 0x0f18, 0x1054, 0x11c1, 0x136a, 0x1560, 0x17ba, 0x1a9a,
        0x1e31, 0x22ce, 0x28f4, 0x318f, 0x3e77, 0x53f9, 0x7efb, 0xffff, // <-- Note how this is 2^16-1.
        0xfffe, // <-- And this is 2^16-1 more than the entry before.
}

We can then compute the following:

hash := uint32(....)

// Destructure the hash into top/bottom bits.
const trailingBits = 32 - inBits
top := hash >> trailingBits
bottom := hash & (1<<trailingBits - 1)

// Obtain the base and the slope from the lookup table.
base := lut[top]
slope := lut[top+1] - base

// Compute -1/log(x) normalized to the range of uint32.
oneDividedByLog := uint64(base)<<trailingBits + uint64(bottom)*uint64(slope)
oneDividedByLogTruncated := oneDividedByLog >> (outBits-inBits)

Because weights are also 32-bit integers, we can then compute the score (without any risk of overflows) as follows:

score := uint64(weight) * oneDividedByLogTruncated

Edit: That said, instead of shifting oneDividedByLog by outBits-inBits, you could first call bits.Len() on all weights to figure out their magnitude. Most of the time people set weights to low values (e.g., just 1). In those cases fewer bits need to be discarded, or none at all. oneDividedByLog only sets the bottom 32-6+17=43 bits, meaning that if the weights are all smaller than $2^{64-43}=2^{21}$, no precision is lost.
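A rough sketch of what that could look like (maxWeight is a hypothetical precomputed maximum of all weights; oneDividedByLog, trailingBits, and outBits are as in the snippets above; assumes "math/bits" is imported; this is my illustration, not code from the PR):

// Only discard as many bits as needed for the multiplication by the
// largest weight to stay within 64 bits. oneDividedByLog occupies at
// most trailingBits+outBits = 43 bits.
shift := bits.Len32(maxWeight) - (64 - (trailingBits + outBits))
if shift < 0 {
	shift = 0 // all weights below 2^21: no precision is lost
}
score := uint64(weight) * (oneDividedByLog >> shift)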

Would it make sense to use an algorithm like this instead?

Comment on lines +51 to +36
if collision, exists := keyMap[hash]; exists {
	return nil, fmt.Errorf("Hash collision between shards: %s and %s", shard.Key, collision)
}
Member

Considering that we're already using a map at the Proto level, should we even care about this? I think we should think of this being a precondition to calling this function.

Author

In case of a hash collision between two different keys, the latter of the backends would become unused, and since the iteration order of the map is not well defined, which backend counts as the latter would vary from instance to instance.

Since we're using 64 bits from SHA-256, hash collisions are unlikely (even with thousands of shards, the birthday-bound probability is far below 1e-10). We could remove the check if you prefer, but it's only performed at creation of the shard selector anyway.

@EdSchouten
Member

My concern is that this new algorithm may be (significantly) slower if the number of shards is high. Could we please offer the existing algorithm as a secondary implementation?

@meroton-benjamin
Author

I have an implementation of your suggested modification. It is effectively almost the same implementation; see below for the differences.

  • F suffix = estimate the fraction -1/log(x)
  • L suffix = estimate Log2(x)
  • FP{F,L} = number of bits used for fixed point representation of fractional values
  • LS{F,L} = number of bits used for lookup table
  • IP{F,L} = number of bits used for interpolation

Score function:

func scoreF(x uint64, weight uint32) uint64 {
	return uint64(weight)*LookupF(x)
}
func scoreL(x uint64, weight uint32) uint64 {
	neglogx := (uint64(64)<<FPL) - LookupL(x)
	return (uint64(weight) << 32) / neglogx
}

lut init:

func init() {
	// lookup table for fraction -1/log(x) where x in [0,1]
	lutF[0] = 0
	for i := 1; i < (1<<LSF); i++ {
		target := float64(i) / (1<<LSF)
		val := uint32(math.Round(-1.0/math.Log(target)*math.Pow(2.0, float64(FPF))))
		lutF[i] = val
	}
	lutF[1<<LSF] = lutF[1<<LSF-1]+^uint32(0)
	// lookup table for fraction log2(x) where x in [1,2]
	for i := 0; i < (1<<LSL); i++ {
		target := 1.0+float64(i)/(1<<LSL)
		val := uint32(math.Round(math.Log2(target)*math.Pow(2.0, float64(FPL))))
		lutL[i] = val
	}
	lutL[1<<LSL] = 1<<FPL
}

Lookup implementation:

func LookupF(x uint64) uint64 {
	index := x>>(64-LSF)
	interp := (x >> (64 - IPF - LSF)) & ((1<<IPF)-1)
	current := uint64(lutF[index])
	next := uint64(lutF[index+1])
	delta := next-current
	return (current+(delta*interp)>>IPF)
}

func LookupL(x uint64) uint64 {
	msb := bits.Len64(x)-1
	var bitfield uint64
	if msb >= FPL {
		bitfield = (x >> (msb-FPL)) & ((1<<(FPL))-1)
	} else {
		bitfield = (x << (FPL-msb)) & ((1<<(FPL))-1)
	}
	index := bitfield>>(FPL-LSL)
	interp := bitfield&((1<<(FPL-LSL)-1))
	current := uint64(lutL[index])
	next := uint64(lutL[index+1])
	delta := next-current
	return (uint64(msb)<<FPL) | (current+(delta*interp)>>(FPL-LSL))
}

There are some quantitative differences when it comes to the size of the lookup table required for sufficient precision and the speed of the implementations. The F implementation is slightly faster, requiring ~5ns per entry per server; the L implementation requires ~9ns per entry per server. This is coincidentally about the same difference as running the L implementation with only a lookup table (no interpolation of the values in the table).

The L implementation requires a smaller LUT for sufficient precision; see the comparison below, using a lookup table of fixed-point uint32 values for both implementations, with linear interpolation between adjacent values.

  Size of lookup table (in bits)
   Relative error   F-implementation   L-implementation
  ---------------- ------------------ ------------------
   1e-2             7                  3
   1e-3             9                  5
   1e-4             12                 8

Regarding performance

The implementation does indeed get slower as the number of shards grows large (results from running GetShard in a loop):

 Cost of a GetShard call
  Shard Count   Shard Permuter   F-implementation   L-implementation  
 ------------- ---------------- ------------------ ------------------ 
            1   20ns             5ns                9ns               
            2   20ns             9ns                18ns              
            3   20ns             14ns               26ns              
            4   20ns             18ns               35ns              
            5   20ns             23ns               44ns              

While I don't have the data to support this, my gut feeling is that the amount of compute spent figuring out which blob belongs to which shard is unimportant. If you have a FindMissingBlobs (FMB) call for 10k blobs destined for n shards, you will have to send n FMB requests and perform 10k GetShard calls, each of which scores all n shards. The FMB requests themselves will likely be far more costly than the extra 50-90 microseconds per shard. If performance is the only reason to support both sharding methods, I don't think it's worth the effort. We might, however, want to support both sharding methods for other reasons.

Regarding reclaiming bits of precision back from the weight

We could reduce the maximum number of bits we allow the weight to be; however, allowing all 32 bits still lets us have very high precision. Even if we improve the balance of the address space between the shards, each individual blob also has a µ and a σ which will dominate the equation at some point. Looking at my clusters, today's algorithm, which gives a perfect distribution of the address space, still has about a 10% variance between shards of the same size due to individual blob size variance. Adding an addressing error of 1e-5 or 1e-6 would simply be lost in the noise.

@EdSchouten
Member

The L implementation requires a smaller lut for sufficient precision

Ah, yeah. That's the core difference between the two.

  • The nice thing about log2(x) where 1<=x<=2 is that it's very smooth, meaning the LUT of LookupL will be fairly accurate. This differs from -1/log2(x) where 0<=x<=1, which is very steep close to both 0 and 1.
  • LookupL can also reuse the same LUT for each power of two, whereas for the F implementation we only have a single LUT for the entire spectrum, meaning that we need log2(64) = 6 more bits to get the same granularity.

So let's stick with algorithm L, but let's see if we can make some minor cleanups to it. I'll post some more comments in a bit. Could you also please move it out of draft when you think it's appropriate?

var best uint64
var bestIndex int
for _, shard := range s.shards {
	mixed := splitmix64(shard.hash^hash)
Member

I wonder whether simply XORing these two values together is good enough at mixing them. This means that every bit of the digest's hash is only affected by exactly one bit of the server hash.

Is this a common way of combining hash values?

Author

XORing is a common way of preserving entropy: as long as the sources of randomness are uncorrelated (which the hash of the digest and the hash of the server key should be), the operation is guaranteed not to decrease entropy. The mixing into the keyspace comes from splitmix64, which has very good randomness properties at a very slim runtime cost.
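For reference, here is a minimal sketch of the SplitMix64 mixing step being referred to (constants from Vigna's published SplitMix64; the function body below is my illustration, not copied from this PR's diff):

// splitmix64 mixes a 64-bit value (here the XOR of the server-key hash
// and the digest hash) into a well-distributed 64-bit output.
func splitmix64(x uint64) uint64 {
	x += 0x9E3779B97F4A7C15
	x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9
	x = (x ^ (x >> 27)) * 0x94D049BB133111EB
	return x ^ (x >> 31)
}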

@@ -285,13 +285,21 @@ message ShardingBlobAccessConfiguration {
// the data.
uint64 hash_initialization = 1;
Member

Do we think this field is still needed? Now that shards are identified with a string, the potential concerns arising from layering multiple copies of ShardingBlobAccess don't really apply anymore, assuming each level uses a distinct set of keys (which it likely will).

Author

I have removed hash_initialization and the fnv algorithm in the latest version in favor of using the first 64 bits of the digest hash.
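For illustration (assuming the digest hash is available as a hex string of at least 16 characters and that "strconv" is imported; this helper is hypothetical, not the PR's code), taking the first 64 bits could look like:

// hashPrefix interprets the first 16 hex characters (64 bits) of a
// digest hash as an unsigned integer.
func hashPrefix(hexHash string) (uint64, error) {
	return strconv.ParseUint(hexHash[:16], 16, 64)
}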

Comment on lines +14 to +20
// A description of a shard. The shard selector will resolve to the same shard
// independent of the order of shards, but the returned index will correspond
// to the index sent to the ShardSelector's constructor.
type Shard struct {
	Key    string
	Weight uint32
}
Member

This type is only used by ShardingBlobAccess, right? ShardingBlobAccess does not need to be aware of the weights of individual shards.

Maybe it's better to add the following struct to sharding_blob_access.go:

type ShardBackend struct {
    Backend blobstore.BlobAccess
    Description string
}

?

Author

The Shard type is used by the ShardSelector as well as the ShardingBlobAccess but I modified the ShardingBlobAccess to use a ShardBackend type instead.

Improves the ability to perform resharding by switching the sharding
struct from array to map. Each map entry has a key which is used in
rendezvous hashing to deterministically select which shard to use from
the collection of keys. When a shard is removed it is guaranteed that
only blobs which belonged to the removed shard will resolve to a new
shard.

In combination with ReadFallbackConfigurations this allows adding and
removing shards with minimal need to rebalance the blobs between the
shards.
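
Conceptually (a sketch using the Shard, splitmix64, and scoreL definitions discussed above; hashKey, which hashes a shard key to 64 bits, is a hypothetical stand-in; this is not the PR's code), the selection works like this:

// selectShard returns the index of the shard with the highest weighted
// rendezvous score for the given digest hash. Each (shard, digest) pair
// is scored independently, so removing a shard only remaps the digests
// that previously resolved to it.
func selectShard(shards []Shard, digestHash uint64) int {
	bestIndex := -1
	var bestScore uint64
	for i, shard := range shards {
		mixed := splitmix64(hashKey(shard.Key) ^ digestHash)
		if score := scoreL(mixed, shard.Weight); bestIndex < 0 || score > bestScore {
			bestIndex, bestScore = i, score
		}
	}
	return bestIndex
}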

See https://github.com/buildbarn/bb-adrs buildbarn#11 for more details.
@meroton-benjamin meroton-benjamin marked this pull request as ready for review February 20, 2025 21:32