Implement sharding based on Rendezvous Hashing #238

Open
wants to merge 1 commit into master

Conversation

meroton-benjamin

As part of buildbarn/bb-adrs#5 we are attempting to improve the resharding experience. In this pull request we have an implementation of sharding based on Rendezvous Hashing where shards are described by a map instead of an array.

@meroton-benjamin
Author

The newest implementation uses integer math with fixed-point calculations. Log2 is calculated by taking the integer part from the number of bits, combined with a lookup table and linear interpolation for the fractional part.
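Restated for clarity (my paraphrase, not text from the PR): writing $x = 2^m \cdot f$ with $m = \lfloor \log_2 x \rfloor$ and mantissa $f \in [1, 2)$, we have

$\log_2(x) = m + \log_2(f)$

where $m$ comes directly from the bit length of $x$ and $\log_2(f)$ is approximated from a small lookup table with linear interpolation, entirely in fixed-point integer arithmetic.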

I have evaluated the error by running a billion entries over 5 shards of weight 1, 2, 4, 7, 1 respectively. I have also experimented with skipping the linear interpolation and/or the lookup table.

The current solution takes about 45 seconds on my laptop to calculate a billion entries. It has a distribution error of <5e-3 using a lookup table of 130 bytes.

It is possible to get higher precision: using a lookup table of 1 kilobyte, the error goes down to 5e-4, which is similar to what I got with the math.Log2 solution.

It is also possible to speed up the implementation by using a larger lookup table and foregoing the interpolation: with a lookup table of 260 kilobytes it also reaches a relative error of 5e-4, but can instead do the billion entries in about 30 seconds.

Skipping the lookup table and doing straight interpolation is the fastest, but gives significant errors (~15%).

There are also some tests added to verify that the distribution is unchanged, to prevent accidental changes. They verify that the exact distribution of the first 10 million entries is unchanged and that the exact first 20 values are unchanged. They take about half a second to run on my machine.

The lookup table is calculated at package initialization. This could potentially give different answers in case the lookup table is calculated on a platform which produces a different value when rounding log2 to 20 bits of precision. To guard against this I have added a checksum for verifying the lookup table at creation of a ShardedBlobAccess. Another, more stable, solution would be to calculate it outside of the process and have Bazel inject the lookup table.
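As an illustration only (the function name, the table name lut, the checksum algorithm, and the placeholder constant below are hypothetical, not the PR's actual code; assumes "encoding/binary", "fmt", and "hash/fnv" are imported and lut is a package-level slice or array of uint32), such a guard could look roughly like this:

// verifyLookupTable recomputes a checksum over the lookup table built
// in init() and compares it against a value recorded on the reference
// platform. A platform that rounds log2 differently fails loudly here
// instead of silently producing a different shard distribution.
func verifyLookupTable() error {
	h := fnv.New64a()
	var buf [4]byte
	for _, v := range lut {
		binary.LittleEndian.PutUint32(buf[:], v)
		h.Write(buf[:])
	}
	// Placeholder: the real value would be captured once on the
	// reference platform and checked into the repository.
	const expectedChecksum = 0x0
	if got := h.Sum64(); got != expectedChecksum {
		return fmt.Errorf("lookup table checksum mismatch: got %#x, want %#x", got, expectedChecksum)
	}
	return nil
}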

Since changing the constants gives small perturbations to the distribution, we should probably be pretty set on the parameters we want to use for interpolation.

@EdSchouten
Member

EdSchouten commented Feb 16, 2025

I think the math we do can be simplified a bit further. What if we created a LUT of $-1/\log(x)$, using an algorithm like the one below?

// Create a lookup table for f(x) = -1/log(x) between 0<=x<=1,
// normalized to the range of uint16, which can be used to
// perform linear interpolation.
const inBits = 6
const outBits = 16 + 1
const maxOut = math.MaxUint16
var lut [1<<inBits + 1]uint16

// Scale the lookup table so that the final entry to be used as
// a base (i.e. the penultimate entry of the lookup table) is
// set to 2^16-1.
scale := maxOut * math.Log(float64(1<<inBits-1)/float64(1<<inBits))
for i := 0; i < 1<<inBits; i++ {
        lut[i] = uint16(scale / math.Log(float64(i)/float64(1<<inBits)))
}

// -1/log(x) converges to infinity at x=1. Work around this by
// ensuring the delta between the two final entries is 2^16-1.
lut[1<<inBits] = lut[1<<inBits-1] + maxOut

That will yield the following LUT:

var lut = [...]uint16{
        0x0000, 0x00f8, 0x0129, 0x0151, 0x0174, 0x0194, 0x01b4, 0x01d2,
        0x01f0, 0x020e, 0x022b, 0x024a, 0x0268, 0x0287, 0x02a7, 0x02c7,
        0x02e8, 0x030a, 0x032d, 0x0351, 0x0377, 0x039e, 0x03c6, 0x03f0,
        0x041c, 0x0449, 0x0479, 0x04ab, 0x04e0, 0x0517, 0x0552, 0x058f,
        0x05d0, 0x0616, 0x065f, 0x06ae, 0x0701, 0x075b, 0x07bb, 0x0823,
        0x0893, 0x090d, 0x0992, 0x0a23, 0x0ac2, 0x0b72, 0x0c35, 0x0d0e,
        0x0e03, 0x0f18, 0x1054, 0x11c1, 0x136a, 0x1560, 0x17ba, 0x1a9a,
        0x1e31, 0x22ce, 0x28f4, 0x318f, 0x3e77, 0x53f9, 0x7efb, 0xffff, // <-- Note how this is 2^16-1.
        0xfffe, // <-- And this is 2^16-1 more than the entry before.
}

We can then compute the following:

hash := uint32(....)

// Destructure the hash into top/bottom bits.
const trailingBits = 32 - inBits
top := hash >> trailingBits
bottom := hash & (1<<trailingBits - 1)

// Obtain the base and the slope from the lookup table.
base := lut[top]
slope := lut[top+1] - base

// Compute -1/log(x) normalized to the range of uint32.
oneDividedByLog := uint64(base)<<trailingBits + uint64(bottom)*uint64(slope)
oneDividedByLogTruncated := oneDividedByLog >> (outBits-inBits)

Because weights are also 32-bit integers, we can then compute the score (without any risk of overflows) as follows:

score := uint64(weight) * oneDividedByLogTruncated

Edit: That said, instead of shifting oneDividedByLog by outBits-inBits, you could first call bits.Len() on all weights to figure out their magnitude. Most of the time people set weights to low values (e.g., just 1). In those cases fewer bits need to be discarded, or none at all. oneDividedByLog only sets the bottom 32-6+17=43 bits, meaning that if the weights are all smaller than $2^{64-43}=2^{21}$, no precision is lost.
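A rough sketch of what that could look like (maxWeight is a hypothetical precomputed maximum of all weights; oneDividedByLog, trailingBits, and outBits are as in the snippets above; assumes "math/bits" is imported; this is my illustration, not code from the PR):

// Only discard as many bits as needed for the multiplication by the
// largest weight to stay within 64 bits. oneDividedByLog occupies at
// most trailingBits+outBits = 43 bits.
shift := bits.Len32(maxWeight) - (64 - (trailingBits + outBits))
if shift < 0 {
	shift = 0 // all weights below 2^21: no precision is lost
}
score := uint64(weight) * (oneDividedByLog >> shift)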

Would it make sense to use an algorithm like this instead?

Comment on lines +51 to +36
if collision, exists := keyMap[hash]; exists {
	return nil, fmt.Errorf("Hash collision between shards: %s and %s", shard.Key, collision)
}
Member

Considering that we're already using a map at the Proto level, should we even care about this? I think we should think of this being a precondition to calling this function.

Author

In case of a hash collision between two different keys, the latter of the backends would become unused, and since the iteration order of the map is not well defined, which backend counts as the latter would vary from instance to instance.

Since we're using 64 bits from SHA-256, hash collisions are unlikely (even with thousands of shards, the birthday-bound probability is far below 1e-10). We could remove the check if you prefer, but it's only performed at creation of the shard selector anyway.

@EdSchouten
Member

My concern is that this new algorithm may be (significantly) slower if the number of shards is high. Could we please offer the existing algorithm as a secondary implementation?

@meroton-benjamin
Author

I have an implementation of your suggested modification. It is effectively almost the same implementation; see below for the differences.

  • F suffix = estimate the fraction -1/log(x)
  • L suffix = estimate Log2(x)
  • FP{F,L} = number of bits used for fixed point representation of fractional values
  • LS{F,L} = number of bits used for lookup table
  • IP{F,L} = number of bits used for interpolation

Score function:

func scoreF(x uint64, weight uint32) uint64 {
	return uint64(weight)*LookupF(x)
}
func scoreL(x uint64, weight uint32) uint64 {
	neglogx := (uint64(64)<<FPL) - LookupL(x)
	return (uint64(weight) << 32) / neglogx
}

lut init:

func init() {
	// lookup table for fraction -1/log(x) where x in [0,1]
	lutF[0] = 0
	for i := 1; i < (1<<LSF); i++ {
		target := float64(i) / (1<<LSF)
		val := uint32(math.Round(-1.0/math.Log(target)*math.Pow(2.0, float64(FPF))))
		lutF[i] = val
	}
	lutF[1<<LSF] = lutF[1<<LSF-1]+^uint32(0)
	// lookup table for fraction log2(x) where x in [1,2]
	for i := 0; i < (1<<LSL); i++ {
		target := 1.0+float64(i)/(1<<LSL)
		val := uint32(math.Round(math.Log2(target)*math.Pow(2.0, float64(FPL))))
		lutL[i] = val
	}
	lutL[1<<LSL] = 1<<FPL
}

Lookup implementation:

func LookupF(x uint64) uint64 {
	index := x>>(64-LSF)
	interp := (x >> (64 - IPF - LSF)) & ((1<<IPF)-1)
	current := uint64(lutF[index])
	next := uint64(lutF[index+1])
	delta := next-current
	return (current+(delta*interp)>>IPF)
}

func LookupL(x uint64) uint64 {
	msb := bits.Len64(x)-1
	var bitfield uint64
	if msb >= FPL {
		bitfield = (x >> (msb-FPL)) & ((1<<(FPL))-1)
	} else {
		bitfield = (x << (FPL-msb)) & ((1<<(FPL))-1)
	}
	index := bitfield>>(FPL-LSL)
	interp := bitfield&((1<<(FPL-LSL)-1))
	current := uint64(lutL[index])
	next := uint64(lutL[index+1])
	delta := next-current
	return (uint64(msb)<<FPL) | (current+(delta*interp)>>(FPL-LSL))
}

There are some quantitative differences when it comes to the size of the lookup table required for sufficient precision and the speed of the implementations. The F implementation is slightly faster, requiring ~5ns per entry per server; the L implementation requires ~9ns per entry per server. This is coincidentally about the same difference as running the L implementation with only a lookup table (no interpolation of the values in the table).

The L implementation requires a smaller LUT for sufficient precision; see the comparison below, using a lookup table of fixed-point uint32 values for both implementations, with linear interpolation between adjacent values.

  Size of lookup table (in bits)
   Relative error   F-implementation   L-implementation
  ---------------- ------------------ ------------------
   1e-2             7                  3
   1e-3             9                  5
   1e-4             12                 8

Regarding performance

The implementation does indeed get slower as the number of shards grows large (results from running GetShard in a loop):

 Cost of a GetShard call
  Shard Count   Shard Permuter   F-implementation   L-implementation  
 ------------- ---------------- ------------------ ------------------ 
            1   20ns             5ns                9ns               
            2   20ns             9ns                18ns              
            3   20ns             14ns               26ns              
            4   20ns             18ns               35ns              
            5   20ns             23ns               44ns              

While I don't have the data to support this, my gut feeling is that the amount of compute spent figuring out which blob belongs to which shard is unimportant. If you have a FindMissingBlobs (FMB) call for 10k blobs destined for n shards, you will have to send n FMB requests and perform 10k GetShard calls, each of which scores all n shards. The FMB requests themselves will likely be far more costly than the extra 50-90 microseconds per shard. If performance is the only reason to support both sharding methods, I don't think it's worth the effort. We might, however, want to support both sharding methods for other reasons.

Regarding reclaiming bits of precision back from the weight

We could reduce the maximum number of bits we allow the weight to be; however, allowing all 32 bits still lets us have very high precision. Even if we improve the balance of the address space between the shards, each individual blob also has a µ and a σ which will dominate the equation at some point. Looking at my clusters, today's algorithm, which gives a perfect distribution of the address space, still has about a 10% variance between shards of the same size due to individual blob size variance. Adding an addressing error of 1e-5 or 1e-6 would simply be lost in the noise.

@EdSchouten
Member

The L implementation requires a smaller lut for sufficient precision

Ah, yeah. That's the core difference between the two.

  • The nice thing about log2(x) where 1<=x<=2 is that it's very smooth, meaning the LUT of LookupL will be fairly accurate. This differs from -1/log2(x) where 0<=x<=1, which is very steep close to both 0 and 1.
  • LookupL can also reuse the same LUT for each power of two, whereas for the F implementation we only have a single LUT for the entire spectrum, meaning that we need log2(64) = 6 more bits to get the same granularity.

So let's stick with algorithm L, but let's see if we can make some minor cleanups to it. I'll post some more comments in a bit. Could you also please move it out of draft when you think it's appropriate?

var best uint64
var bestIndex int
for _, shard := range s.shards {
	mixed := splitmix64(shard.hash^hash)
Member

I wonder whether simply XORing these two values together is good enough at mixing them. This means that every bit of the digest's hash is only affected by exactly one bit of the server hash.

Is this a common way of combining hash values?

Author

XORing is a common way of preserving entropy: as long as the sources of randomness are uncorrelated (which the hash of the digest and the hash of the server key should be), the operation is guaranteed not to decrease entropy. The mixing into the keyspace comes from splitmix64, which has very good randomness properties at a very slim runtime cost.
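For reference, here is a minimal sketch of the SplitMix64 mixing step being referred to (constants from Vigna's published SplitMix64; the function body below is my illustration, not copied from this PR's diff):

// splitmix64 mixes a 64-bit value (here the XOR of the server-key hash
// and the digest hash) into a well-distributed 64-bit output.
func splitmix64(x uint64) uint64 {
	x += 0x9E3779B97F4A7C15
	x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9
	x = (x ^ (x >> 27)) * 0x94D049BB133111EB
	return x ^ (x >> 31)
}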

@@ -285,13 +285,21 @@ message ShardingBlobAccessConfiguration {
// the data.
uint64 hash_initialization = 1;
Member

Do we think this field is still needed? Now that shards are identified with a string, the potential concerns arising from layering multiple copies of ShardingBlobAccess don't really apply anymore, assuming each level uses a distinct set of keys (which it likely will).

Author

I have removed hash_initialization and the fnv algorithm in the latest version in favor of using the first 64 bits of the digest hash.
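For illustration (assuming the digest hash is available as a hex string of at least 16 characters and that "strconv" is imported; this helper is hypothetical, not the PR's code), taking the first 64 bits could look like:

// hashPrefix interprets the first 16 hex characters (64 bits) of a
// digest hash as an unsigned integer.
func hashPrefix(hexHash string) (uint64, error) {
	return strconv.ParseUint(hexHash[:16], 16, 64)
}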

Comment on lines +14 to +20
// A description of a shard. The shard selector will resolve to the same shard
// independent of the order of shards, but the returned index will correspond
// to the index sent to the ShardSelector's constructor.
type Shard struct {
	Key    string
	Weight uint32
}
Member

This type is only used by ShardingBlobAccess, right? ShardingBlobAccess does not need to be aware of the weights of individual shards.

Maybe it's better to add the following struct to sharding_blob_access.go:

type ShardBackend struct {
    Backend blobstore.BlobAccess
    Description string
}

?

Author

The Shard type is used by the ShardSelector as well as the ShardingBlobAccess but I modified the ShardingBlobAccess to use a ShardBackend type instead.

Improves the ability to perform resharding by switching the sharding
struct from array to map. Each map entry has a key which is used in
rendezvous hashing to deterministically select which shard to use from
the collection of keys. When a shard is removed it is guaranteed that
only blobs which belonged to the removed shard will resolve to a new
shard.

In combination with ReadFallbackConfigurations this allows adding and
removing shards with minimal need to rebalance the blobs between the
shards.
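
Conceptually (a sketch using the Shard, splitmix64, and scoreL definitions discussed above; hashKey, which hashes a shard key to 64 bits, is a hypothetical stand-in; this is not the PR's code), the selection works like this:

// selectShard returns the index of the shard with the highest weighted
// rendezvous score for the given digest hash. Each (shard, digest) pair
// is scored independently, so removing a shard only remaps the digests
// that previously resolved to it.
func selectShard(shards []Shard, digestHash uint64) int {
	bestIndex := -1
	var bestScore uint64
	for i, shard := range shards {
		mixed := splitmix64(hashKey(shard.Key) ^ digestHash)
		if score := scoreL(mixed, shard.Weight); bestIndex < 0 || score > bestScore {
			bestIndex, bestScore = i, score
		}
	}
	return bestIndex
}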

See https://github.com/buildbarn/bb-adrs buildbarn#11 for more details.
@meroton-benjamin meroton-benjamin marked this pull request as ready for review February 20, 2025 21:32