Implement sharding based on Rendezvous Hashing #238
base: master
Conversation
Force-pushed f0b6827 to 3a38f82
The newest implementation uses integer math with fixed-point calculations. Log2 is computed by taking the integer part from the number of bits, combined with a lookup table plus linear interpolation for the fractional part. I have evaluated the error by running a billion entries over 5 shards of weight 1, 2, 4, 7, 1 respectively. I have also experimented with skipping the linear interpolation and/or the lookup table.

The current solution takes about 45 seconds on my laptop to calculate a billion entries. It has a distribution error of <5e-3 using a lookup table of 130 bytes. It is possible to get higher precision: using a lookup table of 1 kilobyte the error can go down to 5e-4, which is similar to what I got with the math.Log2 solution. It is also possible to speed up the implementation by using a larger lookup table and foregoing the interpolation; with a lookup table of 260 kilobytes it also gets a relative error of 5e-4 but can instead do the billion entries in about 30 seconds. Skipping the lookup table and doing straight interpolation is the fastest, but gives significant errors (~15%).

There are also some tests added to verify that the distribution is unchanged, to prevent accidental changes. They verify that the exact distribution of the first 10 million entries is unchanged and that the exact first 20 values are unchanged. They take about half a second to run on my machine.

The lookup table is calculated at package initialization. This could potentially give different answers in case the lookup table is calculated on a platform which produces a different value when rounding log2 to 20 bits of precision. To guard against this I have added a checksum for verifying the lookup table at creation of a ShardedBlobAccess. Another, more stable, solution would be to calculate it outside of the process and have Bazel inject the lookup table.

Since changing the constants gives small perturbations to the distribution, we should probably be pretty set on the parameters we want to use for interpolation.
I think the math we do can be simplified a bit further. What if we created a LUT of:

```go
// Create a lookup table for f(x) = -1/log(x) between 0<=x<=1,
// normalized to the range of uint16, which can be used to
// perform linear interpolation.
const inBits = 6
const outBits = 16 + 1
const maxOut = math.MaxUint16
var lut [1<<inBits + 1]uint16

// Scale the lookup table so that the final entry to be used as
// a base (i.e. the penultimate entry of the lookup table) is
// set to 2^16-1.
scale := maxOut * math.Log(float64(1<<inBits-1)/float64(1<<inBits))
for i := 0; i < 1<<inBits; i++ {
	lut[i] = uint16(scale / math.Log(float64(i)/float64(1<<inBits)))
}

// -1/log(x) converges to infinity at x=1. Work around this by
// ensuring the delta between the two final entries is 2^16-1.
lut[1<<inBits] = lut[1<<inBits-1] + maxOut
```

That will yield the following LUT:

```go
var lut = [...]uint16{
	0x0000, 0x00f8, 0x0129, 0x0151, 0x0174, 0x0194, 0x01b4, 0x01d2,
	0x01f0, 0x020e, 0x022b, 0x024a, 0x0268, 0x0287, 0x02a7, 0x02c7,
	0x02e8, 0x030a, 0x032d, 0x0351, 0x0377, 0x039e, 0x03c6, 0x03f0,
	0x041c, 0x0449, 0x0479, 0x04ab, 0x04e0, 0x0517, 0x0552, 0x058f,
	0x05d0, 0x0616, 0x065f, 0x06ae, 0x0701, 0x075b, 0x07bb, 0x0823,
	0x0893, 0x090d, 0x0992, 0x0a23, 0x0ac2, 0x0b72, 0x0c35, 0x0d0e,
	0x0e03, 0x0f18, 0x1054, 0x11c1, 0x136a, 0x1560, 0x17ba, 0x1a9a,
	0x1e31, 0x22ce, 0x28f4, 0x318f, 0x3e77, 0x53f9, 0x7efb, 0xffff, // <-- Note how this is 2^16-1.
	0xfffe, // <-- And this is 2^16-1 more than the entry before.
}
```

We can then compute the following:

```go
hash := uint32(....)

// Destructure the hash into top/bottom bits.
const trailingBits = 32 - inBits
top := hash >> trailingBits
bottom := hash & (1<<trailingBits - 1)

// Obtain the base and the slope from the lookup table.
base := lut[top]
slope := lut[top+1] - base

// Compute 1/log(x) normalized to the range of uint32.
oneDividedByLog := uint64(base)<<trailingBits + uint64(bottom)*uint64(slope)
oneDividedByLogTruncated := oneDividedByLog >> (outBits - inBits)
```

Because weights are also 32-bit integers, we can then compute the score (without any risk of overflows) as follows:

```go
score := uint64(weight) * oneDividedByLogTruncated
```

Edit: That said, instead of shifting …

Would it make sense to use an algorithm like this instead?
```go
if collision, exists := keyMap[hash]; exists {
	return nil, fmt.Errorf("Hash collision between shards: %s and %s", shard.Key, collision)
}
```
Considering that we're already using a map at the Proto level, should we even care about this? I think we should treat this as a precondition to calling this function.
In case of a hash collision between two different keys, the latter of the backends would become unused, and since the iteration order of the map is not well defined, which backend is "the latter" would vary from instance to instance.
Since we're using 64 bits of the SHA-256 hash, collisions are unlikely; we could remove the check if you prefer, but it's only checked at creation of the shard selector anyway.
My concern is that this new algorithm may be (significantly) slower if the number of shards is high. Could we please offer the existing algorithm as a secondary implementation?
I have an implementation of your suggested modification. It is effectively almost the same implementation; see below for the differences.

Score functions:

```go
func scoreF(x uint64, weight uint32) uint64 {
	return uint64(weight) * LookupF(x)
}

func scoreL(x uint64, weight uint32) uint64 {
	neglogx := (uint64(64) << FPL) - LookupL(x)
	return (uint64(weight) << 32) / neglogx
}
```

LUT init:

```go
func init() {
	// Lookup table for fraction -1/log(x) where x in [0,1].
	lutF[0] = 0
	for i := 1; i < (1 << LSF); i++ {
		target := float64(i) / (1 << LSF)
		val := uint32(math.Round(-1.0 / math.Log(target) * math.Pow(2.0, float64(FPF))))
		lutF[i] = val
	}
	lutF[1<<LSF] = lutF[1<<LSF-1] + ^uint32(0)

	// Lookup table for fraction log2(x) where x in [1,2].
	for i := 0; i < (1 << LSL); i++ {
		target := 1.0 + float64(i)/(1<<LSL)
		val := uint32(math.Round(math.Log2(target) * math.Pow(2.0, float64(FPL))))
		lutL[i] = val
	}
	lutL[1<<LSL] = 1 << FPL
}
```

Lookup implementation:

```go
func LookupF(x uint64) uint64 {
	index := x >> (64 - LSF)
	interp := (x >> (64 - IPF - LSF)) & ((1 << IPF) - 1)
	current := uint64(lutF[index])
	next := uint64(lutF[index+1])
	delta := next - current
	return current + (delta*interp)>>IPF
}

func LookupL(x uint64) uint64 {
	msb := bits.Len64(x) - 1
	var bitfield uint64
	if msb >= FPL {
		bitfield = (x >> (msb - FPL)) & ((1 << FPL) - 1)
	} else {
		bitfield = (x << (FPL - msb)) & ((1 << FPL) - 1)
	}
	index := bitfield >> (FPL - LSL)
	interp := bitfield & ((1 << (FPL - LSL)) - 1)
	current := uint64(lutL[index])
	next := uint64(lutL[index+1])
	delta := next - current
	return (uint64(msb) << FPL) | (current + (delta*interp)>>(FPL-LSL))
}
```

There are some quantitative differences in the size of the lookup table required for sufficient precision and in the speed of the implementations. The F implementation is slightly faster, requiring ~5 ns per entry per server; the L implementation requires ~9 ns per entry per server. This is coincidentally about the same difference as when running the L implementation with only a lookup table (no interpolation between the values). The L implementation requires a smaller LUT for sufficient precision; see the comparison here, using a lookup table of fixed-point uint32 values for both, with linear interpolation between values.

Regarding performance: the implementation does indeed get slower as the number of shards grows large (results from running GetShard in a loop).

While I don't have the data to support this, my gut feeling is that the amount of compute spent figuring out which blob belongs to which shard is unimportant: if you have an FMB for 10k blobs destined for n shards, you will have to send n FMB requests and perform n*10k GetShard calls. The FMB requests themselves will likely be far more costly than the extra cost of 50-90 microseconds per shard. If performance is the only reason to support both sharding methods, I don't think it's worth the effort. We could, however, want to support both sharding methods for other reasons.

Regarding reclaiming bits of precision back from the weight: we could reduce the maximum number of bits we allow the weight to be, but allowing all 32 bits still lets us have very high precision. Even if we improve the balance of addresses between the shards, each individual blob will also have a µ and a σ which will dominate the equation for some value. Looking at my clusters, today's algorithm, which gives a perfect distribution of address space, still has about a 10% variance between shards of the same size due to individual blob size variance. Adding an addressing error of 1e-5 or 1e-6 would simply be lost in the noise.
Ah, yeah. That's what the core difference is between the two.
So let's stick with algorithm L, but let's see if we can make some minor cleanups to it. I'll post some more comments in a bit. Could you also please move it out of draft when you think it's appropriate?
```go
var best uint64
var bestIndex int
for _, shard := range s.shards {
	mixed := splitmix64(shard.hash ^ hash)
```
I wonder whether simply XORing these two values together is good enough at mixing them. This means that every bit of the digest's hash is only affected by exactly one bit of the server hash.
Is this a common way of combining hash values?
XORing is a common way of preserving entropy: as long as the sources of randomness are uncorrelated (which the hash of the digest and the hash of the server key should be), the operation is guaranteed not to decrease entropy. The mixing into the keyspace comes from splitmix64, which has very good randomness properties at a very slim runtime cost.
```diff
@@ -285,13 +285,21 @@ message ShardingBlobAccessConfiguration {
     // the data.
     uint64 hash_initialization = 1;
```
Do we think this field is still needed? Now that shards are identified with a string, the potential concerns arising from layering multiple copies of ShardingBlobAccess don't really apply anymore, assuming each level uses a distinct set of keys (which it likely will).
I have removed hash_initialization and the fnv algorithm in the latest version in favor of using the first 64 bits of the digest hash.
```go
// A description of a shard. The shard selector will resolve to the same shard
// independent of the order of shards, but the returned index will correspond
// to the index passed to the ShardSelector's constructor.
type Shard struct {
	Key    string
	Weight uint32
}
```
This type is only used by ShardingBlobAccess, right? ShardingBlobAccess does not need to be aware of the weights of individual shards.
Maybe it's better to add the following struct to sharding_blob_access.go?

```go
type ShardBackend struct {
	Backend     blobstore.BlobAccess
	Description string
}
```
The Shard type is used by the ShardSelector as well as by ShardingBlobAccess, but I modified ShardingBlobAccess to use a ShardBackend type instead.
Force-pushed 3a38f82 to 4011c29
Improves the ability to perform resharding by switching the sharding configuration from an array to a map. Each map entry has a key which is used in rendezvous hashing to deterministically select which shard to use from the collection of keys. When a shard is removed, it is guaranteed that only blobs which belonged to the removed shard will resolve to a new shard. In combination with ReadFallbackConfigurations, this allows adding and removing shards with minimal need to rebalance blobs between the shards. See https://github.com/buildbarn/bb-adrs buildbarn#11 for more details.
Force-pushed 4011c29 to 86f3bfb
As part of buildbarn/bb-adrs#5 we are attempting to improve the resharding experience. In this pull request we have an implementation of sharding based on Rendezvous Hashing where shards are described by a map instead of an array.