Add new 'quorum' blob access #213

Open

anguslees wants to merge 1 commit into master

Conversation

@anguslees commented Aug 15, 2024

Add a new 'quorum' blob access that provides high availability on top of already-durable storage.

Quorum blob access only requires that a subset of backends be available in order to function. The exact quorum sizes are configurable, but almost all cases should use the "smallest integer greater than half" rule for both the read and write quorum size, i.e. 2 out of 3 backends for single-failure tolerance, or 3 out of 5 backends for double-failure tolerance.

Writes (Put) must succeed on at least write_quorum backends, and are performed in parallel. Reads (Get) must see at least read_quorum not-found responses before concluding that a blob does not exist, and are performed sequentially. FindMissing reads are performed in parallel against read_quorum backends, and the results are merged.

Note: blobs are not replicated again after the initial Put, so the underlying storage should be durable.
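
To make the sizing rule concrete, here is a minimal sketch (plain Go, not code from this change; the majority helper is hypothetical) of how majority quorums relate to the overlap requirement:

package main

import "fmt"

// majority returns the smallest integer greater than half of n,
// e.g. 2 for 3 backends and 3 for 5 backends.
func majority(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 5} {
		read, write := majority(n), majority(n)
		// Quorum intersection: every read set overlaps every write set
		// only if read_quorum + write_quorum > number of backends.
		fmt.Printf("%d backends: read_quorum=%d, write_quorum=%d, overlap=%v\n",
			n, read, write, read+write > n)
	}
}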

@EdSchouten (Member) left a comment

To me it's unclear why one would want to use a backend like this. The way MirroredBlobAccess is designed/implemented right now, you get fairly decent storage density. You "only" lose 50% of space. Well, slightly more if you want to keep some nodes on hot standby. It's possible to lose less than 50% if you resort to Reed-Solomon encoding, but for the small objects we often have to work with that's likely too much overhead. This backend only lets you achieve 100%/n = 33%, 25%, 20%, .... I fail to see how that's economical.

The reason MirroredBlobAccess requires both backends to be online is that we rely on battle-tested tools like Kubernetes to do the health checking/backend migration/service resolution for us. The advantage of that is that Buildbarn can stay simple, and that you as a cluster administrator can use tools like kubectl, kube-state-metrics, etc. to get decent insight into what's going on.

Now you may think: "I use bare metal hardware. Having a dependency on Kubernetes is unreasonable." Well, my response to that is that you can keep your cluster as simple as you want. It doesn't need to be as complete as the stock clusters offered by cloud providers. You can get pretty far by running etcd, kube-apiserver, kube-controller-manager and kube-scheduler on a single server, and kubelet on all of your storage nodes. There is no need to set up a CNI, nor do you need kube-proxy/coredns, as Buildbarn binaries are nowadays capable of resolving endpoints using kuberesolver.

Comment on lines +167 to +171
readQuorum := int(backend.Quorum.ReadQuorum)
writeQuorum := int(backend.Quorum.WriteQuorum)
if readQuorum+writeQuorum <= len(backends) {
	return BlobAccessInfo{}, "", status.Errorf(codes.InvalidArgument, "Quorum blob access requires read_quorum + write_quorum > number of backends")
}
@EdSchouten (Member):

Note that int may not always be large enough to represent all values of uint32. Also note that there are more constraints that need to be applied. I would write the following instead:

if l := uint32(len(backends)); readQuorum < 1 || readQuorum > l || writeQuorum < 1 || writeQuorum > l || readQuorum + writeQuorum <= l {
  ...
}

Then later only convert the values to int as part of the call to NewQuorumBlobAccess.
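
A sketch of what that could look like (illustrative only; the helper name below is not taken from the PR), keeping the comparison in uint32 and narrowing to int only after validation:

package main

import "fmt"

// validateQuorum checks the reviewer's constraints while the values are
// still uint32, so nothing can overflow or go negative before the check.
func validateQuorum(readQuorum, writeQuorum uint32, numBackends int) error {
	if l := uint32(numBackends); readQuorum < 1 || readQuorum > l ||
		writeQuorum < 1 || writeQuorum > l || readQuorum+writeQuorum <= l {
		return fmt.Errorf("read_quorum and write_quorum must each be in [1, %d] and their sum must exceed %d", l, l)
	}
	return nil
}

func main() {
	fmt.Println(validateQuorum(2, 2, 3)) // <nil>: 2+2 > 3
	fmt.Println(validateQuorum(2, 1, 4)) // error: 2+1 <= 4
	// Only after validation would the values be converted to int for
	// the call to NewQuorumBlobAccess.
}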

@anguslees (Author):

Yeah, I felt it was far more likely for someone to mistakenly put a small negative number in here than a super large positive number, so I went with a uint in the proto. I agree; I don't like casting int types, and I'm happy to change this to an int throughout. (Or whatever other solution you would prefer.)

Comment on lines +157 to +162
if combinedDigestKeyFormat == nil {
	combinedDigestKeyFormat = &backend.DigestKeyFormat
} else {
	newDigestKeyFormat := combinedDigestKeyFormat.Combine(backend.DigestKeyFormat)
	combinedDigestKeyFormat = &newDigestKeyFormat
}
@EdSchouten (Member):

This was probably copy-pasted from the ShardingBlobAccess code. But that code only needs to use a pointer to keep track of whether the value is actually set, because some of its backends may be drained.

My recommendation would be to not let combinedDigestKeyFormat be a pointer here. Then do this:

for i, backend := range ... {
	if i == 0 {
		combinedDigestKeyFormat = backend.DigestKeyFormat
	} else {
		combinedDigestKeyFormat = combinedDigestKeyFormat.Combine(backend.DigestKeyFormat)
	}
}


func (ba *quorumBlobAccess) shuffledBackends() []blobstore.BlobAccess {
	backends := make([]blobstore.BlobAccess, len(ba.backends))
	copy(backends, ba.backends)
@EdSchouten (Member):

backends := append([]blobstore.BlobAccess(nil), ba.backends...)
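
For completeness, the whole helper could then read as follows (a sketch only; the shuffle step is assumed from the method name and is not shown in the excerpt above, and requires "math/rand"):

func (ba *quorumBlobAccess) shuffledBackends() []blobstore.BlobAccess {
	// Copy the slice so the shared ba.backends order is never mutated.
	backends := append([]blobstore.BlobAccess(nil), ba.backends...)
	rand.Shuffle(len(backends), func(i, j int) {
		backends[i], backends[j] = backends[j], backends[i]
	})
	return backends
}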


group.Go(func() error {
	var err error
	for backend := range backendQueue {
@EdSchouten (Member):

Huh? So we launch n goroutines, and each of those attempts to send n requests?

group.Go(func() error {
	var err error
	for backend := range backendQueue {
		err = backend.Put(ctx, digest, b1)
@EdSchouten (Member) commented Aug 15, 2024

BlobAccess.Put() takes ownership of b1. You can't call it again with the same buffer. The reason being that it could have converted the buffer to a reader and already consumed some of the data inside. We can't rewind, as we may be streaming data coming from the client/server directly.

Think of what would happen if a user tries to upload a 10 GB blob. There is no way we can hold all of that data in memory in a single buffer, and repeatedly attempt to store it in a bunch of backends.
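
A minimal plain-Go illustration of the ownership problem (this is deliberately not Buildbarn's buffer API): once a stream has been consumed by the first backend's write, a retry with the same object has nothing left to send.

package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	var blob io.Reader = strings.NewReader("blob contents")

	// The first backend's Put consumes the stream...
	io.Copy(io.Discard, blob)

	// ...so a second Put that reuses the same reader sees zero bytes.
	n, _ := io.Copy(io.Discard, blob)
	fmt.Println(n) // 0: the data cannot be replayed for another backend
}

Any retry-across-backends scheme therefore has to split or re-obtain the stream before the first write, rather than replaying the same buffer afterwards.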

Comment on lines +206 to +211
// Find intersection of all results
missingFromAll := results[0]
for _, result := range results[1:] {
	_, missingFromAll, _ = digest.GetDifferenceAndIntersection(missingFromAll, result)
}
return missingFromAll, nil
@EdSchouten (Member) commented Aug 15, 2024

Imagine I have 5 storage backends, named 0 to 4. Read quorum is set to 4.

  • I first issue a Put(). Due to network flakiness, the blob only ends up getting written to backend 0.
  • Then I issue FindMissingBlobs(), which gets run against backends 0, 1, 2, and 3. Backend 0 reports it as present, while backends 1, 2 and 3 report it as absent. As we compute the intersection, we report it as being present.
  • Then I issue Get(), which gets run against backends 1, 2, 3 and 4. All of those return NOT_FOUND.

This is in violation of the protocol.


func (ba *quorumBlobAccess) Get(ctx context.Context, digest digest.Digest) buffer.Buffer {
	return ba.get(func(b blobstore.BlobAccess) buffer.Buffer {
		return b.Get(ctx, digest)
@EdSchouten (Member):

How do you deal with servers being down in such a way that calls to Get() spend a lot of time hanging, e.g. waiting for a TCP connection to establish?

@EdSchouten (Member) commented

Note: blobs are not replicated again after the initial Put, so the underlying storage should be durable.

I don't think that's a very desirable property. What should users do if the underlying storage nodes do get out of sync (e.g., due to disks in a server getting replaced)? Do we provide any tools to do the copying?
