Add blob split and splice API #282
base: main
Conversation
Though this proposal may solve certain annoyances, it does not fix some other issues:
I would rather see that we try to solve these issues as part of REv3, by working towards eliminating the existence of large CAS objects in their entirety. See this section in the proposal I've been working on:
This is quite similar to https://github.com/bazelbuild/remote-apis/pull/233/files. @roloffs have you had a chance to review the prior PR and its related discussions? I think both PRs approach this by adding a separate RPC, which is good and V2 compatible.
@EdSchouten, thanks for sharing the link to your REv3 discussion document and your comments about this proposal. Sorry for not being aware of this document. After going through it, I agree with you that this API extension would not make much sense in REv3 given your proposal in the "Elimination of large CAS objects" section. However, as @sluongng also stated, since this extension is conservative and backwards compatible with REv2, and the release of REv3 is very uncertain right now, it would not harm people who do not use it, but would already provide advantages for people who do, and could also lead to insights for your content-defined chunking ideas for REv3, since we also used such an algorithm to split blobs. I also agree with your concern that uploading large files is not covered by this proposal, while relevant use cases exist for it. However, I can think of a symmetric SpliceBlob rpc to allow for splitting a large blob on the client side, uploading only those parts of this blob that are missing on the server side, and then splicing there. This could be added in this PR as well.

@sluongng, thanks for pointing out this PR. Despite the fact that they look very similar, they actually target complementary goals. Let me explain why. While the PR from @EdSchouten introduces split and combine blob rpcs, the goal is not to save traffic but to introduce a blob splitting scheme that allows verifying the integrity of a blob by validating the digests of its chunks without actually reading the whole chunk data. In order to achieve this, he introduced a new digest function SHA256TREE, which allows recursive digest calculation. I hope I did not completely misunderstand your intention, @EdSchouten. In contrast, the presented splitting scheme targets reusing as much data as possible, with the final goal of traffic reduction between client and server. E.g., if a large binary in the remote CAS was just modified slightly and you want to use it locally, you would have to download it completely. Using the presented extension, only the binary differences between the two versions, as determined by content-defined chunking, would have to be downloaded, which is typically much less than the whole data. As I said, both splitting schemes are actually complementary and follow different goals.
I think what's missing in this PR is a specification of how the splitting algorithm would look, and the ability to choose different algorithms for the job. In #233, the chunking algorithm was mixed with the digest algorithm, which I think is a good start as it's customizable. But I can definitely see cases where the digest algorithm and the chunking algorithm are separated for different combinations (i.e., Reed-Solomon + BLAKE3, FastCDC + SHA256, delta compression + GITSHA1, etc.). And each combination could serve different purposes (deduplication, download parallelization, etc.). It would be nice if you could provide a bit more detail regarding your splitting algorithm of choice as an option here.
While the actual choice of the splitting algorithm is mainly an implementation detail of the remote-execution endpoint (which of course affects the quality of the split result), the essential property of a server is to provide certain guarantees to a client if it successfully answers a SplitBlob request:
Besides this guarantee, in order to increase the reuse factor as much as possible between different versions of a blob, it makes sense to implement a content-defined chunking algorithm. Such algorithms typically result in chunks of variable size and are insensitive to the data-shifting problem of fixed-size chunking. They typically rely on a rolling-hash function to efficiently compute hash values of consecutive bytes at every byte position in the data stream in order to determine the chunk boundaries. Popular algorithms for content-defined chunking are:
I have selected FastCDC as the chunking algorithm for the endpoint implementation in our build system, since it has been proven to be very compute efficient and faster than the other rolling-hash algorithms while achieving deduplication ratios similar to the Rabin fingerprint. We have already observed reuse factors of 96-98% for small changes when working with big file-system images (around 800 MB), and of 75% for a 300 MB executable with debug information. Maybe you want to have a look at our internal design document for more information about this blob-splitting API extension.
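For readers unfamiliar with FastCDC, the following is a minimal, illustrative Go sketch of the kind of cut-point selection it describes: a gear-based rolling hash, a skipped minimum size, a stricter mask before the average size and a looser one after it. The gear table, masks, and sizes here are placeholder values, not the parameters proposed in this PR or used by justbuild.

```go
package main

import "fmt"

// gear is the 256-entry random table used by the rolling "gear" hash.
// Real implementations ship a fixed, implementation-specific table; the
// splitmix64-style fill below is only a placeholder to keep the sketch
// self-contained.
var gear [256]uint64

func init() {
	x := uint64(0x9E3779B97F4A7C15)
	for i := range gear {
		x += 0x9E3779B97F4A7C15
		z := x
		z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9
		z = (z ^ (z >> 27)) * 0x94D049BB133111EB
		gear[i] = z ^ (z >> 31)
	}
}

// cutPoint returns the length of the next chunk of data, following the
// FastCDC idea: skip the minimum size, apply a stricter mask up to the
// average size and a looser mask afterwards, and never exceed the maximum.
func cutPoint(data []byte, minSize, avgSize, maxSize int, maskS, maskL uint64) int {
	n := len(data)
	if n <= minSize {
		return n
	}
	if n > maxSize {
		n = maxSize
	}
	var fp uint64
	i := minSize
	for ; i < n && i < avgSize; i++ {
		fp = (fp << 1) + gear[data[i]]
		if fp&maskS == 0 {
			return i + 1
		}
	}
	for ; i < n; i++ {
		fp = (fp << 1) + gear[data[i]]
		if fp&maskL == 0 {
			return i + 1
		}
	}
	return i
}

func main() {
	data := make([]byte, 1<<20)
	for i := range data {
		data[i] = byte(i * 31)
	}
	// Example parameters only: 2 KB / 8 KB / 64 KB with 15- and 11-bit masks.
	n := cutPoint(data, 2*1024, 8*1024, 64*1024, (1<<15)-1, (1<<11)-1)
	fmt.Println("first chunk length:", n)
}
```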
Ah, I think I have realized what's missing here. Your design seems to focus on splitting the blob on the server side for the client to download large blobs, while I was thinking that blob splitting could happen on both the client side and the server side. For example: a game designer may work on some graphic assets, say a really large picture. Subsequent versions of the picture may get chunked on the client side. Then the client can compare the chunk list with the chunks that are already available on the server side, and only upload the parts that are missing. So in the case where both client and server have to split big blobs for efficient download AND upload, it's beneficial for the two sides to agree upon how to split (and put back together) big blobs.
Yes, you are right, this design currently focuses on splitting on the server side and downloading large blobs, but as mentioned in a comment above, I am willing to extend this design proposal by a SpliceBlob rpc. Maybe it is worth mentioning that in this case it is not necessarily required for client and server to agree upon the same splitting algorithm, since after the first round-trip overhead, the chunking algorithm for each direction ensures efficient reuse anyway. I will update this proposal to handle uploads for you to review. Thank you very much for your interest and nice suggestions.
Do keep in mind that there could be mixed usage of clients (a) with chunking support and clients (b) without chunking support. So I do believe a negotiation via the initial GetCapabilities RPC, similar to the current digest and compressor negotiation, is very desirable, as the server would need to know how to put a split blob upload from (a) back together to serve it to (b). I would recommend throwing the design ideas into #178. It's not yet settled whether chunking support needs to be a V3-exclusive feature, or whether we could do it as part of V2. Discussion to help nudge the issue forward would be much appreciated.
@sluongng I have updated the PR with a sharper description of what is meant by this blob-splitting approach and what its goal is, and with a proposal for the chunked upload of large blobs. Some thoughts about your hints regarding the capabilities negotiation between client and server:
This means each side is responsible for its own chunking approach without the other side having to know about it; the other side just needs to be able to concatenate the chunks. Furthermore, it would be difficult to agree, e.g., on the same FastCDC algorithm, since this algorithm internally depends on an array of 256 random numbers (generated by the implementer) and thus could result in completely different chunk boundaries for two different implementations, preventing any reuse between the chunks on the server and the client. I will also put a summary of this blob splitting and splicing concept into #178. It would be nice if this concept could find its way into REv2, since it is just an optional extension and not an invasive modification.
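To illustrate the point about the 256 random numbers: two implementations that fill their gear tables differently will compute different fingerprints for identical data and therefore place chunk boundaries differently. A small Go sketch, where a seeded PRNG stands in for whatever table an implementation hard-codes:

```go
package main

import (
	"fmt"
	"math/rand"
)

// gearTable builds the 256-entry random table a FastCDC implementation uses
// internally. Different implementations pick different (often hard-coded)
// values; here a seeded PRNG stands in for that choice.
func gearTable(seed int64) [256]uint64 {
	var t [256]uint64
	r := rand.New(rand.NewSource(seed))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}

// fingerprint computes the gear rolling hash of data with a given table.
func fingerprint(data []byte, gear [256]uint64) uint64 {
	var fp uint64
	for _, b := range data {
		fp = (fp << 1) + gear[b]
	}
	return fp
}

func main() {
	data := []byte("the same input data")
	a := fingerprint(data, gearTable(1)) // implementation A's table
	b := fingerprint(data, gearTable(2)) // implementation B's table
	// The fingerprints (and hence the chunk boundaries derived from them)
	// differ, which is why cross-implementation reuse needs an agreed table.
	fmt.Printf("A: %016x\nB: %016x\n", a, b)
}
```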
Do give #272 and my draft PR a read on how client/server could negotiate for a spec leveraging the GetCapabilities rpc. Could be useful if you want to have a consistent splitting scheme between client and server.
// The ordered list of digests of the chunks which need to be concatenated to
// assemble the original blob.
repeated Digest chunk_digests = 3;
}
I wonder if we should design this SpliceBlobRequest as part of BatchUpdateBlobsRequest.
The problem is that we are not pushing the chunk blobs in this request, only sending the CAS server metadata regarding combining some blobs into a larger blob. There could be a delay between the BatchUpdateBlobs RPC call and the SpliceBlob RPC call. That delay could be a few milliseconds, or it could be weeks or months, after some of the uploaded chunks have expired from the CAS server. There is no transaction guarantee between the two RPCs.
So a way to have some form of transaction guarantee would be to send this as part of the same RPC that uploads all the blobs.
The problem is that we are not pushing the chunk blobs in this request, only sending the CAS server metadata regarding combining some blobs into a larger blob.
But that is the overarching principle behind the whole protocol: we upload blobs and later refer to them by their digests. All that works because the CAS promises to keep all blobs fresh (i.e., not forget about them for a reasonable amount of time) where its answer implies it knows about them (that could be the answer to a FindMissingBlobs request or a success statement to a blob upload request). The typical workflow for a client using this request would anyway be to split the blob locally, use FindMissingBlobs to find out which blobs are not yet known to the CAS, then (batch) upload only the ones not yet known to the CAS (that's where the savings in traffic come from), and then request a splice of all of them. All this works because of the promise of the CAS to keep referenced objects alive.
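A rough sketch of that client-side sequence, using hypothetical Go stubs; the CAS interface, Digest type, and helper callbacks below are illustrative stand-ins, not the generated REv2 API:

```go
package splicesketch

import "context"

// Digest and CAS are hypothetical stand-ins for the generated REv2 stubs;
// they only illustrate the order of calls, not the real API surface.
type Digest struct {
	Hash      string
	SizeBytes int64
}

type CAS interface {
	FindMissingBlobs(ctx context.Context, digests []Digest) ([]Digest, error)
	BatchUpdateBlobs(ctx context.Context, blobs map[Digest][]byte) error
	// SpliceBlob is the RPC proposed in this PR: concatenate the listed
	// chunks (already in the CAS) into a blob with the given digest.
	SpliceBlob(ctx context.Context, blob Digest, chunks []Digest) error
}

// uploadViaSplice sketches the client-side upload path: split locally,
// upload only the chunks the server is missing, then request a splice.
func uploadViaSplice(ctx context.Context, cas CAS, blob []byte,
	split func([]byte) [][]byte, digestOf func([]byte) Digest) error {

	chunks := split(blob)
	byDigest := make(map[Digest][]byte, len(chunks))
	order := make([]Digest, 0, len(chunks))
	for _, c := range chunks {
		d := digestOf(c)
		order = append(order, d)
		byDigest[d] = c
	}

	missing, err := cas.FindMissingBlobs(ctx, order)
	if err != nil {
		return err
	}
	upload := make(map[Digest][]byte, len(missing))
	for _, d := range missing {
		upload[d] = byDigest[d]
	}
	if err := cas.BatchUpdateBlobs(ctx, upload); err != nil {
		return err
	}
	// The CAS keeps blobs it has just acknowledged alive for a reasonable
	// time, which is what makes the gap before the splice safe.
	return cas.SpliceBlob(ctx, digestOf(blob), order)
}
```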
To give a prominent example where the protocol is already relying on that guarantee to keep objects alive, consider the request to execute an action. That request does not upload any blobs, yet still expects them to be there because a recent interaction with the CAS showed they are in the CAS already. In a sense, that example also shows that blob splicing is nothing fundamentally new, but just an optimisation: the client could already now request an action calling cat to be executed, in a request that is independent of the blob upload. However, making it explicitly an operation on the CAS gives huge room for optimization: no need to spawn an action-execution environment, the CAS knows ahead of time the hash and size of the blob that is to be stored as a result of that request, and if it knows the blob in question it does not even have to do anything (apart from keeping that blob fresh), etc.
Hello @sluongng, I have updated the proposal to use the capabilities service as you proposed. It is now possible for a client to determine the chunking algorithms supported by the server and select one in a SplitBlob request. By this means, the client can select one that it also uses locally, so that both communication directions benefit from the chunking data available on each side. Furthermore, I have added some comments about the lifetime of chunks. Thanks for your time reviewing this PR!
Depending on the software project, potentially large binary artifacts need to be downloaded from or uploaded to the remote CAS. Examples are executables with debug information, comprehensive libraries, or even whole file-system images. Such artifacts generate a lot of traffic when downloaded or uploaded. The blob split API allows splitting such artifacts into chunks at the remote side, fetching only those parts that are missing locally, and finally assembling the requested blob locally from its chunks. The blob splice API allows splitting such artifacts into chunks locally, uploading only those parts that are missing remotely, and finally splicing the requested blob remotely from its chunks. Since only the binary differences from the last download/upload are fetched/uploaded, the blob split and splice API can significantly reduce network traffic between server and client.
Hello all, after spending quite some time working on this proposal and its implementation, I have finished incorporating all suggestions made by the reviewers and those that came up during the working-group meeting. Finally, the following high-level features would be added to the REv2 protocol:
This whole proposal is fully implemented in our own remote-execution implementation in justbuild: and used by the just client: From my side, this proposal is finished and ready for final review. What I would like to know from you is what now needs to be done for this proposal to finally get merged into main. I can also summarize it again at the next working-group meeting and would ideally like a decision on how to proceed with this proposal. Thank you very much for your efforts.
Do we already effectively have the ability to splice blobs by using the bytestream API with read_offset and read_limit?
// The digest of the blob to be split.
Digest blob_digest = 2;

// The chunking algorithm to be used. Must be IDENTITY (no chunking) or one of
Should we instead reject IDENTITY as an invalid argument? I imagine this would only be used by broken clients?
Not sure about that; I have basically copied the pattern from PR #276 to include a sane default value. I leave that open to your decision; I have no objections to changing this.
I have updated this field from IDENTITY to DEFAULT because, as you mentioned, IDENTITY does not really make sense to be requested by a client. Instead, to provide a proper default value for the chunking algorithm enum, I have introduced DEFAULT, which means the client does not care which exact chunking algorithm is used by the server and the server should just use its default implementation. If a client wants to negotiate more explicitly about the chunking algorithm used, it should specify one of the other enum values that are supported and advertised by the server.
I hope this resolves your concerns? @mostynb
@mostynb, as far as I have understood the protocol, no. While the bytestream API with read_offset and read_limit allows you to partially read the content of a blob, it does not allow you to create a new blob from a batch of other blobs (its chunks) at the remote CAS.

The goal of blob splicing is that if a client regularly uploads slightly different large objects to the remote CAS, only the binary differences between the versions need to be uploaded, not the entire block of binary data every time. To achieve this, the client splits the large object into reusable chunks (typically done by content-defined chunking) and uploads only the chunks (handled as blobs) that are missing at the remote CAS, which is normally a lot when uploading for the first time. If the client needs to upload this large object again, but a slightly different version of it (meaning only a fraction of the binary data has changed), it again splits it into chunks and tests which chunks are missing at the remote CAS. Normally, content-defined chunking splits the binary data that hasn't changed into the same set of chunks; different chunks are only created where binary differences occur. This means only a fraction of the whole set of chunks needs to be uploaded to the remote CAS in order to be able to reconstruct the second version of the large object at the remote CAS. The actual reconstruction of a large blob at the remote side is done using the splice command with a description of which chunks need to be concatenated (a list of chunk digests available at the remote CAS).

The split operation works exactly the other way around, when you regularly download an ever-changing large object from the remote CAS. Then the server splits the large object into chunks, and the client fetches only the locally missing chunks and reconstructs the large object locally from the locally available chunks.

Finally, to exploit chunking for both directions at the same time, it makes sense that the client and the server agree on a chunking algorithm, to allow reusing chunks created on both sides. For this, we added a negotiation mechanism to agree on the chunking algorithm used on both sides.
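For the download direction described above, the sequence could look roughly like this; SplitBlob and ReadBlob are hypothetical stand-ins for the proposed RPC and the existing blob read path, and the local chunk store is reduced to a map for illustration:

```go
package splitsketch

import (
	"bytes"
	"context"
)

// Digest is the same illustrative type as in the upload sketch above.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// SplitCAS is a hypothetical client interface for the download direction.
type SplitCAS interface {
	// SplitBlob asks the server to split the blob and return the ordered
	// list of chunk digests (the chunks themselves become CAS blobs).
	SplitBlob(ctx context.Context, blob Digest) ([]Digest, error)
	ReadBlob(ctx context.Context, d Digest) ([]byte, error)
}

// downloadViaSplit fetches a large blob by downloading only those chunks
// that are not already present in the local store, then concatenating the
// chunks locally in the order returned by the server.
func downloadViaSplit(ctx context.Context, cas SplitCAS, blob Digest,
	local map[Digest][]byte) ([]byte, error) {

	chunkDigests, err := cas.SplitBlob(ctx, blob)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	for _, d := range chunkDigests {
		data, ok := local[d]
		if !ok {
			// Only chunks unknown locally cause network traffic.
			if data, err = cas.ReadBlob(ctx, d); err != nil {
				return nil, err
			}
			local[d] = data
		}
		buf.Write(data)
	}
	// A real client would verify that the assembled data hashes to `blob`.
	return buf.Bytes(), nil
}
```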
// (Algorithm 2, FastCDC8KB). The algorithm is configured to have the
// following properties on resulting chunk sizes.
// - Minimum chunk size: 2 KB
// - Average chunk size: 8 KB
I understand that using small chunk sizes, such as 8 KB, can increase the likelihood of deduplication and may also reduce the risk of disk storage fragmentation. However, have you considered if there is potential performance overhead of having too many fine-grained CAS blobs?
I can envision that a feature like this could also be beneficial in distributing the load more evenly across multiple CAS shards. But for such use cases, it might make sense to use much larger chunks, perhaps 8 MB? Should we somehow also accommodate larger chunks in this PR?
We experimented with FastCDC on approximately 5TB of real Bazel data from many different codebases and found that 0.5MB is a good trade-off between space savings and metadata overhead. Too small of a chunk size means the metadata for all chunks becomes very large / numerous, while too large of a chunk size means poor space savings.
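As a rough back-of-the-envelope illustration of that trade-off (assuming on the order of 40 bytes of metadata per chunk reference, i.e., hash plus size, which is an assumption for this example):

```latex
% Chunk-metadata volume for a 1 GiB blob at two average chunk sizes,
% assuming ~40 bytes per chunk reference:
\[
\frac{1\,\mathrm{GiB}}{8\,\mathrm{KiB}} = 131{,}072 \text{ chunks}
  \;\Rightarrow\; \approx 5\,\mathrm{MiB} \text{ of chunk metadata},
\qquad
\frac{1\,\mathrm{GiB}}{512\,\mathrm{KiB}} = 2{,}048 \text{ chunks}
  \;\Rightarrow\; \approx 80\,\mathrm{KiB}.
\]
```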
Thanks @luluz66, I think 0.5 MB is more reasonable than 8 KB.
Do you think there is one value that would fit all, or should a size like this be allowed to be tuned in a flexible way?
Thank you @luluz66 for your experiments. Indeed, we did not evaluate storage-consumption trade-offs since we were mainly interested in traffic reduction. I think 500 KB average chunk size is a sane default value.
0.5 MB was the ideal range for us based on the Bazel-specific data set we were testing against. However, there would be no telling whether that number would be different for a different client/server pair, or a different data set.
So I think we would want a discovery mechanism for the FastCDC configuration on the server side. The client should follow the server's advertised setting in order to achieve the best result. WDYT?
// Content-defined chunking using Rabin fingerprints. An implementation of
// this scheme is presented in this paper
// https://link.springer.com/chapter/10.1007/978-1-4613-9323-8_11. The final
This document is behind a paywall. Any chance we can link to a spec that is freely accessible?
Is it this paper? http://cui.unige.ch/tcs/cours/algoweb/2002/articles/art_marculescu_andrei_1.pdf
Even though that paper provides a fairly thorough mathematical definition of how Rabin fingerprints work, it's not entirely obvious to me how it translates to an actual algorithm for us to use. Any chance we can include some pseudocode, or link to a publication that provides it?
Exactly, that is the paper; I can update the link. I really spent some time trying to find a nice algorithmic description for the Rabin fingerprint method, but failed. I could only find some real implementations on GitHub, but I assume that is not something we would like to link here. There is also the original paper of the Rabin method http://www.xmailserver.org/rabin.pdf, but that one doesn't seem to help either. The paper above gives a reasonable introduction to how Rabin fingerprints work and even some thoughts about how to implement them, so I thought that is the best source to link here.
Actually, the paper about FastCDC https://ieeexplore.ieee.org/document/9055082 also contains an algorithmic description of the Rabin fingerprinting technique (Algorithm 3, RabinCDC8KB). The only thing that is missing there is a precise description of the precomputed U and T arrays. Even though they provide links to other papers where these arrays are supposedly defined, I could not find any definition in those papers.
// - Maximum chunk size: 2048 KB
// The irreducible polynomial to be used for the modulo divisions is the
// following 64-bit polynomial of degree 53: 0x003DA3358B4DC173. The window
// size to be used is 64 bits.
[ NOTE: I'm absolutely not an expert on content defined chunking algorithms! ]
Is a window size of 64 bits intentional? If I look at stuff like https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf, it seems they are using 48 bytes, while in their case they are aiming for min=2KB, avg=8KB, max=64KB. Shouldn't this be scaled proportionally?
To me it's also not fully clear why an algorithm like this makes a distinction between a minimum chunk size and a window size. Why wouldn't one simply pick a 128 KB window and slide over that?
You are right, that was a mistake on my side; what was meant is a 64-byte window size. I took this value from an existing implementation. Thanks for this finding!
The window size is a general parameter of a rolling hash function: the hash value (or fingerprint) for a specific byte position is calculated over that window of bytes. Then you move forward by one byte and calculate the hash value for the new window of bytes again. Thanks to the rolling-hash property, this process can be done very efficiently. So the window size influences the hash value for a specific byte position and thus the locations of the actual chunk boundaries. Theoretically, we could use a window size equal to the minimum chunk size of 128 KB, but it is not common to use such a large window size in the implementations of content-defined chunking algorithms I have seen so far.
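To illustrate the rolling-hash property being described, here is a minimal Go sketch of a polynomial rolling hash over a fixed window (Rabin-Karp style rather than the irreducible-polynomial Rabin fingerprint from the paper); the window size and multiplier are example values only:

```go
package rolling

// A minimal polynomial rolling hash over a fixed-size window. When the
// window slides forward by one byte, the hash is updated in O(1) instead
// of being recomputed from scratch over the whole window.
const (
	windowSize = 64  // bytes, as discussed above
	base       = 257 // example multiplier
)

type Hash struct {
	window [windowSize]byte
	pos    int
	value  uint64
	pow    uint64 // base^(windowSize-1), used to remove the outgoing byte
	filled int
}

func New() *Hash {
	h := &Hash{pow: 1}
	for i := 0; i < windowSize-1; i++ {
		h.pow *= base
	}
	return h
}

// Roll slides the window forward by one byte and returns the new hash value.
func (h *Hash) Roll(in byte) uint64 {
	out := h.window[h.pos]
	h.window[h.pos] = in
	h.pos = (h.pos + 1) % windowSize
	if h.filled < windowSize {
		// Still filling the first window: just append the new byte.
		h.filled++
		h.value = h.value*base + uint64(in)
		return h.value
	}
	// Remove the contribution of the outgoing byte, then append the new one.
	h.value = (h.value-uint64(out)*h.pow)*base + uint64(in)
	return h.value
}
```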
// implementation of this algorithm should be configured to have the
// following properties on resulting chunk sizes.
// - Minimum chunk size: 128 KB
// - Average chunk size: 512 KB (0x000000000007FFFF bit mask)
Assuming that the idea is that you process input byte for byte, compute a fingerprint over a sliding window, and create a chunk if the last 19 bits of the hash are all zeroes or ones (which is it?), are you sure that this will give an average chunk size of 512 KB? That would only hold if there was no minimum size, right? So shouldn't the average chunk size be 128+512 = 640 KB?
All of the 19 lowest bits of the fingerprint need to be zero for the expression (fp & mask) to become 0, in which case you have found a chunk boundary. Regarding your question about the average chunk size, you are right: the actual average chunk size = expected chunk size (512 KB) + minimum chunk size (128 KB). I will state this more clearly in the comments, thanks for pointing this out!
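In other words, with a boundary probability of p = 2^-19 per byte position after the minimum size, the expected distance to the next boundary is 1/p, so:

```latex
\[
p = 2^{-19}, \qquad
\mathbb{E}[\text{chunk size}] \approx \text{min} + \tfrac{1}{p}
  = 128\,\mathrm{KB} + 512\,\mathrm{KB} = 640\,\mathrm{KB}
\]
% (ignoring the truncation at the 2048 KB maximum, which pulls the mean down slightly)
```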
I just did some measurements, and it looks like using fp & mask == 0 is actually a pretty bad choice. The reason being that it's pretty easy to craft windows for which fp == 0, namely a sequence consisting exclusively of 0-bytes. This has also been observed here:
https://ieeexplore.ieee.org/document/9006560

"Although using a cut value of zero seems to be a natural choice in theory, it turns out to be a poor choice in practice (Figure 2)."

I noticed this especially when creating chunks of a Linux kernel source tarball (decompressed), for the same reason as stated in the article:

"It turns out that tar files use zeroes to pad out internal file structures. These padded structures cause an explosion of 4-byte (or smaller) length chunks if the cut value is also zero. In fact, over 98% of the chunks are 4-bytes long (Figure 2, (W16,P15,M16,C0) Table I)."

I didn't observe the 4-byte chunks, for the reason that I used a minimum size as documented. Results look a lot better if I use fp & mask == mask.
// https://link.springer.com/chapter/10.1007/978-1-4613-9323-8_11. The final
// implementation of this algorithm should be configured to have the
// following properties on resulting chunk sizes.
// - Minimum chunk size: 128 KB
Does this mean that there is a...
1 - ((2^19-1)/(2^19))^(128*1024) = 22.12...%
... probability that chunks actually contain a cutoff point that wasn't taken because it would violate the minimum chunk size? That probability sounds a lot higher than I'd imagine to be acceptable.
Consider the case where you inject some arbitrary data close to the beginning of a chunk that had a cutoff point within the first 128 KB that wasn't respected. If the injected data causes the cutoff point to be pushed above the 128 KB boundary, that would cause the cutoff point to be respected. This in turn could in its turn cause successive chunks to use different cutoff points as well.
Maybe it makes more sense to pick 16 KB or 32 KB here?
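For what it's worth, the 22% figure checks out with the usual exponential approximation:

```latex
\[
1 - \left(1 - 2^{-19}\right)^{128 \cdot 1024}
  \approx 1 - e^{-131072/524288}
  = 1 - e^{-0.25}
  \approx 0.2212
\]
```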
// following properties on resulting chunk sizes.
// - Minimum chunk size: 128 KB
// - Average chunk size: 512 KB (0x000000000007FFFF bit mask)
// - Maximum chunk size: 2048 KB
Assuming my understanding of the algorithm is correct, doesn't that mean that there is a...
((2^19-1)/(2^19))^((2048-128)*1024) = 2.352...%
... probability that chunks end up reaching the maximum size? This sounds relatively high, but is arguably unavoidable.
A bit odd that these kinds of algorithms don't attempt to account for this, for example by repeatedly rerunning the algorithm with a smaller bit mask (18, 17, 16, [...] bits) until a match is found. That way you get a more even spread of data across such chunks, and need to upload fewer chunks in case data is injected into/removed from neighbouring chunks that both reach the 2 MB limit.
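The same approximation reproduces this figure:

```latex
\[
\left(1 - 2^{-19}\right)^{(2048-128)\cdot 1024}
  \approx e^{-1966080/524288}
  = e^{-3.75}
  \approx 0.0235
\]
```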
// - Minimum chunk size: 128 KB
// - Average chunk size: 512 KB
// - Maximum chunk size: 2048 KB
Looking at algorithm 2 in that paper, I see that I can indeed plug in these values to MinSize, MaxSize, and NormalSize. But what values should I use for MaskS and MaskL now? (There's also MaskA, but that seems unused by the stock algorithm.)
Good question. As far as I have understood the algorithm, it is the normalized chunking technique they use which allows keeping these mask values as they are while changing the min/max/average chunk sizes according to your needs.
Looking at algorithm 2 in that paper that attempts to compute 8 KB chunks, I think the probability of a random chunk having size at most x should be as follows (link):

Now if I change that graph to use the minimum/average/maximum chunk sizes that you propose while leaving the bitmasks unaltered, I see this (link):

If I change it so that MaskS has 21 bits and MaskL has 17, then the graph starts to resemble the original one (link):

So I do think the masks need to be adjusted as well.
Hmm... Making the graphs above was meaningful. I think it gives good insight into why FastCDC was designed the way it is. They are essentially trying to mimic a normal distribution (link):

Using a literal normal distribution would not be desirable, because it means that the probability that a chunk is created at a given point not only depends on the data within the window, but also on the size of the current chunk. And this is exactly what CDC tries to prevent. So that's why they emulate it using three partial functions.

Yeah, we should make sure to set MaskS and MaskL accordingly.
Wow, thanks @EdSchouten for this great analysis. I have to admit I did not look into these mask values in that much detail, because it also was not really explained in detail in the paper, but your concerns are absolutely right and we have to set the mask values accordingly when the min/max/average chunk sizes are changed. I am just asking myself, since the paper authors mentioned they derived the mask values empirically, how should we adapt them?
@luluz66, @sluongng, since you mentioned you were working with the FastCDC algorithm with chunk sizes of 500 KB, I am wondering how you handled the adaptation of the mask values, or whether you did it at all. Can you share your experience here? Thank you so much.
We have our fork of CDC implementation here https://github.com/buildbuddy-io/fastcdc-go/blob/47805a2ecd550cb875f1b797a47a1a648a1feed1/fastcdc.go#L162-L172
This work is very much in progress and not final. We hope that the current API will come with configurable knobs so that downstream implementations could choose what's best for their use cases and data.
I am just asking myself, since the paper authors mentioned they derived the mask values empirically, how should we adapt them?
I think that the authors of the paper just used the following heuristics:
- Don't use the bottom 16 bits, because those likely contain low quality data.
- Don't use the top 16 bits to get an equal comparison against many common Rabin fingerprinting implementations that use a 48 byte window.
(1) sounds like a good idea, but (2) can likely be ignored. So just pick 21 out of the top 48 bits, and then another 17 bits that are a subset of the former.
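Following that heuristic, one possible (purely illustrative) construction of the masks for a 512 KB (2^19) average with normalization level 2, i.e., 21 bits for MaskS and 17 for MaskL, could look like this in Go; the contiguous bit placement is a simplification, since the published FastCDC masks spread their bits and were chosen empirically:

```go
package masks

// buildMasks demonstrates one way to follow the heuristic above: pick effS
// bits for MaskS out of bits 16..47 of the fingerprint (skipping the
// low-quality bottom 16 bits), and let MaskL be a subset of MaskS with effL
// bits. The contiguous placement is only for illustration.
func buildMasks(effS, effL int) (maskS, maskL uint64) {
	bit := 16 // start above the bottom 16 bits
	for i := 0; i < effS && bit < 48; i++ {
		maskS |= uint64(1) << uint(bit)
		if i < effL {
			maskL |= uint64(1) << uint(bit) // MaskL is a subset of MaskS
		}
		bit++
	}
	return maskS, maskL
}

// For a 512 KB (2^19) average chunk size with normalization level 2:
// 21 effective bits before the average size, 17 after it.
var MaskS, MaskL = buildMasks(21, 17)
```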
Hi @roloffs, Earlier today I spent some time experimenting with FastCDC. In addition to implementing Algorithm 2 "FastCDC8KB" from the paper, I also wrote something corresponding to the following:
The following thought has gone into this variant:
If implemented naively, the algorithm above will have a worse worst-case running time. Namely, we always compute

With regards to performance of the chunking performed, I downloaded some different versions of the Linux kernel source code. Because the upstream tarballs contain timestamps, I unpacked them and concatenated all of the files contained within, which gave me some ~1.4 GB files, consisting mostly of text. I cut these files into ~10 KB chunks using both FastCDC8KB and the algorithm described above, giving me ~140k chunks per kernel. Comparing Linux 6.7.10 with Linux 6.8.1, I see that:
This means that the algorithm described above performs

Would you by any chance be interested in trying to reproduce these results? I'd be interested in knowing whether these savings hold in general.
@EdSchouten what do you think would be the next step here for this PR? To me, it seems like we have established the value of using FastCDC as one of the potential chunking algorithms. I think this means that we're going to need a mechanism for the server to advertise the desired FastCDC config, and for the client to comply accordingly. As for the modified FastCDC, if we cannot expose it as a discoverable configuration knob, then we could add it as a new chunking algorithm after FastCDC is merged. WDYT?
I have no opinion whatsoever on what should happen here. As I mentioned during the last working group meeting, I have absolutely no intent to implement any of this on the Buildbarn side as part of REv2. I don't think that there is an elegant way we can get this into the existing protocol without making unreasonable sacrifices. For example, I care about data integrity. So with regards to what the next steps are, that's for others within the working group to decide.

That said, I am more than willing to engage in more discussions on how we should address this as part of REv3. First and foremost, I think that the methodology used to chunk objects should not be part of the lower-level storage. Files should be Merkle trees that are stored in the CAS in literal form. What methodology is used to chunk files should only need to be specified by clients to ensure that workers chunk files in a way that is consistent with locally created files. Therefore, the policy to chunk should in REv3 most likely be stored in its equivalent of the Command message, not in any of the capabilities.
FYI: If other people want to do some testing in this area, I have just released the source code for the algorithm described above: https://github.com/buildbarn/go-cdc
Hey @roloffs, we had a monthly REAPI meeting this week and the maintainers have concluded that we should push this PR forward.
With that said, please let me know if you still have the capacity to work on this PR. cc: @buchgr @EdSchouten
Hello @sluongng, sorry for not being responsive for a longer time; I was on parental leave from work for three months and will catch up on everything during the next days. I am willing to finish this PR and also have the capacity to do so from now on. Still, if you are willing to support, it would be appreciated, since I have to consider and incorporate all the great comments from @EdSchouten. Today there is a Remote Execution API Working Group meeting; however, I won't attend since there is not much to report. I will do my best to finish everything by the next meeting in August.
This is a proposal of a conservative extension to the ContentAddressableStorage service, which allows reducing traffic when blobs are fetched from the remote CAS to the host for local usage or inspection. With this extension it is possible to request a remote-execution endpoint to split a specified blob into chunks of a certain average size. These chunks are then stored in the CAS as blobs and the ordered list of chunk digests is returned. The client can then check which blob chunks are available locally from earlier fetches and fetch only the missing chunks. By using the digest list, the client can splice the requested blob from the locally available chunk data.

This extension could especially help to reduce traffic if large binary files are created at the remote side and needed locally, such as executables with debug information, comprehensive libraries, or even whole file-system images. It is a conservative extension, so no client is forced to use it. In our build-system project justbuild, we have implemented this protocol extension for the server and client side.
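The local splice step mentioned above boils down to concatenating the chunks in the given order and verifying that the result hashes to the requested digest. A minimal sketch (SHA-256 assumed here; the digest function actually in use may differ):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// spliceAndVerify concatenates chunks in the order given by the server and
// checks that the result matches the expected blob digest.
func spliceAndVerify(chunks [][]byte, wantHash string, wantSize int64) ([]byte, error) {
	var buf bytes.Buffer
	for _, c := range chunks {
		buf.Write(c)
	}
	blob := buf.Bytes()
	sum := sha256.Sum256(blob)
	if int64(len(blob)) != wantSize || hex.EncodeToString(sum[:]) != wantHash {
		return nil, fmt.Errorf("spliced blob does not match requested digest")
	}
	return blob, nil
}

func main() {
	chunks := [][]byte{[]byte("hello "), []byte("world")}
	sum := sha256.Sum256([]byte("hello world"))
	blob, err := spliceAndVerify(chunks, hex.EncodeToString(sum[:]), 11)
	fmt.Println(string(blob), err)
}
```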