Author: Ed Schouten
Date: 2020-04-16
The Remote Execution protocol that is implemented by Buildbarn defines two data stores: the Action Cache (AC) and the Content Addressable Storage (CAS). The CAS is where almost all data is stored in terms of size (99%+), as it is used to store input files and output files of build actions, but also various kinds of serialized Protobuf messages that define the structure of build actions.
As the name implies, the CAS is content addressed, meaning that all objects are identified by their content, in this case a cryptographic checksum (typically SHA-256). This permits Buildbarn to automatically deduplicate identical objects and discard objects that are corrupted.
Whereas objects stored in the AC are all similar in size (kilobytes), the sizes of objects stored in the CAS vary wildly. Small Directory objects are only hundreds of bytes in size, while container layers created with rules_docker may consume a gigabyte of space. This has proven problematic for Buildbarn setups that use sharded storage: each shard receives a comparable number of requests, but the amount of network traffic and CPU load those requests generate fluctuates strongly.
Buildbarn has code in place to checksum-validate data coming from untrusted sources. For example, bb_worker validates all data received from bb_storage, and bb_storage validates all data obtained from clients, as well as data read from disk. This is done for security reasons, but also to ensure that bugs in storage code don't lead to execution of malformed build actions. Unfortunately, there is an increasing number of use cases where partial reads need to be performed (lazy-loading file systems for bb_worker, Bazel's ability to resume aborted transfers, etc.). Such partial reads currently still cause entire objects to be read, simply to satisfy the checksum validation process. This is wasteful.
To address these issues, this ADR proposes the addition of some new features to Buildbarn's codebase that make it possible to eliminate such large CAS objects altogether.
In a nutshell, the idea is to partition files whose size exceeds a certain threshold into smaller blocks. Each of the blocks will have a fixed size, except the last block in a file, which may be smaller. The question then becomes what naming scheme is used to store them in the CAS.
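As an illustration of the partitioning itself (the naming question is addressed below), the following sketch computes the block layout for a blob of a given size. The helper name and the 2 MiB block size are purely illustrative; in practice the block size would be configurable.

```go
package main

import "fmt"

// Example block size of 2 MiB; the real value would be configurable.
const blockSizeBytes = 2 * 1024 * 1024

// blockSizes returns the sizes of the blocks into which a blob of the
// given size would be decomposed: all blocks have the same size,
// except the last one, which may be smaller.
func blockSizes(blobSizeBytes int64) []int64 {
	var sizes []int64
	for remaining := blobSizeBytes; remaining > 0; remaining -= blockSizeBytes {
		if remaining > blockSizeBytes {
			sizes = append(sizes, blockSizeBytes)
		} else {
			sizes = append(sizes, remaining)
		}
	}
	return sizes
}

func main() {
	// A 5 MiB blob decomposes into two 2 MiB blocks and one 1 MiB block.
	fmt.Println(blockSizes(5 * 1024 * 1024))
}
```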
The simplest naming scheme would be to give all blocks a suffix that indicates the offset in the file. For example, a file that is decomposed into three blocks could use digests containing hashes <HASH>-0, <HASH>-1 and <HASH>-2, where <HASH> refers to the original file hash. Calls to BlobAccess.Get(), Put() and FindMissing() could be implemented to expand to a series of digests containing the block number suffix.
Unfortunately, this approach is too simplistic. Because <HASH> refers to the checksum of the file as a whole, there is no way the consistency of individual blocks can be validated. If blocks are partitioned across shards, there is no way an individual shard can verify the integrity of its own data.
A solution would be to store each of the blocks using their native digest, meaning a hash of each of them is computed. This on its own, however, would only allow storing data. There would be no way to look up blocks afterwards, as only the combined hash is known. To be able to retrieve the data, we'd need to store a separate manifest that contains a list of the digests of all blocks in the file.
Even though such an approach would allow us to validate the integrity of individual blocks of data, conventional hashing algorithms wouldn't allow us to verify the manifests themselves. There is no way SHA-256 hashes of individual blocks can be combined into a SHA-256 hash of the file as a whole; SHA-256's Merkle-Damgård construction does not allow it.
As part of their BuildXL infrastructure, Microsoft published a hashing algorithm called "VSO-Hash" or "Paged File Hash". VSO-Hash works by first computing SHA-256 hashes for every 64 KiB page of data. Then, 32 page hashes (for 2 MiB of data) are taken together and hashed using SHA-256 to form a block hash. Finally, all block hashes are combined by folding them.
By letting clients use VSO-Hash as a digest function, Buildbarn could be altered to decompose blobs either in 64 KiB or 2 MiB blocks, or both (i.e., decomposing into 64 KiB chunks, but having two levels of manifests). This would turn large files into Merkle trees of depth two or three. Pages, blocks and their manifests could all be stored in the CAS and validated trivially.
Some work has already been done in the Bazel ecosystem to facilitate this. For example, the Remote Execution protocol already contains an enumeration value for VSO-Hash, which was added by Microsoft in 2019. There are, however, various reasons not to use VSO-Hash:
- Microsoft seemingly only added support for VSO-Hash as a migration aid. There are efforts to migrate BuildXL from VSO-Hash to plain SHA-256. The author has mentioned that support for VSO-Hash will likely be retracted from the protocol once this migration is complete.
- Bazel itself doesn't implement it.
- It is somewhat inefficient. Because the full SHA-256 algorithm is used at every level, the Merkle-Damgård construction is used excessively. The algorithm could have been simpler by defining its scheme directly on top of the SHA-256 compression function.
- Algorithms for validating files as a whole, (manifests of) pages and (manifests of) blocks are all different. Each of these needs a separate implementation, as well as a distinct digest format, so that storage nodes know which validation algorithm to use upon access.
- It effectively leaks the decomposition strategy into the client. VSO-Hash only allows decomposition at the 64 KiB and 2 MiB level. Decomposition at any other power of two would require the use of an entirely different hashing algorithm.
At around the time this document was written, a new hashing algorithm was published, called BLAKE3. As the name implies, BLAKE3 is a successor of BLAKE, an algorithm that took part in the SHA-3 competition. In the end BLAKE lost to Keccak, not because it was in any way insecure, but because it was too similar to SHA-2. The goal of the SHA-3 competition was to have a backup in case fundamental security issues in SHA-2 were ever found.
An interesting improvement of BLAKE3 over its predecessor BLAKE2 is that it no longer uses the Merkle-Damgård construction. Instead, chaining is only used for chunks up to 1 KiB in size. When hashing objects that are larger than 1 KiB, a binary Merkle tree of 1 KiB chunks is constructed. The final hash value is derived from the Merkle tree's root node. If clients were to use BLAKE3, Buildbarn would thus be able to decompose files into blocks that are any power of two, at least 1 KiB in size.
BLAKE3 uses an Extendable-Output Function (XOF) to post-process the state of the hasher into an output sequence of arbitrary length. Because this process invokes the BLAKE3 compression function, it is not reversible. This means that manifests cannot contain hashes of their blocks, as that would not allow independent integrity checking of the manifest. Instead, manifest entries should hold the state of the hasher (i.e., the raw Merkle tree node). That way it is possible both to validate the manifest and to convert its entries to block digests.
For chunk nodes (that represent 1 KiB of data or less), the hasher state can be stored in 97 bytes (256 bit initialization vector, 512 bit message, 7 bit size, 1 bit chunk-start flag). For parent nodes (that represent more than 1 KiB of data), the hasher state is only 64 bytes (512 bit message), as other parameters of the compression function are implied. Because of that, decomposition of files into 1 KiB blocks should be discouraged. Buildbarn should only support decomposition into blocks that are at least 2 KiB in size. That way manifests only contain 64 byte entries, except for the last entry, which may be 64 or 97 bytes in size.
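As a worked example of the sizes described above, the hypothetical helper below computes how large a manifest would be for a given blob and block size (both in bytes). The 64 and 97 byte constants are the Merkle tree node sizes mentioned in the previous paragraph; everything else is illustrative.

```go
package main

import "fmt"

// manifestSizeBytes computes how large a manifest would be, based on
// the node sizes described above: 64 bytes per parent node and 97
// bytes for a chunk node (a final block of at most 1 KiB of data).
// blockSizeBytes is assumed to be a power of two of at least 2 KiB.
func manifestSizeBytes(blobSizeBytes, blockSizeBytes int64) int64 {
	blockCount := (blobSizeBytes + blockSizeBytes - 1) / blockSizeBytes
	lastBlockSizeBytes := blobSizeBytes - (blockCount-1)*blockSizeBytes
	if lastBlockSizeBytes <= 1024 {
		// The final block fits in a single chunk, so its manifest
		// entry stores the larger chunk node state.
		return (blockCount-1)*64 + 97
	}
	return blockCount * 64
}

func main() {
	// A 16 MiB blob decomposed into 2 MiB blocks yields a manifest of
	// 8 * 64 = 512 bytes.
	fmt.Println(manifestSizeBytes(16*1024*1024, 2*1024*1024))
}
```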
The BLAKE3 compression function takes a t argument that is used for two different purposes. First of all, it is used as a counter for every 512 bits of hash output generated by the XOF. Secondly, it contains the Chunk Counter when compressing input data. Effectively, the Chunk Counter causes every 1 KiB chunk of data to be hashed in a different way, depending on the offset at which it is stored within the file.
For our specific use case, it is not desirable that hashing of data is offset dependent. It would require that decomposed blocks contain additional metadata specifying at which offset the data was stored in the original file. Otherwise, there would be no way to validate the integrity of a block independently. It also rules out the possibility of deduplicating large sections of repetitive data (e.g., deduplicating a large file that contains only null bytes to just a single chunk).
According to section 7.5 of the BLAKE3 specification, the Chunk Counter is not strictly necessary for security, but discourages optimizations that would introduce timing attacks. Though timing attacks are a serious problem, we can argue that in the case of the Remote Execution protocol such timing attacks already exist. For example, identical files stored in a single Directory hierarchy will only be uploaded to/downloaded from the CAS once.
For this reason, this ADR proposes adding support for a custom hashing algorithm to Buildbarn, which we will call BLAKE3ZCC. BLAKE3ZCC is identical to regular BLAKE3, except that it uses a Zero Chunk Counter. BLAKE3ZCC generates exactly the same hash output as BLAKE3 for files that are 1 KiB or less in size, as those files fit in a single chunk.
All of the existing hashing algorithms supported by the Remote Execution protocol have a different hash size. Buildbarn makes use of this property: if it encounters a Digest message whose hash value is 64 characters in size, it assumes that it refers to an object that is SHA-256 hashed. Adding support for BLAKE3ZCC would break this assumption, as BLAKE3 hashes may have any size, which makes matching by length impossible. The default hash length of 256 bits would also collide with SHA-256.
To solve this, we could extend Digest to make it explicit which hashing algorithm is used. The existing string hash = 1 field could be replaced with the following:
```proto
message Digest {
  oneof hash {
    // Used by all of the existing hashing algorithms.
    string other = 1;

    // Used to address BLAKE3ZCC hashed files, or individual blocks in
    // case the file has been decomposed.
    bytes blake3zcc = 3;

    // Used to address manifests of BLAKE3ZCC hashed files, containing
    // Merkle tree nodes of BLAKE3ZCC hashed blocks. Only to be used by
    // Buildbarn, as Bazel will only request files as a whole.
    bytes blake3zcc_manifest = 4;
  }
  ...
}
```
By using type bytes for these new fields instead of storing a base16 encoded hash in a string, we cut down the size of Digest objects by almost 50%. This causes a significant space reduction for Action, Directory and Tree objects.
Unfortunately, oneof cannot be used here, because Protobuf implementations such as go-protobuf don't guarantee that fields are serialized in tag order when oneof fields are used. This property is required by the Remote Execution protocol. To work around this, we simply declare three separate fields, where implementations should ensure only one field is set.
```proto
message Digest {
  string hash_other = 1;
  bytes hash_blake3zcc = 3;
  bytes hash_blake3zcc_manifest = 4;
  ...
}
```
Digests are also encoded into the pathname strings used by the ByteStream API. To distinguish BLAKE3ZCC files and manifests from other hashing algorithms, B3Z: and B3ZM: prefixes are added to the base16 encoded hash values, respectively.
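As an illustration, and assuming the standard REv2 {instance_name}/blobs/{hash}/{size_bytes} resource name layout, ByteStream read paths would take roughly the following shape (the instance name, hash and size shown are placeholders):

```
example-instance/blobs/B3Z:<base16 hash>/<size in bytes>
example-instance/blobs/B3ZM:<base16 hash>/<size in bytes>
```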
In addition to that, a BLAKE3ZCC constant is added to the DigestFunction enum, so that Buildbarn can announce support for BLAKE3ZCC to clients.
Even though BLAKE3 only came out recently, one library for computing BLAKE3 hashes in Go exists. Though this library is of good quality, it cannot be used to compute BLAKE3ZCC hashes without making local code changes. Furthermore, this library only provides public interfaces for converting a byte stream to a hash value. In our case we need separate interfaces for converting chunks to Merkle tree nodes, computing the root node of a larger Merkle tree and converting Merkle tree nodes back to hash values. Without such features, we'd be unable to generate and parse manifests. We will therefore design our own BLAKE3ZCC hashing library, which we will place at github.com/buildbarn/bb-storage/pkg/digest/blake3zcc.
Buildbarn has an internal util.Digest data type that extends upon the Remote Execution Digest message by storing the instance name, thereby making it a fully qualified identifier of an object. It also has many operations that allow computing, deriving and transforming digests. Because supporting BLAKE3ZCC and decomposition makes this type more complex, it should first be moved out of pkg/util into its own pkg/digest package.
In addition to being extended to support BLAKE3ZCC hashing, the Digest data type will gain a new method:
```go
func (d Digest) ToManifest(blockSizeBytes int64) (manifestDigest Digest, parser ManifestParser, ok bool)
```
When called on a digest that uses BLAKE3ZCC and refers to an object large enough to be decomposed into multiple blocks, this function returns the digest of the corresponding manifest. In addition to that, it returns a ManifestParser object that may be used to extract block digests from manifest payloads, or to construct manifests. ManifestParser will look like this:
```go
type ManifestParser interface {
	// Perform lookups of blocks on existing manifests.
	GetBlockDigest(manifest []byte, off int64) (blockDigest Digest, actualOffset int64)

	// Construct new manifests.
	AppendBlockDigest(manifest *[]byte, block []byte) Digest
}
```
One implementation of ManifestParser for BLAKE3ZCC shall be provided.
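To make the intended use of these interfaces concrete, here is a rough sketch of how a caller, such as the DecomposingBlobAccess decorator described below, might use a ManifestParser both when writing a decomposed blob and when serving a read at a given offset. The helper functions are hypothetical, and all error handling and storage interaction is omitted.

```go
// buildManifest hashes each block and appends its Merkle tree node (64
// or 97 bytes) to the manifest, returning the digests under which the
// individual blocks should be stored in the CAS.
func buildManifest(parser ManifestParser, blocks [][]byte) (manifest []byte, blockDigests []Digest) {
	for _, block := range blocks {
		blockDigests = append(blockDigests, parser.AppendBlockDigest(&manifest, block))
	}
	return manifest, blockDigests
}

// blockForOffset returns the digest of the block that contains the
// provided byte offset within the original file, together with the
// offset at which that block starts.
func blockForOffset(parser ManifestParser, manifest []byte, offset int64) (Digest, int64) {
	return parser.GetBlockDigest(manifest, offset)
}
```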
The Digest.ToManifest() function and its resulting ManifestParser shall be used by a new BlobAccess decorator, called DecomposingBlobAccess. The operations of this decorator shall be implemented as follows:
- BlobAccess.Get() will simply forward the call if the provided digest does not correspond to a blob that can be decomposed. Otherwise, it will load the associated manifest from storage and return a Buffer object that dynamically loads individual blocks from the CAS when accessed (a rough sketch follows this list).
- BlobAccess.Put() will simply forward the call if the provided digest does not correspond to a blob that can be decomposed. Otherwise, it will decompose the input buffer into smaller buffers for every block and write those into the CAS. In addition to that, it will write a manifest object into the CAS.
- BlobAccess.FindMissing() will forward the call, except that digests of decomposable objects will be translated to the digests of their manifests. Manifests that are present will then be loaded from the CAS, followed by checking the existence of each of the individual blocks.
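A minimal sketch of what the Get() path could look like is shown below. The package layout and the signature of NewCASConcatenatingBuffer() are not specified by this ADR, so the ones used here are assumptions made purely for illustration.

```go
package decomposing

import (
	"context"

	"github.com/buildbarn/bb-storage/pkg/blobstore"
	"github.com/buildbarn/bb-storage/pkg/blobstore/buffer"
	"github.com/buildbarn/bb-storage/pkg/digest"
)

// decomposingBlobAccess is a sketch of the decorator described above.
// The embedded BlobAccess is the storage backend being decorated.
type decomposingBlobAccess struct {
	blobstore.BlobAccess
	blockSizeBytes int64
}

func (ba *decomposingBlobAccess) Get(ctx context.Context, blobDigest digest.Digest) buffer.Buffer {
	manifestDigest, parser, ok := blobDigest.ToManifest(ba.blockSizeBytes)
	if !ok {
		// The blob is too small to be decomposed; simply forward the call.
		return ba.BlobAccess.Get(ctx, blobDigest)
	}
	// Load the manifest and return a buffer that lazily fetches the
	// individual blocks from the CAS as they are being read. The
	// signature of NewCASConcatenatingBuffer() is hypothetical.
	manifest := ba.BlobAccess.Get(ctx, manifestDigest)
	return buffer.NewCASConcatenatingBuffer(ctx, ba.BlobAccess, blobDigest, manifest, parser)
}
```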
To be able to implement DecomposingBlobAccess, the Buffer layer will need two minor extensions:
- BlobAccess.Get() needs to return a buffer that is backed by the concatenation of a sequence of block buffers. A new buffer type that implements this functionality shall be added, which can be created by calling NewCASConcatenatingBuffer().
- BlobAccess.Put() needs to read the input buffer in chunks equal to the block size. The Buffer.ToChunkReader() function currently takes a maximum chunk size argument, but provides no facilities for specifying a lower bound. This mechanism should be extended, so that ToChunkReader() can be used to read input one block at a time (a sketch of a Put() implementation that relies on this follows this list).
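Building on that extension, a Put() implementation could then look roughly as follows, continuing the decomposingBlobAccess sketch from earlier (with the standard io package added to its imports). The way the block size is passed to ToChunkReader() and the other helper signatures are assumptions made for illustration only.

```go
func (ba *decomposingBlobAccess) Put(ctx context.Context, blobDigest digest.Digest, b buffer.Buffer) error {
	manifestDigest, parser, ok := blobDigest.ToManifest(ba.blockSizeBytes)
	if !ok {
		// The blob is too small to be decomposed; simply forward the call.
		return ba.BlobAccess.Put(ctx, blobDigest, b)
	}
	// Read the input one block at a time, storing every block in the CAS
	// under its own digest while accumulating the manifest. The way the
	// block size is passed to ToChunkReader() is hypothetical.
	r := b.ToChunkReader(0, buffer.ChunkSizeExactly(ba.blockSizeBytes))
	defer r.Close()
	var manifest []byte
	for {
		block, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		blockDigest := parser.AppendBlockDigest(&manifest, block)
		if err := ba.BlobAccess.Put(ctx, blockDigest, buffer.NewValidatedBufferFromByteSlice(block)); err != nil {
			return err
		}
	}
	// Finally, store the manifest itself.
	return ba.BlobAccess.Put(ctx, manifestDigest, buffer.NewValidatedBufferFromByteSlice(manifest))
}
```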
With all of these changes in place, Buildbarn will have basic support for decomposing large objects.
With the features proposed above, support for decomposition should be complete enough to provide better spreading of load for sharded setups. In addition to that, workers with lazy-loading file systems will be able to perform I/O on large files without faulting them in entirely. There are still two areas where performance may be improved further:
- Using BatchedStoreBlobAccess, workers currently call FindMissing() before uploading output files. This is done at the output file level, which is suboptimal. When decomposition is enabled, it causes workers to load manifests from storage, even though those could be derived locally. A single absent block will cause the entire output file to be re-uploaded. To prevent this, BlobAccess.Put() should likely be decomposed into two flavours: one for streaming uploads and one for uploads of local files.
- When decomposition is enabled, FindMissing() will likely become slower, due to it requiring more roundtrips to storage. This may be solved by adding yet another BlobAccess decorator that does short-term caching of FindMissing() results (a sketch of such a decorator is included below).
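As a rough illustration of the second point, such a decorator could look as follows, reusing the imports from the earlier sketches plus the standard sync and time packages. The FindMissing() signature, the use of a plain map and the expiration strategy are simplifications; an actual implementation would also need to bound its memory usage.

```go
// existenceCachingBlobAccess is a sketch of a decorator that remembers
// which digests were recently reported to be present, so that repeated
// FindMissing() calls for the same digests don't hit storage.
type existenceCachingBlobAccess struct {
	blobstore.BlobAccess

	lock     sync.Mutex
	presence map[digest.Digest]time.Time // Expiration time per digest.
	ttl      time.Duration
}

func (ba *existenceCachingBlobAccess) FindMissing(ctx context.Context, digests []digest.Digest) ([]digest.Digest, error) {
	// Only query the backend for digests that were not recently
	// reported to be present.
	now := time.Now()
	ba.lock.Lock()
	var toCheck []digest.Digest
	for _, d := range digests {
		if expiration, ok := ba.presence[d]; !ok || now.After(expiration) {
			toCheck = append(toCheck, d)
		}
	}
	ba.lock.Unlock()

	missing, err := ba.BlobAccess.FindMissing(ctx, toCheck)
	if err != nil {
		return nil, err
	}

	// Digests that were checked and not reported missing are present;
	// remember that for a short amount of time.
	isMissing := make(map[digest.Digest]bool, len(missing))
	for _, d := range missing {
		isMissing[d] = true
	}
	ba.lock.Lock()
	for _, d := range toCheck {
		if !isMissing[d] {
			ba.presence[d] = now.Add(ba.ttl)
		}
	}
	ba.lock.Unlock()
	return missing, nil
}
```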