Packed-refs v2 Part III: trailing table of contents in chunk-format #25

The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Since the commit-graph and multi-pack-index files both use this trailing hash, the chunk-format API uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. If we wish to use the chunk-format API to upgrade other file types, then this hashing is a performance penalty that might not be worth the benefit of a trailing hash. For example, if we create a chunk-format version of the packed-refs file, then the file format could shrink by using raw object IDs instead of hexadecimal representations in ASCII. That reduction in size is not enough to counteract the performance penalty of hashing the file contents. In cases such as deleting a reference that appears in the packed-refs file, that write-time performance is critical. This is in contrast to the commit-graph and multi-pack-index files which are mainly updated in non-critical paths such as background maintenance. One way to allow future chunked formats to not suffer this penalty would be to create an abstraction layer around the 'struct hashfile' using a vtable of function pointers. This would allow placing a different representation in place of the hashfile. This option would be cumbersome for a few reasons. First, the hashfile's buffered writes are already highly optimized and would need to be duplicated in another code path. The second is that the chunk-format API calls the chunk_write_fn pointers using a hashfile. If we change that to an abstraction layer, then those that _do_ use the hashfile API would need to change all of their instances of hashwrite(), hashwrite_be32(), and others to use the new abstraction layer. Instead, this change opts for a simpler change. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. For the commit-graph and multi-pack-index file, it may be possible to allow the null hash without incrementing the file format version, since it technically fits the structure of the file format. The only issue is that older versions would trigger a failure during 'git fsck'. For these file formats, we may want to delay such a change until it is justified. However, the index file is written in critical paths. It is also frequently updated, so corruption at rest is less likely to be an issue than in those other file formats. This could be a good candidate to create an option that skips the hashing operation. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] microsoft@21fed2d Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API. Disabling the hashwrite can be particularly helpful when the write operation is in a critical path. One such critical path is the writing of the index. This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster. Following a similar approach to one used in the microsoft/git fork [1], add a new config option that allows disabling this hashing during the index write. The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash. [1] microsoft@21fed2d While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure. Using this null hash will still allow Git operations to function across older versions. The one exception is 'git fsck' which checks the hash of the index file. Here, we disable this check if the trailing hash is all zeroes. We add a warning to the config option that this may cause undesirable behavior with older Git versions. As a quick comparison, I tested 'git update-index --force-write' with and without index.computHash=false on a copy of the Linux kernel repository. Benchmark 1: with hash Time (mean ± σ): 46.3 ms ± 13.8 ms [User: 34.3 ms, System: 11.9 ms] Range (min … max): 34.3 ms … 79.1 ms 82 runs Benchmark 2: without hash Time (mean ± σ): 26.0 ms ± 7.9 ms [User: 11.8 ms, System: 14.2 ms] Range (min … max): 16.3 ms … 42.0 ms 69 runs Summary 'without hash' ran 1.78 ± 0.76 times faster than 'with hash' These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

Git's reference storage is critical to its function. Creating new storage formats for references requires adding an extension. This prevents third-party tools that do not understand that format from operating incorrectly on the repository. This makes updating ref formats more difficult than other optional indexes, such as the commit-graph or multi-pack-index. However, there are a number of potential ref storage enhancements that are underway or could be created. Git needs an established mechanism for coordinating between these different options. The first obvious format update is the reftable format as documented in Documentation/technical/reftable.txt. This format has much of its implementation already in Git, but its connection as a ref backend is not complete. This change is similar to some changes within one of the patches intended for the reftable effort [1]. [1] https://lore.kernel.org/git/pull.1215.git.git.1644351400761.gitgitgadget@gmail.com/ However, this change makes a distinct strategy change from the one recommended by reftable. Here, the extensions.refFormat extension is provided as a multi-valued list. In the reftable RFC, the extension has a single value, "files" or "reftable" and explicitly states that this should not change after 'git init' or 'git clone'. The single-valued approach has some major drawbacks, including the idea that the "files" backend cannot coexist with the "reftable" backend at the same time. In this way, it would not be possible to create a repository that can write loose references and combine them into a reftable in the background. With the multi-valued approach, we could integrate reftable as a drop-in replacement for the packed-refs file and allow that to be a faster way to do the integration since the test suite would only need updates when the test is explicitly testing packed-refs. When upgrading a repository from the "files" backend to the "reftable" backend, it can help to have a transition period where both are present, then finally removing the "files" backend after all loose refs are collected into the reftable. But the reftable is not the only approach available. One obvious improvement could be a new file format version for the packed-refs file. Its current plaintext-based format is inefficient due to storing object IDs as hexadecimal representations instead of in their raw format. This extra cost will get worse with SHA-256. In addition, binary searches need to guess a position and scan to find newlines for a refname entry. A structured binary format could allow for more compact representation and faster access. Adding such a format could be seen as "files-v2", but it is really "packed-v2". The reftable approach has a concept of a "stack" of reftable files. This idea would also work for a stack of packed-refs files (in v1 or v2 format). It would be helpful to describe that the refs could be stored in a stack of packed-ref files independently of whether that is in file format v1 or v2. Even in these two options, it might be helpful to indicate whether or not loose ref files are present. That is one reason to not make them appear as "files-v2" or "files-v3" options in a single-valued extension. Even as "packed-v2" or "packed-v3" options, this approach would require third-party tools to understand the "v2" version if they want to support the "v3" options. Instead, by splitting the format from the layout, we can allow third-party tools to integrate only with the most-desired format options. For these reasons, this change is defining the extensions.refFormat extension as well as how the two existing values interact. By default, Git will assume "files" and "packed" in the list. If any other value is provided, then the extension is marked as unrecognized. Add tests that check the behavior of extensions.refFormat, both in that it requires core.repositoryFormatVersion=1, and Git will refuse to work with an unknown value of the extension. There is a gap in the current implementation, though. What happens if exactly one of "files" or "packed" is provided? The presence of only one would imply that the other is not available. A later change can communicate the list contents to the repository struct and then the reference backend could ignore one of these two layers. Specifically, having only "files" would mean that Git should not read or write the packed-refs file and instead only read and write loose ref files. By contrast, having only "packed" would mean that Git should not read or write loose ref files and instead always update the packed-refs file on every ref update. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the extensions.refFormat config option. It is a multi-valued config option that currently understands "files" and "packed", with both values assumed by default. If any value is provided explicitly, this default is ignored and the provided settings are used instead. The multi-valued nature of this extension presents a way to allow a user to specify that they never want a packed-refs file (only use "files") or that they never want loose reference files (only use "packed"). However, that functionality is not currently connected. Before actually modifying the files backend to understand these extension settings, do the basic wiring that connects the extensions.refFormat parsing to the creation of the ref backend. A future change will actually change the ref backend initialization based on these settings, but this communication of the extension is sufficiently complicated to be worth an isolated change. For now, also forbid the setting of only "packed". This is done by redirecting the choice of backend to the packed backend when that selection is made. A later change will make the "files"-only extension value ignore the packed backend. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

The extensions.refFormat extension is a multi-valued config that specifies which ref formats are available to the current repository. By default, Git assumes the list of "files" and "packed", unless there is at least one of these extensions specified. With the current values, it is possible for a user to specify only "files" or only "packed". The only-"packed" option was already ruled as invalid since Git's current code has too many places that require a loose reference. This could change in the future. However, we can now allow the user to specify extensions.refFormat=files alone, making it impossible to create a packed-refs file (or to read one that might exist). Signed-off-by: Derrick Stolee <derrickstolee@github.com>

Even though the commit-graph and multi-pack-index file formats specify a number of chunks in their header information, this is optional. The table of contents terminates with a null chunk ID, which can be used instead. The extra value is helpful for some checks, but is ultimately not necessary for the format. This will be important in some future formats. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

It will be helpful to allow a trailing table of contents when writing some file types with the chunk-format API. The main reason is that it allows dynamically computing the chunk sizes while writing the file. This can use fewer resources than precomputing all chunk sizes in advance. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

As a preparatory step to allowing trailing table of contents, store the offsets of each chunk as we write them. This replaces an existing use of a local variable, but the stored value will be used in the next change. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

The existing chunk formats use the table of contents at the beginning of the file. This is intended as a way to speed up the initial loading of the file, but comes at a cost during writes. Each example needs to fully compute how big each chunk will be in advance, which usually requires storing the full file contents in memory. Future file formats may want to use the chunk format API in cases where the writing stage is critical to performance, so we may want to stream updates from an existing file and then only write the table of contents at the end. Add a new 'flags' parameter to write_chunkfile() that allows this behavior. When this is specified, the defensive programming that checks that the chunks are written with the precomputed sizes is disabled. Then, the table of contents is written in reverse order at the end of the hashfile, so a parser can read the chunk list starting from the end of the file (minus the hash). The parsing of these table of contents will come in a later change. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

The new read_trailing_table_of_contents() mimics read_table_of_contents() except that it reads the table of contents in reverse from the end of the given hashfile. The file is given as a memory-mapped section of memory and a size. Automatically calculate the start of the trailing hash and read the table of contents in revers from that position. The errors come along from those in read_table_of_contents(). The one exception is that the chunk_offset cannot be checked as going into the table of contents since we do not have that length automatically. That may have some surprising results for some narrow forms of corruption. However, we do still limit the size to the size of the file plus the part of the table of contents read so far. At minimum, the given sizes can be used to limit parsing within the file itself. Signed-off-by: Derrick Stolee <derrickstolee@github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Packed-refs v2 Part III: trailing table of contents in chunk-format #25

Packed-refs v2 Part III: trailing table of contents in chunk-format #25

Commits on Oct 24, 2022

Commits on Oct 27, 2022

Commits on Nov 1, 2022