-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pack-objects: add --path-walk option for better deltas #1813
base: master
Are you sure you want to change the base?
pack-objects: add --path-walk option for better deltas #1813
Commits on Oct 8, 2024
-
path-walk: introduce an object walk by path
In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING objects and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 98bdc94 - Browse repository at this point
Copy the full SHA 98bdc94View commit details -
t6601: add helper for testing path-walk API
Add some tests based on the current behavior, doing interesting checks for different sets of branches, ranges, and the --boundary option. This sets a baseline for the behavior and we can extend it as new options are introduced. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for a00ab0c - Browse repository at this point
Copy the full SHA a00ab0cView commit details -
path-walk: allow consumer to specify object types
We add the ability to filter the object types in the path-walk API so the callback function is called fewer times. This adds the ability to ask for the commits in a list, as well. Future changes will add the ability to visit annotated tags. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 14375d1 - Browse repository at this point
Copy the full SHA 14375d1View commit details
Commits on Oct 9, 2024
-
path-walk: allow visiting tags
In anticipation of using the path-walk API to analyze tags or include them in a pack-file, add the ability to walk the tags that were included in the revision walk. When these tag objects point to blobs or trees, we need to make sure those objects are also visited. Treat tagged trees as root trees, but put the tagged blobs in their own category. Be careful about objects that are referred to by multiple references. Co-authored-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for c321f58 - Browse repository at this point
Copy the full SHA c321f58View commit details -
revision: create mark_trees_uninteresting_dense()
The sparse tree walk algorithm was created in d5d2e93 (revision: implement sparse algorithm, 2019-01-16) and involves using the mark_trees_uninteresting_sparse() method. This method takes a repository and an oidset of tree IDs, some of which have the UNINTERESTING flag and some of which do not. Create a method that has an equivalent set of preconditions but uses a "dense" walk (recursively visits all reachable trees, as long as they have not previously been marked UNINTERESTING). This is an important difference from mark_tree_uninteresting(), which short-circuits if the given tree has the UNINTERESTING flag. A use of this method will be added in a later change, with a condition set whether the sparse or dense approach should be used. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 6e89fb2 - Browse repository at this point
Copy the full SHA 6e89fb2View commit details -
path-walk: add prune_all_uninteresting option
This option causes the path-walk API to act like the sparse tree-walk algorithm implemented by mark_trees_uninteresting_sparse() in list-objects.c. Starting from the commits marked as UNINTERESTING, their root trees and all objects reachable from those trees are UNINTERSTING, at least as we walk path-by-path. When we reach a path where all objects associated with that path are marked UNINTERESTING, then do no continue walking the children of that path. We need to be careful to pass the UNINTERESTING flag in a deep way on the UNINTERESTING objects before we start the path-walk, or else the depth-first search for the path-walk API may accidentally report some objects as interesting. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 238d7d9 - Browse repository at this point
Copy the full SHA 238d7d9View commit details -
pack-objects: extract should_attempt_deltas()
This will be helpful in a future change that introduces a new way to compute deltas. Be careful to preserve the nr_deltas counting logic in the existing method, but take the rest of the logic wholesale. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 3fdb57e - Browse repository at this point
Copy the full SHA 3fdb57eView commit details -
pack-objects: add --path-walk option
In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. The current implementation performs delta calculations while walking objects, which is not ideal for a few reasons. First, this will cause the "Enumerating objects" phase to be much longer than usual. Second, it does not take advantage of threading during the path-scoped delta calculations. Even with this lack of threading, the path-walk option is sometimes faster than the usual approach. Future changes will refactor this code to allow for threading, but that complexity is deferred until later to keep this patch as simple as possible. This new walk is incompatible with some features and is ignored by others: * Object filters are not currently integrated with the path-walk API, such as sparse-checkout or tree depth. A blobless packfile could be integrated easily, but that is deferred for later. * Server-focused features such as delta islands, shallow packs, and using a bitmap index are incompatible with the path-walk API. * The path walk API is only compatible with the --revs option, not taking object lists or pack lists over stdin. These alternative ways to specify the objects currently ignores the --path-walk option without even a warning. Future changes will create performance tests that demonstrate the power of this approach. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for a0475c7 - Browse repository at this point
Copy the full SHA a0475c7View commit details -
pack-objects: update usage to match docs
The t0450 test script verifies that builtin usage matches the synopsis in the documentation. Adjust the builtin to match and then remove 'git pack-objects' from the exception list. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 73c8b61 - Browse repository at this point
Copy the full SHA 73c8b61View commit details -
p5313: add performance tests for --path-walk
The previous change added a --path-walk option to 'git pack-objects'. Create a performance test that demonstrates the time and space benefits of the feature. In order to get an appropriate comparison, we need to avoid reusing deltas and recompute them from scratch. Compare the creation of a thin pack representing a small push and the creation of a relatively large non-thin pack. Running on my copy of the Git repository results in this data: Test this tree --------------------------------------------------------- 5313.2: thin pack 0.01(0.00+0.00) 5313.3: thin pack size 1.1K 5313.4: thin pack with --path-walk 0.01(0.01+0.00) 5313.5: thin pack size with --path-walk 1.1K 5313.6: big pack 2.52(6.59+0.38) 5313.7: big pack size 14.1M 5313.8: big pack with --path-walk 4.90(5.76+0.26) 5313.9: big pack size with --path-walk 13.2M Note that the timing is slower because there is no threading in the --path-walk case (yet). The cases where the --path-walk option really shines is when the default name-hash is overwhelmed with collisions. An open source example can be found in the microsoft/fluentui repo [1] at a certain commit [2]. [1] https://github.com/microsoft/fluentui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 Running the tests on this repo results in the following output: Test this tree ---------------------------------------------------------- 5313.2: thin pack 0.28(0.38+0.02) 5313.3: thin pack size 1.2M 5313.4: thin pack with --path-walk 0.08(0.06+0.01) 5313.5: thin pack size with --path-walk 18.4K 5313.6: big pack 4.05(29.62+0.43) 5313.7: big pack size 20.0M 5313.8: big pack with --path-walk 5.99(9.06+0.24) 5313.9: big pack size with --path-walk 16.4M Notice in particular that in the small thin pack, the time performance has improved from 0.28s to 0.08s and this is likely due to the improved size of the resulting pack: 18.4K instead of 1.2M. Finally, running this on a copy of the Linux kernel repository results in these data points: Test this tree ----------------------------------------------------------- 5313.2: thin pack 0.00(0.00+0.00) 5313.3: thin pack size 5.8K 5313.4: thin pack with --path-walk 0.00(0.01+0.00) 5313.5: thin pack size with --path-walk 5.8K 5313.6: big pack 24.39(65.81+1.31) 5313.7: big pack size 155.7M 5313.8: big pack with --path-walk 41.07(60.69+0.68) 5313.9: big pack size with --path-walk 150.8M Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 21dc372 - Browse repository at this point
Copy the full SHA 21dc372View commit details -
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the implementation of the --path-walk implementation, especially in conjunction with test such as: - t0411-clone-from-partial.sh : One test fetches from a repo that does not have the boundary objects. This causes the path-based walk to fail. Disable the variable for this test. - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo without a boundary object. - t5310-pack-bitmaps.sh : One test compares the case when packing with bitmaps to the case when packing without them. Since we disable the test variable when writing bitmaps, this causes a difference in the object list (the --path-walk option adds an extra object). Specify --no-path-walk in both processes for the comparison. Another test checks for a specific delta base, but when computing dynamically without using bitmaps, the base object it too small to be considered in the delta calculations so no base is used. - t5316-pack-delta-depth.sh : This script cares about certain delta choices and their chain lengths. The --path-walk option changes how these chains are selected, and thus changes the results of this test. - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk. - t5332-multi-pack-reuse.sh : This test verifies that the preferred pack is used for delta reuse when possible. The --path-walk option is not currently aware of the preferred pack at all, so finds a different delta base. - t7406-submodule-update.sh : When using the variable, the --depth option collides with the --path-walk feature, resulting in a warning message. Disable the variable so this warning does not appear. I want to call out one specific test change that is only temporary: - t5530-upload-pack-error.sh : One test cares specifically about an "unable to read" error message. Since the current implementation performs delta calculations within the path-walk API callback, a different "unable to get size" error message appears. When this is changed in a future refactoring, this test change can be reverted. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 6f96b1c - Browse repository at this point
Copy the full SHA 6f96b1cView commit details -
repack: add --path-walk option
Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. In my copy of the Git repository, the new tests in p5313 show these results: Test this tree ------------------------------------------------------------- 5313.10: repack 27.88(150.23+2.70) 5313.11: repack size 228.2M 5313.12: repack with --path-walk 134.59(148.77+0.81) 5313.13: repack size with --path-walk 209.7M Note that the 'git pack-objects --path-walk' feature is not integrated with threads. Look forward to a future change that will introduce threading to improve the time performance of this feature with equivalent space performance. For the microsoft/fluentui repo [1] had some interesting aspects for the previous tests in p5313, so here are the repack results: Test this tree ------------------------------------------------------------- 5313.10: repack 91.76(680.94+2.48) 5313.11: repack size 439.1M 5313.12: repack with --path-walk 110.35(130.46+0.74) 5313.13: repack size with --path-walk 155.3M [1] https://github.com/microsoft/fluentui Here, we see the significant improvement of a full repack using this strategy. The name-hash collisions in this repo cause the space problems. Those collisions also cause the repack command to spend a lot of cycles trying to find delta bases among files that are not actually very similar, so the lack of threading with the --path-walk feature is less pronounced in the process time. For the Linux kernel repository, we have these stats: Test this tree --------------------------------------------------------------- 5313.10: repack 553.61(1929.41+30.31) 5313.11: repack size 2.5G 5313.12: repack with --path-walk 1777.63(2044.16+7.47) 5313.13: repack size with --path-walk 2.5G This demonstrates that the --path-walk feature does not always present measurable improvements, especially in cases where the name-hash has very few collisions. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 834c9ea - Browse repository at this point
Copy the full SHA 834c9eaView commit details -
repack: update usage to match docs
The t0450 test script verifies that the builtin usage matches the synopsis in the documentation. Update 'git repack' to match and remove it from the exception list. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 6ef8d67 - Browse repository at this point
Copy the full SHA 6ef8d67View commit details -
pack-objects: enable --path-walk via config
Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 1db90e3 - Browse repository at this point
Copy the full SHA 1db90e3View commit details
Commits on Oct 18, 2024
-
scalar: enable path-walk during push via config
Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 0f3040b - Browse repository at this point
Copy the full SHA 0f3040bView commit details -
pack-objects: refactor path-walk delta phase
Previously, the --path-walk option to 'git pack-objects' would compute deltas inline with the path-walk logic. This would make the progress indicator look like it is taking a long time to enumerate objects, and then very quickly computed deltas. Instead of computing deltas on each region of objects organized by tree, store a list of regions corresponding to these groups. These can later be pulled from the list for delta compression before doing the "global" delta search. This presents a new progress indicator that can be used in tests to verify that this stage is happening. The current implementation is not integrated with threads, but could be done in a future update. Since we do not attempt to sort objects by size until after exploring all trees, we can remove the previous change to t5530 due to a different error message appearing first. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 030d8ec - Browse repository at this point
Copy the full SHA 030d8ecView commit details -
pack-objects: thread the path-based compression
Adapting the implementation of ll_find_deltas(), create a threaded version of the --path-walk compression step in 'git pack-objects'. This involves adding a 'regions' member to the thread_params struct, allowing each thread to own a section of paths. We can simplify the way jobs are split because there is no value in extending the batch based on name-hash the way sections of the object entry array are attempted to be grouped. We re-use the 'list_size' and 'remaining' items for the purpose of borrowing work in progress from other "victim" threads when a thread has finished its batch of work more quickly. Using the Git repository as a test repo, the p5313 performance test shows that the resulting size of the repo is the same, but the threaded implementation gives gains of varying degrees depending on the number of objects being packed. (This was tested on a 16-core machine.) Test HEAD~1 HEAD --------------------------------------------------------------- 5313.2: thin pack 0.01 0.01 +0.0% 5313.4: thin pack with --path-walk 0.01 0.01 +0.0% 5313.6: big pack 2.54 2.60 +2.4% 5313.8: big pack with --path-walk 4.70 3.09 -34.3% 5313.10: repack 28.75 28.55 -0.7% 5313.12: repack with --path-walk 108.55 46.14 -57.5% On the microsoft/fluentui repo, where the --path-walk feature has been shown to be more effective in space savings, we get these results: Test HEAD~1 HEAD ---------------------------------------------------------------- 5313.2: thin pack 0.39 0.40 +2.6% 5313.4: thin pack with --path-walk 0.08 0.07 -12.5% 5313.6: big pack 4.15 4.15 +0.0% 5313.8: big pack with --path-walk 6.41 3.21 -49.9% 5313.10: repack 90.69 90.83 +0.2% 5313.12: repack with --path-walk 108.23 49.09 -54.6% Signed-off-by: Derrick Stolee <stolee@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for fddc320 - Browse repository at this point
Copy the full SHA fddc320View commit details