-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse index: integrate with commit and checkout #973
Sparse index: integrate with commit and checkout #973
Conversation
92e5015
to
61dbce0
Compare
46ca150
to
97dcf77
Compare
97dcf77
to
b2a17ad
Compare
b675893
to
ed7b8a0
Compare
b2a17ad
to
76bd8ec
Compare
See gitgitgadget#973 for the upstream submission. This PR follows #374 which does the heavy lifting. There is one known issue with this integration: directory/file conflicts at the sparse-checkout boundary. This will not affect our dogfooders so we should ignore that bug for the purpose of the `features/sparse-index` branch. I will fix that upstream.
0300d2c
to
8710fee
Compare
76bd8ec
to
1d74484
Compare
/submit |
Submitted as pull.973.git.1624932786.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
This branch is now known as |
This patch series was integrated into seen via git@c2bb0c9. |
This patch series was integrated into seen via git@0c82e10. |
This patch series was integrated into seen via git@018ec5d. |
The branch |
This patch series was integrated into seen via git@9253f69. |
This patch series was integrated into seen via git@5a3b62b. |
This patch series was integrated into seen via git@ece6832. |
This patch series was integrated into seen via git@8f19d32. |
There was a status update in the "Cooking" section about the branch "git checkout" and "git commit" learn to work without unnecessarily expanding sparse indexes. Comments? |
This patch series was integrated into seen via git@ee0efe8. |
This patch series was integrated into seen via git@79cb81b. |
This patch series was integrated into seen via git@96933de. |
Update 'git commit' to allow using the sparse-index in memory without expanding to a full one. The only place that had an ensure_full_index() call was in cache_tree_update(). The recursive algorithm for update_one() was already updated in 2de37c5 (cache-tree: integrate with sparse directory entries, 2021-03-03) to handle sparse directory entries in the index. Most of this change involves testing different command-line options that allow specifying which on-disk changes should be included in the commit. This includes no options (only take currently-staged changes), -a (take all tracked changes), and --include (take a list of specific changes). To simplify testing that these options do not expand the index, update the test that previously verified that 'git status' does not expand the index with a helper method, ensure_not_expanded(). This allows 'git commit' to operate much faster when the sparse-checkout cone is much smaller than the full list of files at HEAD. Here are the relevant lines from p2000-sparse-operations.sh: Test HEAD~1 HEAD ---------------------------------------------------------------------------------- 2000.14: git commit -a -m A (full-v3) 0.35(0.26+0.06) 0.36(0.28+0.07) +2.9% 2000.15: git commit -a -m A (full-v4) 0.32(0.26+0.05) 0.34(0.28+0.06) +6.3% 2000.16: git commit -a -m A (sparse-v3) 0.63(0.59+0.06) 0.04(0.05+0.05) -93.7% 2000.17: git commit -a -m A (sparse-v4) 0.64(0.59+0.08) 0.04(0.04+0.04) -93.8% It is important to compare the full-index case to the sparse-index case, so the improvement for index version v4 is actually an 88% improvement in this synthetic example. In a real repository with over two million files at HEAD and 60,000 files in the sparse-checkout definition, the time for 'git commit -a' went from 2.61 seconds to 134ms. I compared this to the result if the index only contained the paths in the sparse-checkout definition and found the theoretical optimum to be 120ms, so the out-of-cone paths only add a 12% overhead. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When some commands run with command_requires_full_index=1, then the index can get in a state where the in-memory cache tree is actually equal to the sparse index's cache tree instead of the full one. This results in incorrect entry_count values. By clearing the cache tree before converting to sparse, we avoid this issue. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Previous changes did the necessary improvements to unpack-trees.c and diff-lib.c in order to modify a sparse index based on its comparision with a tree. The only remaining work is to remove some ensure_full_index() calls and add tests that verify that the index is not expanded in our interesting cases. Include 'switch' and 'restore' in these tests, as they share a base implementation with 'checkout'. Here are the relevant performance results from p2000-sparse-operations.sh: Test HEAD~1 HEAD -------------------------------------------------------------------------------- 2000.18: git checkout -f - (full-v3) 0.49(0.43+0.03) 0.47(0.39+0.05) -4.1% 2000.19: git checkout -f - (full-v4) 0.45(0.37+0.06) 0.42(0.37+0.05) -6.7% 2000.20: git checkout -f - (sparse-v3) 0.76(0.71+0.07) 0.04(0.03+0.04) -94.7% 2000.21: git checkout -f - (sparse-v4) 0.75(0.72+0.04) 0.05(0.06+0.04) -93.3% It is important to compare the full index case to the sparse index case, as the previous results for the sparse index were inflated by the index expansion. For index v4, this is an 88% improvement. On an internal repository with over two million paths at HEAD and a sparse-checkout definition containing ~60,000 of those paths, 'git checkout' went from 3.5s to 297ms with this change. The theoretical optimum where only those ~60,000 paths exist was 275ms, so the extra sparse directory entries contribute a 22ms overhead. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Add new branches to the test repo that demonstrate directory/file conflicts in different ways. Since the directory 'folder1/' has adjacent files 'folder1-', 'folder1.txt', and 'folder10' it causes searches for 'folder1/' to land in a different place in the index than a search for 'folder1'. This causes a change in behavior when working with the df-conflict-1 and df-conflict-2 branches, whose only difference is that the first uses 'folder1' as the conflict and the other uses 'folder2' which does not have these adjacent files. We can extend two tests that compare the behavior across different 'git checkout' commands, and we see already that the behavior will be different in some cases and not in others. The difference between the two test loops is that one uses 'git reset --hard' between iterations. Further, we isolate the behavior of creating a staged change within a directory and then checking out a branch where that directory is replaced with a file. A full checkout behaves differently across these two cases, while a sparse-checkout cone behaves consistently. In both cases, the behavior is wrong. In one case, the staged change is dropped entirely. The other case the staged change is kept, replacing the file at that location, but none of the other files in the directory are kept. Likely, the correct behavior in this case is to reject the checkout and report the conflict, leaving HEAD in its previous location. None of the cases behave this way currently. Use comments to demonstrate that the tested behavior is only a documentation of the current, incorrect behavior to ensure we do not _accidentally_ change it. Instead, we would prefer to change it on purpose with a future change. At this point, the sparse-index does not handle these 'git checkout' commands correctly. Or rather, it _does_ reject the 'git checkout' when we have the staged change, but for the wrong reason. It also rejects the 'git checkout' commands when there is no staged change and we want to replace a directory with a file. A fix for that unstaged case will follow in the next change, but that will make the sparse-index agree with the full checkout case in these documented incorrect behaviors. Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When running unpack_trees() with a sparse index, we attempt to operate on the index without expanding the sparse directory entries. Thus, we operate by manipulating entire directories and passing them to the unpack function. In the case of the 'git checkout' command, this is the twoway_merge() function. There are several cases in twoway_merge() that handle different situations. One new one to add is the case of a directory/file conflict where the directory is sparse. Before the sparse index, such a conflict would appear as a list of file additions and deletions. Now, twoway_merge() initializes 'current', 'oldtree', and 'newtree' from src[0], src[1], and src[2], then sets 'oldtree' to NULL because it is equal to the df_conflict_entry. The way to determine that we have a directory/file conflict is to test that 'current' and 'newtree' disagree on being sparse directory entries. When we are in this case, we want to resolve the situation by calling merged_entry(). This allows replacing the 'current' entry with the 'newtree' entry. This is important for cases where we want to run 'git checkout' across the conflict and have the new HEAD represent the new file type at that path. The first NEEDSWORK comment dropped in t1092 demonstrates this necessary behavior. However, we still are in a confusing state when 'current' corresponds to a staged change within a sparse directory that is not present at HEAD. This should be atypical, because it requires adding a change outside of the sparse-checkout cone, but it is possible. Since we are unable to determine that this is a staged change within twoway_merge(), we cannot add a case to reject the merge at this point. I believe this is due to the use of df_conflict_entry in the place of 'oldtree' instead of using the valud at HEAD, which would provide some perspective to this decision. Any change that would allow this differentiation for staged entries would need to involve information further up in unpack_trees(). That work should be done, sometime, because we are further confusing the behavior of a directory/file conflict when staging a change in the directory. The two cases 'checkout behaves oddly with df-conflict-?' in t1092 demonstrate that even without a sparse-checkout, Git is not consistent in its behavior. Neither of the two options seems correct, either. This change makes the sparse-index behave differently than the typcial sparse-checkout case, but it does match the full checkout behavior in the df-conflict-2 case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
8710fee
to
71e3015
Compare
This patch series was integrated into seen via git@2e683b4. |
/submit |
Submitted as pull.973.v2.git.1626812081.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
This patch series was integrated into seen via git@eedf36e. |
This patch series was integrated into seen via git@aa77b60. |
There was a status update in the "Cooking" section about the branch "git checkout" and "git commit" learn to work without unnecessarily expanding sparse indexes. Will merge to 'next'. |
@@ -95,6 +95,25 @@ test_expect_success 'setup' ' | |||
git add . && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 20, 2021 at 1:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When running unpack_trees() with a sparse index, we attempt to operate
> on the index without expanding the sparse directory entries. Thus, we
> operate by manipulating entire directories and passing them to the
> unpack function. In the case of the 'git checkout' command, this is the
> twoway_merge() function.
>
> There are several cases in twoway_merge() that handle different
> situations. One new one to add is the case of a directory/file conflict
> where the directory is sparse. Before the sparse index, such a conflict
> would appear as a list of file additions and deletions. Now,
> twoway_merge() initializes 'current', 'oldtree', and 'newtree' from
> src[0], src[1], and src[2], then sets 'oldtree' to NULL because it is
> equal to the df_conflict_entry. The way to determine that we have a
> directory/file conflict is to test that 'current' and 'newtree' disagree
> on being sparse directory entries.
>
> When we are in this case, we want to resolve the situation by calling
> merged_entry(). This allows replacing the 'current' entry with the
> 'newtree' entry. This is important for cases where we want to run 'git
> checkout' across the conflict and have the new HEAD represent the new
> file type at that path. The first NEEDSWORK comment dropped in t1092
> demonstrates this necessary behavior.
>
> However, we still are in a confusing state when 'current' corresponds to
> a staged change within a sparse directory that is not present at HEAD.
> This should be atypical, because it requires adding a change outside of
> the sparse-checkout cone, but it is possible. Since we are unable to
> determine that this is a staged change within twoway_merge(), we cannot
> add a case to reject the merge at this point. I believe this is due to
> the use of df_conflict_entry in the place of 'oldtree' instead of using
> the valud at HEAD, which would provide some perspective to this
> decision. Any change that would allow this differentiation for staged
> entries would need to involve information further up in unpack_trees().
>
> That work should be done, sometime, because we are further confusing the
> behavior of a directory/file conflict when staging a change in the
> directory. The two cases 'checkout behaves oddly with df-conflict-?' in
> t1092 demonstrate that even without a sparse-checkout, Git is not
> consistent in its behavior. Neither of the two options seems correct,
> either. This change makes the sparse-index behave differently than the
> typcial sparse-checkout case, but it does match the full checkout
> behavior in the df-conflict-2 case.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++------------
> unpack-trees.c | 11 +++++++++++
> 2 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 79b4a8ce199..91e30d6ec22 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -396,14 +396,14 @@ test_expect_success 'diff with renames and conflicts' '
> done
> '
>
> -# NEEDSWORK: the sparse-index fails to move HEAD across a directory/file
> -# conflict such as when checking out df-conflict-1 and df-conflict2.
> test_expect_success 'diff with directory/file conflicts' '
> init_repos &&
>
> for branch in rename-out-to-out \
> rename-out-to-in \
> rename-in-to-out \
> + df-conflict-1 \
> + df-conflict-2 \
> fd-conflict
> do
> git -C full-checkout reset --hard &&
> @@ -667,10 +667,7 @@ test_expect_success 'checkout behaves oddly with df-conflict-1' '
> git -C sparse-checkout checkout df-conflict-1 \
> 1>sparse-checkout-out \
> 2>sparse-checkout-err &&
> -
> - # NEEDSWORK: the sparse-index case refuses to change HEAD here,
> - # but for the wrong reason.
> - test_must_fail git -C sparse-index checkout df-conflict-1 \
> + git -C sparse-index checkout df-conflict-1 \
> 1>sparse-index-out \
> 2>sparse-index-err &&
>
> @@ -684,7 +681,11 @@ test_expect_success 'checkout behaves oddly with df-conflict-1' '
> test_cmp expect full-checkout-out &&
> test_cmp expect sparse-checkout-out &&
>
> + # The sparse-index reports no output
> + test_must_be_empty sparse-index-out &&
> +
> # stderr: Switched to branch df-conflict-1
> + test_cmp full-checkout-err sparse-checkout-err &&
> test_cmp full-checkout-err sparse-checkout-err
> '
>
> @@ -719,17 +720,15 @@ test_expect_success 'checkout behaves oddly with df-conflict-2' '
> git -C sparse-checkout checkout df-conflict-2 \
> 1>sparse-checkout-out \
> 2>sparse-checkout-err &&
> -
> - # NEEDSWORK: the sparse-index case refuses to change HEAD
> - # here, but for the wrong reason.
> - test_must_fail git -C sparse-index checkout df-conflict-2 \
> + git -C sparse-index checkout df-conflict-2 \
> 1>sparse-index-out \
> 2>sparse-index-err &&
>
> # The full checkout deviates from the df-conflict-1 case here!
> # It drops the change to folder1/larger-content and leaves the
> - # folder1 path as-is on disk.
> + # folder1 path as-is on disk. The sparse-index behaves the same.
> test_must_be_empty full-checkout-out &&
> + test_must_be_empty sparse-index-out &&
>
> # In the sparse-checkout case, the checkout deletes the folder1
> # file and adds the folder1/larger-content file, leaving all other
> @@ -741,7 +740,8 @@ test_expect_success 'checkout behaves oddly with df-conflict-2' '
> test_cmp expect sparse-checkout-out &&
>
> # Switched to branch df-conflict-1
> - test_cmp full-checkout-err sparse-checkout-err
> + test_cmp full-checkout-err sparse-checkout-err &&
> + test_cmp full-checkout-err sparse-index-err
> '
>
> test_done
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 0a5135ab397..309c1352f5d 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -2619,6 +2619,17 @@ int twoway_merge(const struct cache_entry * const *src,
> same(current, oldtree) && !same(current, newtree)) {
> /* 20 or 21 */
> return merged_entry(newtree, current, o);
> + } else if (current && !oldtree && newtree &&
> + S_ISSPARSEDIR(current->ce_mode) != S_ISSPARSEDIR(newtree->ce_mode) &&
> + ce_stage(current) == 0) {
> + /*
> + * This case is a directory/file conflict across the sparse-index
> + * boundary. When we are changing from one path to another via
> + * 'git checkout', then we want to replace one entry with another
> + * via merged_entry(). If there are staged changes, then we should
> + * reject the merge instead.
> + */
> + return merged_entry(newtree, current, o);
> } else
> return reject_merge(current, o);
> }
> --
I'm still a bit unhappy with the unpack-trees.c change (I wonder if
having "path/" vs "path" is going to make D/F conflicts hard to handle
and whether we need to make unpack_trees do something special to make
both paths be considered at the same time with one call to
twoway_merge() instead of two in order to fix this), BUT I think
you've done a really good job of documenting it and pointing out that
unpack_trees() messes up even without sparse checkouts on D/F
conflicts (though in a different way). I think you've documented it
well enough, and argued about the likelihood of issues well enough,
that it makes sense to proceed and circle back and fix this up later.
On the Git mailing list, Elijah Newren wrote (reply to this):
|
This patch series was integrated into seen via git@f0a2666. |
This patch series was integrated into seen via git@836580b. |
This patch series was integrated into seen via git@6df4569. |
This patch series was integrated into seen via git@246aeea. |
This patch series was integrated into seen via git@a6376fc. |
This patch series was integrated into seen via git@098e4d6. |
There was a status update in the "Cooking" section about the branch "git checkout" and "git commit" learn to work without unnecessarily expanding sparse indexes. Will merge to 'next'. |
This patch series was integrated into seen via git@f922306. |
This patch series was integrated into seen via git@f984070. |
This patch series was integrated into seen via git@fe9133f. |
There was a status update in the "Cooking" section about the branch "git checkout" and "git commit" learn to work without unnecessarily expanding sparse indexes. Will merge to 'next'. |
This patch series was integrated into seen via git@346301c. |
This patch series was integrated into next via git@52ed1b0. |
This series extends our integration of sparse-index to 'git commit' and 'git checkout'.
This is based on ds/status-with-sparse-index (v7) and v2.32.0. The hard work was already done in that topic, so these changes are simple.
Recall that we have delayed our integration with 'git add' until we can work out the concerns about how to deal with pathspecs outside of the sparse-checkout definition. Those concerns might have some overlap with how 'git commit' takes a pathspec, but this seems like a rare enough case to handle here and we can be more careful with the behavior change in the next series which will integrate with
git add
.In addition to the tests that already exist in t1092, I have integrated these changes in microsoft/git and tested them against the Scalar functional tests, which go through quite a few complicated scenarios, verifying that things work the same across the full index and sparse-index cases.
Update in V2
Thanks,
-Stolee
cc: gitster@pobox.com
cc: newren@gmail.com
cc: matheus.bernardino@usp.br
cc: stolee@gmail.com