Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unpack-trees: enable fscache for sparse-checkout #2224

Merged
merged 1 commit into from
Jun 18, 2019

Conversation

derrickstolee
Copy link

When updating the skip-worktree bits in the index to align with new
values in a sparse-checkout file, Git scans the entire working
directory with lstat() calls. In a sparse-checkout, many of these
lstat() calls are for paths that do not exist.

Enable the fscache feature during this scan.

In a local test of a repo with ~2.2 million paths, updating the index
with git read-tree -m -u HEAD with a sparse-checkout file containing
only /.gitattributes improved from 2-3 minutes to 15-20 seconds.

More work could be done to stop running lstat() calls when recursing
into directories that are known to not exist.

@jeffhostetler
Copy link

On my iPad so I can’t dig too deep right now, but this looks good.

You might confirm (in other commands) that have historically used fscache
whether that usage occurs before or after this new usage (I’m assuming
that this change will be on the flow of other commands). And whether
the disable cache call flushes it and whether that later usage would just
rebuild it unnecessarily. Alternatively, make sure there are no writes to
the work dirt between your disable here and the later enable (meaning
they should have flushed it).

Can you check whether you have “preload index” turned on and
whether this code runs before or after that. IIRC that code does
something similar and I threaded it. It’s been too long since I was
in that code to remember the subtleties and overlap here.

Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions...

unpack-trees.c Outdated
@@ -1437,7 +1437,9 @@ static void mark_new_skip_worktree(struct exclude_list *el,
* 2. Widen worktree according to sparse-checkout file.
* Matched entries will have skip_wt_flag cleared (i.e. "in")
*/
enable_fscache(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to initialize the hashmap size of the FSCache with istate->cache_nr? That way, the hashmap does not have to grow a large number of times before reaching its final size.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to have improved the performance to ~6 seconds!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! 👍

clear_ce_flags(istate, select_flag, skip_wt_flag, el);
disable_fscache();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to mention in the commit message that disable_fscache() decrements a counter, and only actually discards the FSCache when that counter reaches 0. (To address one of @jeffhostetler's concerns.)

@dscho
Copy link
Member

dscho commented Jun 13, 2019

Can you check whether you have “preload index” turned on and
whether this code runs before or after that.

The most common route for this is the refresh_index() call in read-cache.c that calls preload_index(). Before returning, refresh_index() disables the FSCache again.

Meaning: the FSCache is discarded unless the refresh_index() call happens between an enable_fscache()/disable_cache() pair (they can nest, in which case disable_cache() only decrements a counter but does not discard the FSCache).

In general, it is not a safe thing to keep the FSCache, as there are many possibilities for the cached values to become stale.

@dscho
Copy link
Member

dscho commented Jun 13, 2019

In a local test of a repo with ~2.2 million paths, updating the index
with git read-tree -m -u HEAD with a sparse-checkout file containing
only /.gitattributes improved from 2-3 minutes to 15-20 seconds.

Wow, that's impressive!

More work could be done to stop running lstat() calls when recursing
into directories that are known to not exist.

Indeed. Don't we have code in the FSCache to detect that already?

@dscho
Copy link
Member

dscho commented Jun 13, 2019

In a local test of a repo with ~2.2 million paths, updating the index
with git read-tree -m -u HEAD with a sparse-checkout file containing
only /.gitattributes improved from 2-3 minutes to 15-20 seconds.

Wow, that's impressive!

Oh, you know, may I ask to include that piece of information in the commit message?

@jeffhostetler
Copy link

@dscho Yes, I added some code to fscache to know about not-found directories bac in:

64ab4538f7fa00f67398e5e16060323e8464dd19  fscache: remember not-found directories

@derrickstolee
Copy link
Author

More work could be done to stop running lstat() calls when recursing
into directories that are known to not exist.

Indeed. Don't we have code in the FSCache to detect that already?

Apparently not, as I still see the deep dir calls (/my/really/long/path) even
when the shallow dir (/my) doesn't exist. I'm looking into ways to add this
to fscache.

When updating the skip-worktree bits in the index to align with new
values in a sparse-checkout file, Git scans the entire working
directory with lstat() calls. In a sparse-checkout, many of these
lstat() calls are for paths that do not exist.

Enable the fscache feature during this scan. Since enable_fscache()
calls nest, the disable_fscache() method decrements a counter and
would only clear the cache if that counter reaches zero.

In a local test of a repo with ~2.2 million paths, updating the index
with git read-tree -m -u HEAD with a sparse-checkout file containing
only /.gitattributes improved from 2-3 minutes to ~6 seconds.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@dscho
Copy link
Member

dscho commented Jun 13, 2019

More work could be done to stop running lstat() calls when recursing
into directories that are known to not exist.

Indeed. Don't we have code in the FSCache to detect that already?

Apparently not, as I still see the deep dir calls (/my/really/long/path) even
when the shallow dir (/my) doesn't exist. I'm looking into ways to add this
to fscache.

Ah, I think I remember enough details now: when lstat()-ing a directory that does not exist, the FSCache code is supposed to add entries so that a subsequent lstat() call does not cause extra I/O. The idea can be seen here:

fse = NULL; /* non-existing directory */

and here:

git/compat/win32/fscache.c

Lines 570 to 571 in d003d72

if (!fse)
return -1;

I guess what we need is code that looks harder for non-existent directories, e.g. by calling fscache_get() recursively for all parent directories, adding placeholder entries for all of them if it hits placeholder for a non-existing directory.

@derrickstolee
Copy link
Author

I guess what we need is code that looks harder for non-existent directories, e.g. by calling fscache_get() recursively for all parent directories, adding placeholder entries for all of them if it hits placeholder for a non-existing directory.

I'm hoping that we can duplicate the logic from this section of fscache_get() that looks for the directory above a file to instead look for the parent directory of a directory:

	/* if looking for a file, check if directory listing is in cache */
	if (!fse && key->list) {
		fse = hashmap_get(&cache->map, key->list, NULL);
		if (fse) {
			/*
			 * dir entry without file entry, or dir does not
			 * exist -> file doesn't exist
			 */
			errno = ENOENT;
			return NULL;
		}
	}

If we can do a similar "one level up" lookup, then we should get it for free without recursing up all parents since the index file should list the parent first.

@dscho
Copy link
Member

dscho commented Jun 18, 2019

If we can do a similar "one level up" lookup, then we should get it for free without recursing up all parents since the index file should list the parent first.

Do you still want to do that in the context of this here PR? Or would you rather have me merge it and we'll try to look at reducing the cache misses later?

@derrickstolee
Copy link
Author

If we can do a similar "one level up" lookup, then we should get it for free without recursing up all parents since the index file should list the parent first.

Do you still want to do that in the context of this here PR? Or would you rather have me merge it and we'll try to look at reducing the cache misses later?

This is something to try independently. Thanks!

@derrickstolee derrickstolee merged commit e213867 into git-for-windows:master Jun 18, 2019
dscho added a commit to git-for-windows/build-extra that referenced this pull request Jul 23, 2019
The FSCache feature [is now used with `git checkout` and `git reset`
in sparse checkouts](git-for-windows/git#2224).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants