Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid worktree checkouts #1207

Merged
merged 6 commits into from
Jul 17, 2023
Merged

avoid worktree checkouts #1207

merged 6 commits into from
Jul 17, 2023

Conversation

Byron
Copy link
Contributor

@Byron Byron commented Jul 17, 2023

Motivation

Clones of the crates-index are very slow on windows., and cargo is also not doing them but uses bare clones instead.

Tasks

  • create bare clones to not even have a worktree checkout
  • access the repository directly instead of the worktree checkout

Guidance

If you point me to the place where the full checkout is needed, I could probably figure out the rest from there as long as a git repository is present that contains the crates-index. It's not a big deal at all (probably not even famous last words ;)).

Here is the code we use for cloning and checkout.

It is then used in git_registry.rs and resolve.rs. In both scenario, they try to open a few files (could contain directory) and read their contents into memory and parse them.

In git_registry.rs, we need to read config.json and then read a registry file for each crate queried here.

In resolve.rs, we would call load_manifest_from_workspace to load the find the crate requested in the workspace.

Source

@Byron Byron marked this pull request as ready for review July 17, 2023 13:41
@Byron
Copy link
Contributor Author

Byron commented Jul 17, 2023

@NobodyXu This should do the trick!

I was definitely surprised when I saw that it will do a temporary clone of the crates index just to get a single file, but at least it's better now that it doesn't have to checkout 120k files to get there. I also now realize how prohibitive it would be to try to install a binary on windows that way.

Somehow I expected it to keep a copy of the index around and maintain that instead, as otherwise there seems to be no value to use git at all here, just overhead. Instead, I thought one would fetch the file from GitHub directly.

In any case, please feel free to push changes directly into this PR so it can be merged quickly. I consider this more like a basis that you can build upon if there is more to add.

Thanks and cheers.

crates/binstalk/src/helpers/git.rs Outdated Show resolved Hide resolved
crates/binstalk/src/helpers/git.rs Show resolved Hide resolved
crates/binstalk/src/helpers/git.rs Show resolved Hide resolved
crates/binstalk/src/helpers/git.rs Outdated Show resolved Hide resolved
@NobodyXu
Copy link
Member

@Byron Thank you very much for this PR!

I was definitely surprised when I saw that it will do a temporary clone of the crates index just to get a single file, but at least it's better now that it doesn't have to checkout 120k files to get there.

Yeah, it's much better than checking out 120k files.

I also now realize how prohibitive it would be to try to install a binary on windows that way.

This is even more true if you have the virus scanner turned on...
A disaster for performance.

Somehow I expected it to keep a copy of the index around and maintain that instead, as otherwise there seems to be no value to use git at all here, just overhead.

It's cloned only once in crates/bin/src/entry.rs#L147.

It uses OnceCell<GitIndex> to lazily initialize the git repository.

Instead, I thought one would fetch the file from GitHub directly.

GitRegistry is intended to be generic and support any git hosting service.

In any case, please feel free to push changes directly into this PR so it can be merged quickly. I consider this more like a basis that you can build upon if there is more to add.

Hmm, I'm not sure if I have the permission to push to your repository, and I usually don't like pushing to others' PR.

This PR is actually quite well done and you would just need to take a few changes annd modify load_manifest_from_workspace and resolve.rs.

@Byron
Copy link
Contributor Author

Byron commented Jul 17, 2023

Thanks for the notes! I hoped you could apply the changes yourself and push directly into my branch. With gh checkout 1207 gh pr checkout 1207 one can make changes, and just git push them back right here. I found it's faster to just do the necessary adjustments myself than to communicate them.
Please give it a shot, or let me know if you'd rather have me perform these actions.

@NobodyXu

This comment was marked as outdated.

@NobodyXu
Copy link
Member

NobodyXu commented Jul 17, 2023

Thanks for the notes! I hoped you could apply the changes yourself and push directly into my branch. With gh checkout 1207 one can make changes, and just git push them back right here.

Thanks, I didn't know I could do that!

I found it's faster to just do the necessary adjustments myself than to communicate them. Please give it a shot, or let me know if you'd rather have me perform these actions.

True, thanks for the advice, I would make that change real quick.

@NobodyXu
Copy link
Member

With gh checkout 1207 one can make changes

Typo: I think you mean gh pr checkout 1207

@NobodyXu
Copy link
Member

@Byron I need some help: is there anyway to find out if a path is a dir or a regular file or symlink?
Ideally I would like load_manifest_from_workspace to also avoid checking out entire worktree, though I can delay it to another PR.

@Byron
Copy link
Contributor Author

Byron commented Jul 17, 2023

@Byron I need some help: is there anyway to find out if a path is a dir or a regular file or symlink?

When looking up an entry in a tree, one can also ask for the entry's mode. The mode indicates if it's a blob, symlink, or executable. Indeed, symlinks would need another lookup then.

@NobodyXu
Copy link
Member

NobodyXu commented Jul 17, 2023

@Byron I need some help: is there anyway to find out if a path is a dir or a regular file or symlink?

When looking up an entry in a tree, one can also ask for the entry's mode. The mode indicates if it's a blob, symlink, or executable. Indeed, symlinks would need another lookup then.

Thanks, is there anyway to list dir and tell if a path corresponds to a directory?

Edit:

Oh I see it's EntryMode::Tree isn't it?

But still I don't know how to get the destination of the symlink and children of the dir.

@NobodyXu
Copy link
Member

But still I don't know how to get the destination of the symlink and children of the dir.

So for directory, if my guess is right, the mode will be EntryMode::Tree and I can use Object::try_into_tree to get another tree which I can lookup.

For symlink, according to my knowledge, git simply encodes the destination as the content of the file, so I just need to obtain the Entry::data.

But converting it to Path would take some effort since it is Vec<u8>, or perhaps I can split the &[u8] by / (assuming git always store path splitted by /) and feed it to Tree::lookup_entry?

@Byron Is my guess about this correct?

Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
and also refactor implementation of `git::Repository::{shallow_clone,
shallow_clone_bare}`.

Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
@Byron
Copy link
Contributor Author

Byron commented Jul 17, 2023

But still I don't know how to get the destination of the symlink and children of the dir.

So for directory, if my guess is right, the mode will be EntryMode::Tree and I can use Object::try_into_tree to get another tree which I can lookup.

Yes, that should work.

For symlink, according to my knowledge, git simply encodes the destination as the content of the file, so I just need to obtain the Entry::data.

But converting it to Path would take some effort since it is Vec<u8>, or perhaps I can split the &[u8] by / (assuming git always store path splitted by /) and feed it to Tree::lookup_entry?

@Byron Is my guess about this correct?

It could be a file or directory, but since it's the crates.io repository assumptions can be made. It's also a fair assumption that that paths are separated by /. And I say assumption git doesn't normalize the contents of symlinks at all, it just reads it and stores it as is.

However, I think that std::path::Path can handle this on all platforms.

Is this really something that is needed? The entire index doesn't seem to have a single symlink.

github.com-1ecc6299db9ec823 ( main)
❯ gix index entries | rg SYM

@NobodyXu
Copy link
Member

But still I don't know how to get the destination of the symlink and children of the dir.

So for directory, if my guess is right, the mode will be EntryMode::Tree and I can use Object::try_into_tree to get another tree which I can lookup.

Yes, that should work.

Thanks for confimration!

For symlink, according to my knowledge, git simply encodes the destination as the content of the file, so I just need to obtain the Entry::data.
But converting it to Path would take some effort since it is Vec<u8>, or perhaps I can split the &[u8] by / (assuming git always store path splitted by /) and feed it to Tree::lookup_entry?
@Byron Is my guess about this correct?

It could be a file or directory, but since it's the crates.io repository assumptions can be made. It's also a fair assumption that that paths are separated by /. And I say assumption git doesn't normalize the contents of symlinks at all, it just reads it and stores it as is.

However, I think that std::path::Path can handle this on all platforms.

Well std::path::Path::new accepts AsRef<OsStr> and AsRef<OsStr> is not implemented for &[u8].
So if git doesn't normalize content of symlinks and it could use / or \, then it will be a little bit tricky.

Is this really something that is needed? The entire index doesn't seem to have a single symlink.

github.com-1ecc6299db9ec823 ( main)
❯ gix index entries | rg SYM

I was actually thinking about how to avoid checking out files for --git in #1208
Granted, usually repository should be pretty fast to checkout anyway but I wonder if there is any giant repository that might be helped by this.

Not sure how this interfaces with submodule though, and I'm not sure if people will use --git on repository with submodule and expects the submodule to be checked out, although for correctness it seems to be better to check it out.

NobodyXu added 2 commits July 18, 2023 01:10
Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
@Byron
Copy link
Contributor Author

Byron commented Jul 17, 2023

Well std::path::Path::new accepts AsRef and AsRef is not implemented for &[u8].
So if git doesn't normalize content of symlinks and it could use / or , then it will be a little bit tricky.

gix::path:: should have you covered when it comes to conversions. Note that Vec converts to BString, and from there one can probably do gix::path::from_bstr(path) to get an actual Path.

Granted, usually repository should be pretty fast to checkout anyway but I wonder if there is any giant repository that might be helped by this.

Not sure how this interfaces with submodule though, and I'm not sure if people will use --git on repository with submodule and expects the submodule to be checked out, although for correctness it seems to be better to check it out.

I see that you would like to cover every possible case, any kind of custom crates.io repo that the user throws at you. To my mind it's always preferable to gracefully fail with assertions or errors and wait until people create issues for that. After all, nobody might ever run into these cases. For gitoxide I have to operate like this all the time - the core path I need is 100%, and along the fringes it's allowed to degrade gracefully and bug reports/features requests are the indicator for demand.

@NobodyXu
Copy link
Member

Well std::path::Path::new accepts AsRef and AsRef is not implemented for &[u8].
So if git doesn't normalize content of symlinks and it could use / or , then it will be a little bit tricky.

gix::path:: should have you covered when it comes to conversions. Note that Vec converts to BString, and from there one can probably do gix::path::from_bstr(path) to get an actual Path.

Thanks!

Granted, usually repository should be pretty fast to checkout anyway but I wonder if there is any giant repository that might be helped by this.

Not sure how this interfaces with submodule though, and I'm not sure if people will use --git on repository with submodule and expects the submodule to be checked out, although for correctness it seems to be better to check it out.

I see that you would like to cover every possible case, any kind of custom crates.io repo that the user throws at you. To my mind it's always preferable to gracefully fail with assertions or errors and wait until people create issues for that. After all, nobody might ever run into these cases. For gitoxide I have to operate like this all the time - the core path I need is 100%, and along the fringes it's allowed to degrade gracefully and bug reports/features requests are the indicator for demand.

Well it's actually not for registry, but custom --git:

      --git <GIT>
          Override how to fetch Cargo.toml package manifest.
          
          This skip searching crates.io and instead clone the repository specified and runs as if `--manif
est-path $cloned_repo` is passed to binstall.
          
          This option cannot be used with `--manifest-path`.

It is the same as --git from cargo-install, it clones the repository specified and install the software specified by its Cargo.toml.

@NobodyXu
Copy link
Member

NobodyXu commented Jul 17, 2023

This PR also speeds up running cargo-test in CI, since binstalk also clones the git and sparse registry in #[test].

@NobodyXu NobodyXu enabled auto-merge July 17, 2023 15:40
@NobodyXu NobodyXu added this pull request to the merge queue Jul 17, 2023
@NobodyXu
Copy link
Member

Can confirm that cloning crates.io git registry now takes about 13s on CI.

@NobodyXu
Copy link
Member

Well it's actually not for registry, but custom --git:

Hmmm I gave it up because

it seems like that if we use shallow_clone_bare when dealiing when user provided --git, then we won't be able to handle submodule...

And load_manifest_from_workspace would have to be way more complicated with a trait for virtual filesystem.

Given that it is unlikely for any repository to be as large as crates.io-registry, I think it's better left as-is.

from #1208 (comment)

Merged via the queue into cargo-bins:main with commit 5acfda9 Jul 17, 2023
@Byron Byron deleted the bare-clones branch July 17, 2023 16:29
crates/binstalk/src/helpers/git.rs Show resolved Hide resolved
crates/binstalk/src/helpers/git.rs Show resolved Hide resolved
crates/binstalk/src/helpers/git.rs Show resolved Hide resolved
@NobodyXu
Copy link
Member

@Byron Thank you for your work!
This PR will soon go into the next release of cargo-binstall!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants