Shallow clones of submodules #34228

Closed
srawlins opened this issue Jun 11, 2016 · 9 comments

Comments

@srawlins

(This is a new issue for this comment on #30107.)

Right now configure performs deep clones of submodules (recursively). This means downloading tons and tons of history (especially in the case of LLVM), which is bad for slow connections, etc.

But we should be able to perform "shallow" clones, passing --depth 1 to git submodule update.

I've tried this locally, and git wasn't happy about it:

rust$ ./configure
...
configure: git: submodule init
configure: git: submodule update --depth 1
error: no such remote ref 80ad955b60b3ac02d0462a4a65fcea597d0ebfb1
Fetched in submodule path 'src/llvm', but it did not contain 80ad955b60b3ac02d0462a4a65fcea597d0ebfb1. Direct fetching of that commit failed.
configure: error: git failed

I am pretty sure this is this issue. The solution requires the uploadpack.allowReachableSHA1InWant repository configuration flag, which, if isaacs/github#436 is up to date, is not set on GitHub.

So I think this issue is blocked on isaacs/github#436, for a start.
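For what it's worth, the failure mode and the effect of that server-side flag can be reproduced with two throwaway local repositories (all paths and names below are made up for illustration). Once the repository being fetched *from* sets uploadpack.allowReachableSHA1InWant, a --depth 1 submodule update succeeds even when the pinned commit is no longer a branch tip:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# "Upstream" dependency repo with some history.
git init -q dep
git -C dep -c user.email=a@b -c user.name=a commit -q --allow-empty -m one
git -C dep -c user.email=a@b -c user.name=a commit -q --allow-empty -m two
# The server-side flag from the linked issue; without it, a direct fetch
# of an arbitrary (reachable but non-tip) SHA1 is refused.
git -C dep config uploadpack.allowReachableSHA1InWant true

# Superproject pinning dep at its current tip as a submodule.
git init -q super
git -C super -c protocol.file.allow=always submodule add "file://$tmp/dep" dep
git -C super -c user.email=a@b -c user.name=a commit -q -m 'add submodule'
pin=$(git -C super/dep rev-parse HEAD)

# dep moves on, so the pinned commit is no longer a branch tip --
# exactly the situation that made --depth 1 fail in the report above.
git -C dep -c user.email=a@b -c user.name=a commit -q --allow-empty -m three

# Fresh checkout, then a shallow submodule update: the fallback fetch of
# the pinned SHA1 now succeeds because the flag is set on the remote.
git clone -q super super2
git -C super2 -c protocol.file.allow=always submodule update --init --depth 1 dep
git -C super2/dep rev-parse HEAD   # matches $pin
```

(protocol.file.allow=always is only needed because this demo uses file:// submodule URLs, which newer git versions block by default.)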

@sanmai-NL

Sorry for barging in, but why are git submodules even used for build dependencies instead of git archives? Downloading the archive from https://github.com/rust-lang/llvm/archive/master.tar.gz or some tagged version is faster and does not interact with local git configuration.

  time git clone --depth=1 'https://github.com/rust-lang/llvm.git' '/tmp/llvm/'
Cloning into '/tmp/llvm.tar'...
remote: Counting objects: 10756, done.
remote: Compressing objects: 100% (10259/10259), done.
remote: Total 10756 (delta 1181), reused 4041 (delta 413), pack-reused 0
Receiving objects: 100% (10756/10756), 14.40 MiB | 4.58 MiB/s, done.
Resolving deltas: 100% (1181/1181), done.
Checking connectivity... done.
1.32user 0.30system 0:05.52elapsed 29%CPU (0avgtext+0avgdata 23764maxresident)k
0inputs+25336outputs (0major+884minor)pagefaults 0swaps
time curl -L 'https://github.com/rust-lang/llvm/archive/master.tar.gz' | tar -C '/tmp/llvm2/' -xzf -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   122    0   122    0     0    234      0 --:--:-- --:--:-- --:--:--   234
100 12.3M  100 12.3M    0     0  2947k      0  0:00:04  0:00:04 --:--:-- 4233k
0.11user 0.11system 0:04.30elapsed 5%CPU (0avgtext+0avgdata 8956maxresident)k
0inputs+0outputs (0major+880minor)pagefaults 0swaps

@alexcrichton
Member

I believe we can do this today with git clone --depth 1 so long as we choose the right branch to clone as well (e.g. also pass --branch).

@sanmai-NL yeah git submodules aren't always the speediest but they work most reliably because all you need is git to clone the repo, not other tools.
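The --branch idea is easy to check locally (synthetic repo, made-up tag name). --branch accepts tags as well as branches, so a pinned release can be fetched as a single-commit shallow clone without fetching by SHA1 at all:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Synthetic "LLVM" with history before and after a tagged release.
git init -q llvm
git -C llvm -c user.email=a@b -c user.name=a commit -q --allow-empty -m one
git -C llvm -c user.email=a@b -c user.name=a commit -q --allow-empty -m two
git -C llvm tag rust-llvm-release        # hypothetical tag name
git -C llvm -c user.email=a@b -c user.name=a commit -q --allow-empty -m three

# --branch works for tags too; the wanted commit is then a ref tip, so no
# uploadpack.allowReachableSHA1InWant is needed on the server.
git clone -q --depth 1 --branch rust-llvm-release "file://$tmp/llvm" llvm-shallow
git -C llvm-shallow rev-list --count HEAD   # 1: only the tagged commit
```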

@sanmai-NL

sanmai-NL commented Jun 14, 2016

@alexcrichton: that makes sense. On the other hand, local Git configuration can influence behavior during the build, which makes git a less suitable tool for batch downloading than e.g. curl. It appears that curl in some form is a dependency of git anyway, BTW; if so, then using curl directly is the better option, I think.

@srawlins
Author

The submodules are updated every time you run ./configure. With submodules you get incremental updates that curl could not provide, right?

@sanmai-NL

sanmai-NL commented Jun 17, 2016

I think this topic needs benchmarks and other evaluation. I recall that Rust build times are a concern, and this dependency-fetching step looks like low-hanging fruit for reducing them.

Use case
Conditions:

  1. Submodules are used by the project to manage a few large dependencies.
  2. A rebuild of the project is to be performed.

Actors: developers interested in nightly builds of the project, as opposed to end users and packagers.
Scenario: the developer starts a rebuild of Rust within the same copy of rust-lang/rust (either a Git clone or an extracted archive).

A quick analysis of the solutions so far, purely focusing on efficiency considerations

Solution A: Current solution
Feasibility: possible, implemented.
Cost:

  1. Initially → high? (needs more evidence)
  2. Subsequently, when the pinned revisions are the same → << (1.).
  3. Subsequently, when they have changed → < (1.).

Solution B: Current solution with --depth
Feasibility: Not possible unless GitHub changes its infra. GitHub hasn't responded to the issue since it was filed about a year ago.
Cost if feasible:

  1. Initially → low (needs more evidence).
  2. Subsequently, when the pinned revisions are the same → << (1.).
  3. Subsequently, when they have changed → < (1.).

Solution C: curl with efficient content updating (-z/--time-cond option)

# -z/--time-cond sends If-Modified-Since based on the local file's mtime,
# so the archive is only redownloaded when the server reports it as newer.
mkdir -- '/tmp/llvm/' &&
cd -- "$_" &&
curl --location --time-cond 'llvm-master.tar.gz' --output 'llvm-master.tar.gz' 'https://codeload.github.com/rust-lang/llvm/tar.gz/master' &&
tar -x -z -f 'llvm-master.tar.gz' &&
cd -

Feasibility: Possible if curl or a similar utility is available on build systems.
Cost if feasible:

  1. Initially → very low (needs more evidence).
  2. Subsequently, when the pinned revisions are the same ∧ GitHub changed its infra → << (1.), else == (1.).
  3. Subsequently, when they have changed → == (1.).

Observations

  1. Cost if feasible, case 3, is unlikely and only relevant for this specific use case in the first place. Shouldn't builds happen against a fixed version of dependencies? Even if that is not a hard guarantee, it holds indirectly for Rust: e.g. their fork of the LLVM dependency isn't updated regularly, as far as I am aware. Maybe it will change (or has changed) to depend on a fixed tag of the fork repo; in that case the cost for case 3 is irrelevant.
  2. Solution B is not possible at all currently; solution C is already possible, and currently only loses its efficiency advantage in case 2. It is possibly less efficient in case 3, assuming that redownloading takes longer than git submodule syncing (not a given, I'd say!).
  3. Solution A is costly, especially in the average case of single build runs per copy.
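As a starting point for such a benchmark, the gap between solution A and solution B can be made concrete on a synthetic local repository (sizes here are trivial; only the commit counts matter, and real LLVM history would make the difference far larger):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Synthetic dependency with 50 commits of history.
git init -q dep
for i in $(seq 1 50); do
  echo "$i" > dep/file
  git -C dep add file
  git -C dep -c user.email=a@b -c user.name=a commit -q -m "c$i"
done

git clone -q "file://$tmp/dep" deep                # solution A: full history
git clone -q --depth 1 "file://$tmp/dep" shallow  # solution B: tip only

git -C deep rev-list --count HEAD      # 50
git -C shallow rev-list --count HEAD   # 1
```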

@MagaTailor

MagaTailor commented Jun 17, 2016

Even though we're talking about an initial cost, doing the whole build in RAM (a ramdisk) is perfectly possible with just 3 GB plus compression (probably less without full cloning).

Downloading a source snapshot is preferable in that scenario but using an equivalent shallow clone could become attractive too.

Regardless of anything else, my original issue still stands and looks like the lowest of low hanging fruit.

@Mark-Simulacrum
Member

@aidanhs: Can you comment on this? I think you're somewhat familiar with git submodule related issues (caching, non-deep cloning, etc.)

cc #40474, since this is potentially helpful for resolving that.

@aidanhs
Member

aidanhs commented May 7, 2017

This is an interesting issue because it talks about the experience of users on slow connections, rather than CI (which has been my focus); the solution I have in mind for CI (caching) doesn't end up with any benefits for users. Let's pretend, for the sake of discussion, that "GitHub doesn't allow shallow submodule clones" were resolved.

For CI I personally think caching is a better avenue to investigate (as I mention on the issue linked above), but any solution suggested here would be a great start.

@Mark-Simulacrum
Member

Closing. Shallow clones of submodules aren't feasible today due to not knowing how shallow we need to be.

6 participants