Investigate caching .git data on Travis/AppVeyor #40772
On the Travis side, it should just be a matter of enumerating all of the directories you want: https://docs.travis-ci.com/user/caching/#Arbitrary-directories |
Yeah unfortunately I don't actually know what directories are cached here. Is it just If it's just |
I've made a stab in #40780. It's not as simple as caching the .git directory - by the time the script runs, you already have a branch checked out and it's not really clear how you'd move the objects around. It's also not as simple as just copying Instead, I've tried to go the route of using the As an aside, I'm a bit suspicious that
Seems like it has to fetch the whole thing anyway in order to get my branch? Maybe worth raising an issue on the travis issue tracker to ask if this is expected? |
FWIW, CircleCI is a bit vague about how their git caching works: https://circleci.com/docs/1.0/how-cache-works/#git-cache
This opacity combined with complaints about the caching ([1], [2]) is not reassuring. I could hazard some guesses at how they're doing it, but it's likely going to be more complicated and therefore less reliable than an approach tailored to the rust repo. |
We used to do this on buildbot, didn't we? IIRC a corrupted .git was a pretty common occurrence. |
I assume you're referring to #34595, which looks like it happens because the cache may contain a partially cloned/corrupt git submodule. This was a problem even without caching (network failure then retry is unhappy because of this same corrupted state) and was seemingly resolved by deinit before update (#39055) - the same fix may have fixed the buildbots (had they still been around). That said, I can think of two approaches for validating before caching off the top of my head (the second problem with the buildbots, aside from not doing deinit, was caching continuously rather than only when the cache was valid) - if we see issues (or you want a more paranoid approach to begin with) I'll be able to put something in place fairly quickly. |
Thanks for the investigation @aidanhs! I'll take a look at #40780 soon. Also yeah @nagisa I'd want to be careful about a solution here. Bad caching can cause unending problems, so I'd want to make sure we're always in a situation where the tool we're caching is relatively robust to odd cache entries. For example |
Call me stupid (since I probably am missing something obvious), but why do Travis CI and Circle CI workers even download the commit history? GitHub has the "download a tarball from https://github.com/rust-lang/rust/archive/master.tar.gz" option, which is probably way faster. |
@notriddle if you already have a repo of N commits, downloading the N+1th commit with That said, Travis definitely has something sub-optimal with cloning - it should be doing a shallow clone of the branch, but instead it does a shallow clone of master and then a full clone of PR branches (defeating the point of the shallow clone). |
…lexcrichton Attempt to cache git modules

Partial resolution of rust-lang#40772, appveyor remains to be done once travis looks like it's working ok.

The approach in this PR is based on the `--reference` flag to `git-clone`/`git-submodule --update` and is a compromise based on the current limitations of the tools we're using. The ideal would be:

1. have a cached pristine copy of rust-lang/rust master in `$HOME/rustsrc` with all submodules initialised
2. clone the PR branch with `git clone --recurse-submodules --reference $HOME/rustsrc git@github.com:rust-lang/rust.git`

This would (in the nonexistent ideal world) use the pristine copy as an object cache for the top level repo and all submodules, transferring over the network only the changes on the branch.

Unfortunately, a) there is no way to manually control the initial clone with travis and b) even if there was, cloned submodules don't use the submodules of the reference as an object cache. So the steps we end up with are:

1. have a cached pristine copy of rust-lang/rust master in `$HOME/rustsrc` with all submodules initialised
2. have a cloned PR branch
3. extract the path of each submodule, and explicitly `git submodule update --init --reference $HOME/rustsrc/$module $module` (i.e. point directly to the location of the pristine submodule repo) for each one

I've also taken some care to make this forward compatible, both for adding and removing submodules.

r? @alexcrichton
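As a rough illustration of step 3 above (not the script from the PR itself), the per-submodule command could be generated along these lines. The function name `submodule_update_commands`, the example submodule paths, and the `/home/travis/rustsrc` cache location are assumptions made for the sketch. Only the `git submodule update --init --reference` invocation comes from the description above.

```python
import posixpath
import shlex


def submodule_update_commands(module_paths, cache_root):
    """Build one `git submodule update` argv per submodule path.

    Each command points --reference at the matching submodule checkout
    inside the cached pristine clone, so objects already present there
    do not need to be fetched over the network again.
    """
    commands = []
    for module in module_paths:
        # e.g. <cache_root>/src/llvm for the submodule at src/llvm
        reference = posixpath.join(cache_root, module)
        commands.append(
            ["git", "submodule", "update", "--init",
             "--reference", reference, module]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical inputs: real paths would be read from the checkout.
    cache = "/home/travis/rustsrc"
    for argv in submodule_update_commands(["src/llvm", "src/liblibc"], cache):
        print(shlex.join(argv))
```

Building the commands as argument lists rather than one shell string keeps paths with unusual characters safe if they are later passed to `subprocess.run`. The sketch does not handle a submodule that is missing from the cache, which the PR description says it takes care of.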
After a bit of a shaky start (merge completed successfully, then subsequent merges failed because appveyor caching was broken) just the appveyor part was rolled back and so just travis builds are currently using the new repo caching and it seems to be working ok. In the middle of trying to fix appveyor last night, I realised that there is a way for me to test appveyor - fork rust, then comment out all CI that actually does any rust compilation etc. That way I'll be able to just test the caching. I'll be back soon with a tested PR for appveyor. |
Thanks for the continuing investigation @aidanhs! |
(oops didn't mean to close) |
Unfortunately, I was completely unable to reproduce the cache restore failure on appveyor, despite trying a number of times. However, after reviewing the logs again, I don't actually think the cache restore failure was the issue, I think it was a different buggy part of the appveyor.yml. There's a new PR to re-enable appveyor at #41075. Cache aside, as part of this issue I do think it's worth looking into this sequence of commands on travis:
I described this above:
The extra time spent here makes the cache a little less effective (40s wasted), and consumes about 60-75% of the time on the no-op PR builds - that's ~ 35x45s across Linux no-op builds and ~ 5x190s across OSX no-op builds, totalling 40-45mins of dead time per PR push! Seems massively wasteful, even if it is in parallel. I've stumbled across something that looks like the ideal solution - appveyor lets you implement a custom |
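For reference, the dead-time estimate quoted above can be checked with a few lines of arithmetic. This assumes "35x45s" and "5x190s" mean 35 Linux jobs at about 45 seconds each and 5 OSX jobs at about 190 seconds each. The variable names are purely illustrative.

```python
# Figures taken from the comment above (read as jobs x seconds per job).
linux_jobs, linux_seconds = 35, 45
osx_jobs, osx_seconds = 5, 190

linux_total = linux_jobs * linux_seconds  # seconds across Linux no-op builds
osx_total = osx_jobs * osx_seconds        # seconds across OSX no-op builds
total_minutes = (linux_total + osx_total) / 60

print(f"Linux: {linux_total}s, OSX: {osx_total}s, total: {total_minutes:.1f} min")
```

Under that reading the total falls inside the 40-45 minute range given in the comment.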
Ok, the inefficiency in travis has been spotted before - travis-ci/travis-ci#6183, travis-ci/travis-build#747. I guess it just needs resurrecting and fixing. Appveyor seems to do it like so:
|
@aidanhs thanks for the investigation! Should we file an upstream travis bug for that? |
@alexcrichton nah I'll just make a PR (resurrecting the one I linked above) at some point in the next week or so. Well, you can raise an issue if you want one for tracking purposes :) |
Sounds good to me, thanks! |
Some updates:
|
Some updates since I've paused work on this (partially to rethink, see 1b):
|
Raised http://help.appveyor.com/discussions/problems/6735-corrupt-caches about the corrupt caches. We've had to implement @notriddle's suggestion to 'temporarily' use .tar.gz files for the llvm submodule in #42211, since it was causing builds to time out when combined with the current appveyor network issues. I'm not delighted, but while caches don't work there aren't really any other great options I'm aware of. |
Appveyor have replied to the issue effectively acknowledging the problem and saying it'll be fixed with cache "v2". |
Triage: not aware of any movement here |
We now use GitHub Actions instead of Travis and AppVeyor. For git submodules we do some magic whereby we download an archive containing just the right version we need and then trick git into thinking it is a regular submodule, I believe. Can this issue be closed? |
The rust-lang/rust repo itself takes a while to clone, but we somewhat mitigate that with `--depth=1` clones. Our submodules, however, are much larger and unfortunately cannot be cloned with a `--depth` argument due to how the branches work. This typically means that cloning the LLVM repo takes quite a long time! Unfortunately this also increases our exposure to network problems by requiring a lot of data to move over the network.

When playing around with CircleCI recently I found that they automatically cached git repository data, which greatly sped up cloning the repository and checking out submodules. Overall it felt quite nifty! We should investigate to see if a similar strategy can apply to Travis and/or AppVeyor. I'm not personally familiar with how CircleCI's git caching works, so some investigation there would be needed (and comments if you're familiar with it would be most welcome!)
Overall I would expect this change to:
Any help to implement this would be very much appreciated!