
Start a new slave per build #22448

Closed
alexcrichton opened this issue Feb 17, 2015 · 8 comments
Labels
E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot.

Comments

@alexcrichton
Member

It is far too common nowadays that one build on our buildbot will end up corrupting all future builds, requiring a manual login to the bot to fix the state of affairs. The most common reason this happens is a rogue process remains running on Windows, preventing recreation of the executable (e.g. causing the compiler to fail to link on all future test runs).

It would likely be much more robust to start a new slave each time we start a new build. This way we're guaranteed a 100% clean slate when starting a build (e.g. no lingering processes). This would, however, require caching LLVM builds elsewhere and having the first step of each build download the most recent LLVM build.

As such, this is not an easy change to implement at all, hence the wishlist status for our infrastructure.

cc @brson

@alexcrichton alexcrichton added E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot. I-wishlist labels Feb 17, 2015
@nagisa
Member

nagisa commented Feb 17, 2015

Can’t ccache caches be shared between machines somehow, perhaps by storing the cache on a network share? This way we could just ccache llvm.
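nagisa's suggestion could be sketched roughly as follows. This is a hypothetical configuration, not something from the issue: the mount point `/mnt/buildcache/ccache` is illustrative, though `CCACHE_DIR`, `--max-size`, and `--show-stats` are real ccache settings. Every slave would mount the same share and point ccache at it:

```shell
# Hypothetical sketch: share one ccache directory across all build slaves
# via a network mount, so LLVM object files compiled on any slave are
# reused by the others. The path is illustrative.
export CCACHE_DIR=/mnt/buildcache/ccache   # NFS/SMB share mounted on each slave
export CC="ccache gcc"
export CXX="ccache g++"

ccache --max-size=20G   # cap the shared cache so it doesn't grow unbounded
ccache --show-stats     # hit/miss statistics, aggregated across slaves
```

One caveat with this design: ccache was built assuming a local filesystem, so cache-wide locking and cleanup over NFS would need verifying before relying on it in CI.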

@alexcrichton
Member Author

That sounds like an excellent solution to the problem!

@brson brson mentioned this issue Feb 18, 2015
@brson
Contributor

brson commented Feb 18, 2015

A distributed ccache does sound useful here.

Slave per build will take a few more minutes per cycle because of the time it takes to start slaves, but it's not a great cost.

I wonder if we could use a simple distributed rustc cache to speed up half of the auto builds.

@vadimcn
Contributor

vadimcn commented Feb 20, 2015

Sounds a bit heavy-handed.
If this were only about Windows, I'd suggest looking into Windows Job Objects. They are basically containers for process groups, and you can set them up so that all processes are killed when the job object is destroyed.
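The Job Object approach vadimcn describes could look roughly like the sketch below. This is a hypothetical, Windows-only illustration (compile with MSVC or MinGW), not code from the buildbots; the `build.bat` command is a stand-in for whatever build step the slave runs. The key piece is `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`, which makes Windows terminate every process in the job when the last job handle is closed, so no rogue process can survive the build:

```c
// Hypothetical sketch of running a build step inside a Job Object so
// that the whole process tree dies with the job handle (Windows-only).
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE job = CreateJobObject(NULL, NULL);

    // Configure the job so every process in it is killed when the
    // last handle to the job is closed.
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {0};
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            &info, sizeof(info));

    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "cmd /c build.bat";  // illustrative build command

    // Start the child suspended so it can be assigned to the job
    // before it has a chance to spawn children outside it.
    if (CreateProcess(NULL, cmd, NULL, NULL, FALSE,
                      CREATE_SUSPENDED, NULL, NULL, &si, &pi)) {
        AssignProcessToJobObject(job, pi.hProcess);
        ResumeThread(pi.hThread);
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }

    CloseHandle(job);  // any stray grandchildren are killed here
    return 0;
}
```

Spawning the child suspended before assignment matters: otherwise the build step could fork a grandchild in the window before `AssignProcessToJobObject` runs, and that grandchild would escape the job.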

@dotdash
Contributor

dotdash commented Mar 4, 2015

Is this still an issue or is this handled by the job group(?) stuff that was implemented for the windows buildbots?

@alexcrichton
Member Author

I believe that using dojob.exe has fixed some of our problems (I haven't seen any issues related to this recently), but I would still like to do this. Right now if you kill a build too quickly it will corrupt the git directory, causing all future builds using that build directory to fail. This currently requires manual intervention to delete the git directory or reboot the slave in question.

@steveklabnik
Member

Triage: not aware of any changes here.

@alexcrichton
Member Author

This likely isn't going to happen; we'd basically have to rewrite our entire CI system.
