Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: run static and non-static Rust builds in parallel #1368

Merged
merged 2 commits into from
Mar 8, 2021

Conversation

tjkirch
Copy link
Contributor

@tjkirch tjkirch commented Mar 5, 2021

Description of changes:

There's a long tail of crate builds for each that can be mitigated by running
them in parallel, saving a fair amount of time.

This saves ~1.5 minutes on every build! (On a fast machine.) 🎉

(Parallel cargo has been safe for a while now, for reference.)

Other options considered:

  • A Makefile that runs each part in parallel. Discarded because it's another file in another format, because it'd be harder to separate output of each job, and because we don't expect our needs in this area to expand.
  • Shell functions to handle background jobs more generically, saving PIDs and output files. Discarded because it'd be harder to have pertinent messaging around each job, and because it added a fair amount of complexity where we don't expect our needs to expand.
  • Similar to this PR, but running both jobs in the background and saving output to separate files. Discarded because it adds a bit more complexity without clear benefit, and I hold out hope for a way to see logs from package builds as they're running, though today they seem to be buffered and only written to the log file when complete.

Testing done:

With the change, I see static and non-static rustc processes at the same time. If I touch sources/**/*.rs, building the os package is pretty consistent at 4 minutes 18 seconds.

Without the change, I see non-static and then static processes serially, and it's pretty consistent at around 5 minutes 48 seconds.

I tested the failure cases by adding a nonexistent crate name to the build list for static and non-static (in separate runs). With a fake package in non-static list: quick failure, clear error. With a fake package in static list: still get a clear error, though after a few minutes, when the non-static build has finished. (Since static builds used to run afterward, it's about the same delay.)

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

There's a long tail of crate builds for each that can be mitigated by running
them in parallel, saving a fair amount of time.
@tjkirch tjkirch requested a review from bcressey March 5, 2021 21:35
@tjkirch tjkirch marked this pull request as draft March 5, 2021 22:09
@tjkirch
Copy link
Contributor Author

tjkirch commented Mar 5, 2021

In about half of the CI runs, with no obvious pattern, this seems to have had a bad interaction with the build.rs logic for models that creates a needed symlink. I didn't see this locally. Setting the PR back to draft while I figure it out.

[edit] In all 7 failure cases, the static build (which is now running earlier) failed to find the link created by models build.rs. It seems build.rs may be skipped every time in the static build (where it should only be skipped after the first time) and the static build hits that point before the non-static build can create the link, so it's failing. I'm going to check whether the CARGO_CFG_TARGET_VENDOR we're using to determine if we should run build.rs is different in static builds.

[edit] That's basically it. For non-static runs I see vendor "bottlerocket" and "unknown", and for static it's only "bottlerocket". The build.rs exits early when it sees "bottlerocket" so we don't run it twice. It worked before because static always ran afterward, when the link already existed from non-static builds. Half of the builds succeeding is almost surely due to chance, i.e. the non-static builds racing to models build.rs first.

This is a partial revert of 79465cd; the nicer
structure for main() is kept, but the actual early-exit is removed.

We don't actually need the early-exit if it's safe to run this build.rs in
parallel, which it is since 3943a32 introduced
safe link swapping.  The other action it takes, README generation, is skipped
during production builds, and would be obvious if it went wrong during local
builds due to version controlled changes, though it should also be safe there
as long as our READMEs don't balloon and become unsafe to write in parallel.

The reason for removing this is that it effectively worked by chance, because
we were running non-static builds before static builds.  We'd like to run both
sets of builds in parallel to save a significant amount of time.  Static
builds, as run in os.spec, only run build.rs once, with a vendor of
"bottlerocket".  This means build.rs was skipped, and unless non-static builds
won the race and ran their build.rs before static builds got to that crate, the
link wouldn't be created and the build would fail.
@tjkirch
Copy link
Contributor Author

tjkirch commented Mar 5, 2021

^ I added a commit that addresses the build issue mentioned above.

models build.rs: remove early exit for target toolchain

This is a partial revert of 79465cd6d3f1e6f121dd15915a0346fee2bbe9ac; the nicer
structure for main() is kept, but the actual early-exit is removed.

We don't actually need the early-exit if it's safe to run this build.rs in
parallel, which it is since 3943a32212a424c6dc0c0c97e3f15b0f627e937c introduced
safe link swapping.  The other action it takes, README generation, is skipped
during production builds, and would be obvious if it went wrong during local
builds due to version controlled changes, though it should also be safe there
as long as our READMEs don't balloon and become unsafe to write in parallel.

The reason for removing this is that it effectively worked by chance, because
we were running non-static builds before static builds.  We'd like to run both
sets of builds in parallel to save a significant amount of time.  Static
builds, as run in os.spec, only run build.rs once, with a vendor of
"bottlerocket".  This means build.rs was skipped, and unless non-static builds
won the race and ran their build.rs before static builds got to that crate, the
link wouldn't be created and the build would fail.

I tested locally to confirm that build.rs was run for the static build.

[edit] CI passed all variants. Going to rerun it all to get a little more confidence.

@tjkirch
Copy link
Contributor Author

tjkirch commented Mar 6, 2021

Passed two full CI runs after the fix above, so I'm setting this as ready to review again.

@tjkirch tjkirch marked this pull request as ready for review March 6, 2021 00:14
Copy link
Contributor

@zmrow zmrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

📸

Copy link
Contributor

@srgothi92 srgothi92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.
A basic question, I am trying to understand what makes cargo build run sequential vs parallel. Does %{nil} at the end of %cargo_build_static command makes cargo build sequential and wait for build to complete ?

@tjkirch
Copy link
Contributor Author

tjkirch commented Mar 8, 2021

Does %{nil} at the end of %cargo_build_static command makes cargo build sequential and wait for build to complete ?

@srgothi92 The %{nil} was just there in case there were no migrations in the migrations list, so that the shell wouldn't assume the next line is part of the command in that case. rpm expands it to nothing, so it was just a way to definitely end the command string there. In the new version, we have part of the command after the migrations list (the redirections and backgrounding) so we don't need that extra marker, we know we have the end of the command there regardless of migrations.

What made it sequential was that they were shell commands run one after the other, without either of them running in the background or anything. With this change we're explicitly asking the shell to run the first job in the background with & so we can run another one in parallel.

@srgothi92
Copy link
Contributor

What made it sequential was that they were shell commands run one after the other, without either of them running in the background or anything. With this change we're explicitly asking the shell to run the first job in the background with & so we can run another one in parallel.

Ahh, I see so & does the magic. Thanks for such detailed explanation.

@tjkirch tjkirch merged commit a290006 into bottlerocket-os:develop Mar 8, 2021
@tjkirch tjkirch deleted the os-parallel branch March 8, 2021 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants