Triaging/improving the number of crates classified as build-fail #589

Open
saethlin opened this issue Dec 25, 2021 · 2 comments
saethlin commented Dec 25, 2021

I'm writing this up because on Zulip, simulacrum suggested nobody had done this before. If this is already known, I hope I don't seem patronizing by writing it up.

There were a number of regressions related to a recent LLVM version bump which resulted in rampant resource utilization for a number of crates. To a non-expert this is confusing: shouldn't crater have detected that? Unfortunately, it looks like crater has a serious problem with OOMs generally: #564 #562 #544 #516 #490 #484. I'm hoping that by looking into a lot of the available logs I can make some suggestions, maybe put in a PR (though I've never worked in this codebase), and generally improve the quality of crater runs.

I did some very basic poking around all of the 1.54, 1.55, 1.56, 1.57, and 1.58 runs, only considering published crates. It looks like the sorts of failures we see are basically the same from version to version. So I focused on just 1.58, because I'm much more concerned about systematic behavior among the build failures and what can be done about it.

The 1.58 run has 14,921 build-fail/reg crates. Of those...

  • 7,545 have some sort of rustc error with an error code which could conceivably be parsed
  • 2,560 have some kind of custom build script error. Mostly these are failed attempts to locate or build a C/C++ dependency
  • 1,409 blindly try to link against some library that doesn't exist on the system
  • 389 contain #![experimental]
  • 370 encounter some other kind of OOM, most often in the linker, then in compiling a C/C++ dependency
  • 193 try to include_bytes!/include_str! some file that's not in the repo
  • 162 time out with no output
  • 114 experience some other kind of linker error (one crate even tries to build itself with asan and fails)
  • 24 have a truncated log

There are also a lot that I didn't categorize, such as attempts to compile with macOS frameworks, use of llvm_asm!, missing eh_personality, and a lot of crates that require the user to turn on a non-default feature to build.
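
Since the largest bucket above is rustc errors that carry an error code, here is a rough sketch (not crater's actual code; the log text is made up) of how such codes could be pulled out of a build log for bucketing:

```rust
// Hypothetical sketch: extract rustc error codes (e.g. "E0432") from a
// build log so failures can be grouped by code. Uses only std, scanning
// for the "error[E...]" pattern rustc emits.
fn error_codes(log: &str) -> Vec<String> {
    let mut codes = Vec::new();
    let mut rest = log;
    while let Some(i) = rest.find("error[E") {
        let tail = &rest[i + "error[".len()..];
        match tail.find(']') {
            Some(end) => {
                let code = &tail[..end];
                // rustc codes are "E" followed by four digits
                if code.len() == 5 && code[1..].chars().all(|c| c.is_ascii_digit()) {
                    codes.push(code.to_string());
                }
                rest = &tail[end..];
            }
            None => break,
        }
    }
    codes
}

fn main() {
    // Assumed log lines, just for illustration
    let log = "error[E0432]: unresolved import `foo`\nerror[E0599]: no method named `bar`";
    assert_eq!(error_codes(log), ["E0432", "E0599"]);
}
```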

My biggest concern with the current setup is that the number of CPUs a build is spawned with is sporadic, and this alone causes a significant number of OOMs. The most hilarious case of this that I've found is memx. The author has quite diligently written 35 integration test binaries, which means on a 64-core machine each integration test has only 44 MB to work with. That's enough for rustc, actually, but not for all the ld processes. regex fails most crater runs for the same reason, but its codebase is much more memory-intensive: even with only 4 CPUs, regex will OOM building its tests.
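
As a back-of-envelope check of the memx numbers above (assuming the 1.5 GB container limit mentioned later in this thread is the relevant bound):

```rust
// Rough arithmetic for the memx case: a 1.5 GB limit split across 35
// concurrently linking test binaries leaves each ld process ~44 MB.
// The 1536 MB figure is an assumption based on the limit cited below.
fn main() {
    let limit_mb = 1536.0; // assumed crater memory limit (1.5 GB)
    let parallel_links = 35.0; // memx's integration test binaries
    let per_link_mb = limit_mb / parallel_links;
    println!("~{:.0} MB per linker", per_link_mb);
    assert!(per_link_mb > 40.0 && per_link_mb < 50.0);
}
```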

This is why the vast, vast majority of spurious-fixed and spurious-regressed crates are regressed to or fixed from an OOM. They OOM when they're randomly assigned to an environment that happens to have too many CPUs, then most likely are assigned to an environment next time that has far fewer.

The build timeouts (no output for 300 seconds) are also interesting. Since there are only 162 of them, I tried to reproduce all of them myself. Most of them are not reproducible. But I did find a few true positives lurking in there:

  • savage and sdc-parser push the 1.5 GB limit even with a single job. They probably look like timeouts, but only on account of the memory limit.
  • fungui could possibly be considered a compiler hang on Rust 1.57, but not on the current nightly. It's not clear to me whether, had this been noticed, crater could have spotted a compile-time regression that it otherwise missed.
  • ilvm needs 30 minutes to compile on Rust 1.57, the 1.58 beta, and current nightly. I think it qualifies as a compiler hang; the codebase is pretty small and simple for that long a compilation.

If we only saw 4 build timeouts instead of 162, perhaps they could have been manually inspected on every crater run. So perhaps there's an opportunity here?

Some ideas:

The root problem with all the spurious OOMs is that the peak memory usage of a build scales with the number of CPUs available, but crater doesn't scale the available memory up even as it scales the number of available CPUs randomly by a factor of 10 or more. Setting a job limit on cargo would only be a partial solution because there are plenty of build scripts that compile C libraries that fan out parallelism to the number of CPUs detected. I think it would be a huge improvement to limit the number of CPUs or provide a memory limit that scales up with the number of CPUs.
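
A minimal sketch of the "memory-aware job limit" idea, assuming a fixed per-job memory budget (the 256 MB budget in the example is illustrative, not a measured figure, and this is not crater's actual logic):

```rust
// Cap build parallelism so that jobs * per-job budget fits inside the
// container's memory limit, rather than blindly using all CPUs.
fn job_limit(cpus: usize, mem_limit_mb: usize, per_job_mb: usize) -> usize {
    let by_memory = (mem_limit_mb / per_job_mb).max(1);
    cpus.min(by_memory)
}

fn main() {
    // 64-core host with a 1.5 GB limit: memory, not CPU count, should
    // bound parallelism (here, 6 jobs instead of 64).
    assert_eq!(job_limit(64, 1536, 256), 6);
    // 4-core host: the CPU count is already the tighter bound.
    assert_eq!(job_limit(4, 1536, 256), 4);
}
```

The result could then be passed to cargo as `-j`/`--jobs`, though as noted above that only constrains cargo itself, not build scripts that detect CPUs on their own.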

The build timeouts, as well as the runs that crater summaries already categorize as errors, are also quite interesting. Quite a few just look like this:

[INFO] fetching crate asfa 0.9.0...
[ERROR] this task or one of its parent failed!
[ERROR] no output for 300 seconds
[ERROR] note: run with `RUST_BACKTRACE=1` to display a backtrace.

This crate with the same version was test-pass in 1.56, test-fail in 1.57, and error in 1.58. This sort of output smells like a transient network error. Is there a retry mechanism for crate builds? And even if there isn't, it would be good to get a lot more logging related to downloads so that we could have more hope of diagnosing these. The error crates aren't build-fail (which is what the title says) but this seems like the same pathology as the timeout crates suffer from; almost like a sudden loss of networking.
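
If there is no retry mechanism, something along these lines might help with the transient failures. This is a generic sketch with exponential backoff; `op` is a stand-in closure, not crater's real fetch API:

```rust
use std::thread::sleep;
use std::time::Duration;

// Retry a fallible operation (e.g. a crate download) up to `attempts`
// times, doubling the delay between tries. Purely illustrative.
fn retry<T, E>(attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut delay = Duration::from_millis(10);
    for i in 0..attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if i + 1 == attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }
    unreachable!("attempts must be > 0")
}

fn main() {
    let mut tries = 0;
    // Simulate a fetch that fails twice with a transient error, then succeeds.
    let result: Result<&str, &str> = retry(3, || {
        tries += 1;
        if tries < 3 { Err("transient network error") } else { Ok("crate fetched") }
    });
    assert_eq!(result, Ok("crate fetched"));
    assert_eq!(tries, 3);
}
```

Even with retries, logging each download attempt and its failure reason would make these `no output for 300 seconds` cases far easier to diagnose after the fact.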

saethlin (Member Author) commented
The OOM problem seems to disproportionately afflict the most-downloaded crates (serde_json, time, hashbrown, and many others), because they often or always OOM while running rustdoc, which gets classified in a crater report as test-fail, though the cause is noted as "test OOM".

202 of the top 1,000 crates are currently routinely failing crater runs. Of those 202, at least 76 are OOMs.

graydon commented Sep 11, 2024

A few years later, a new datapoint, now on the Rust 1.81 release cycle: I was just looking through the crater run for PR 116088 to see why it didn't report a failure on one of my crates, and it's because... it wouldn't compile at all. It was SIGKILL'ed trying, on both master and try. So I had a look through the logs.

In the build-fail directory, I ran this:

rg --glob 'master*.txt' 'rustc --crate-name.*SIGKILL' | perl -ne 'if (m/--crate-name (\w+)/) { print "$1\n" }' | sort >not-compile.txt

This extracted every invocation of rustc from the master runs that ended in a SIGKILL, which I'm taking as a likely OOM (though of course there are other possibilities).

That produced a file with 18,264 lines -- 18k of the 121k build-fails are SIGKILLs. And they're fairly concentrated: only 2,121 unique crates were being built when the signal arrived, and of those, only 21 crates appear more than 100 times:

$ uniq -c not-compile.txt | grep '^ *[0-9][0-9][0-9]'
   2504 ash
    397 brotli
    145 cranelift_codegen
    773 gtk
    617 libsecp256k1
    495 naga
    526 nix
    142 polars_core
    142 protobuf
    174 regex_automata
    481 regex_syntax
    125 rustix
    105 rustls
    189 serenity
    209 simba
   4311 syn
    288 target_build_utils
   1052 tokio
    141 wayland_protocols
    108 x11_dl
    133 x11rb_protocol

A lot of these are fairly memory-intensive to build. I think it's likely that they are in fact all OOMs. In which case you're losing a fair amount of signal to a memory limit.

How bad / implausible would it be to raise the limit?
