Cargo is entering an infinite loop #7840

Closed
ehuss opened this issue Jan 28, 2020 · 16 comments · Fixed by #7844
Labels
C-bug (Category: bug), P-high (Priority: High)

Comments

@ehuss
Contributor

ehuss commented Jan 28, 2020

Cargo is randomly entering an infinite loop in some rare cases. Noticed in CI.

I've been able to reproduce it locally by putting Cargo's own test suite into a loop (while cargo test; do :; done); it hangs after about 30 minutes on my machine. It seems to randomly affect different tests. Attaching a debugger, I see it caught in this loop, where dropping an mpsc receiver seems to be confused.

I strongly suspect this is caused by some change in #7731. Perhaps there is some odd interaction between crossbeam's scope() and mpsc? The only relevant change that I see is that DrainState is now moved into the crossbeam scope, and thus dropped inside the scope, whereas previously the mpsc channels residing inside JobQueue were dropped much later.
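
To illustrate the drop-order difference, here is a minimal sketch (illustrative types only, not Cargo's actual DrainState/JobQueue code), assuming crossbeam-utils 0.7's scoped-thread API: the receiver moved into the scope is dropped inside the scope, while a scoped worker thread holds a sender.

use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    crossbeam_utils::thread::scope(|s| {
        // Moving the receiver into the scope means it is dropped *inside*
        // the scope, right after the drain loop below finishes.
        let rx = rx;
        s.spawn(move |_| {
            let _ = tx.send(1); // a worker thread sends on the channel
        });
        for _msg in rx.iter() {}
        // `rx` is dropped here, before `scope` returns.
    })
    .unwrap();
    // Previously the channels lived in a longer-lived struct and were
    // dropped only after the scope had fully exited.
}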

@Mark-Simulacrum Are you very familiar with mpsc or crossbeam's scope? mpsc seems wildly complex, and I don't know where to start trying to figure out what's wrong.

I might try removing FinishOnDrop and see if I can repro again, as that seems a little fishy.

ehuss added the C-bug (Category: bug) and P-high (Priority: High) labels on Jan 28, 2020
@dtolnay
Member

dtolnay commented Jan 28, 2020

cargo clean && cargo check at the repo root of https://github.com/dtolnay/cxx/tree/015c5e85a543c0623ca76aad547ff48d6ba5cefc hits this on my machine every time.

Could we get this issue pinned until a fix is out in nightly?

@alexcrichton
Member

I appear to be getting different behavior, although it's perhaps somewhat related; I'm not sure. Using @dtolnay's repository I can reproduce with cargo +nightly clean && cargo +nightly check pretty regularly, but the problem I'm seeing is infinite memory consumption rather than a hang in an mpsc channel.

I'm unable to reproduce with a locally-built binary with debuginfo (or without debuginfo). It's also not clear to me what's going on with the infinite memory allocation. Using strace, it looks like Cargo isn't even executing any system calls (so it's not blocked on locks or anything like that).

The backtrace always looks like this from what I can tell:

(gdb) bt
#0  0x000055f7db5eeda3 in cargo::util::progress::State::tick ()
#1  0x000055f7db6329b5 in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#2  0x000055f7db4bd73f in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
#3  0x000055f7db50c35b in std::panicking::try::do_call ()
#4  0x000055f7dbc9a287 in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:86
#5  0x000055f7db464616 in crossbeam_utils::thread::scope ()
#6  0x000055f7db630821 in cargo::core::compiler::job_queue::JobQueue::execute ()
#7  0x000055f7db4a1d57 in cargo::core::compiler::context::Context::compile ()
#8  0x000055f7db4de327 in cargo::ops::cargo_compile::compile_ws ()
#9  0x000055f7db4d8f23 in cargo::ops::cargo_compile::compile ()
#10 0x000055f7db449175 in cargo::commands::check::exec ()
#11 0x000055f7db3fe90e in cargo::cli::main ()
#12 0x000055f7db4431d6 in cargo::main ()
#13 0x000055f7db430aa3 in std::rt::lang_start::{{closure}} ()
#14 0x000055f7dbc8ee13 in std::rt::lang_start_internal::{{closure}} () at src/libstd/rt.rs:52
#15 std::panicking::try::do_call () at src/libstd/panicking.rs:305
#16 0x000055f7dbc9a287 in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:86
#17 0x000055f7dbc8f8b0 in std::panicking::try () at src/libstd/panicking.rs:281
#18 std::panic::catch_unwind () at src/libstd/panic.rs:394
#19 std::rt::lang_start_internal () at src/libstd/rt.rs:51
#20 0x000055f7db4457b2 in main ()

which makes me think it's stuck here.

... and then it turns out I can reproduce this hang if I build the exact commit of nightly locally. If I build master locally, I cannot reproduce it. This means (I think?) that the hang I am personally seeing on @dtolnay's project was fixed in b68b097...c326fcb, although I'm not sure how.

Is it possible there are two "infinite loops"? One on nightly which has been fixed (somehow), which @dtolnay is running into, and another which @ehuss is running into that isn't fixed and is harder to reproduce?

@alexcrichton
Member

Ok, I can say for certain now what I am seeing locally.

Nightly, as of this red-hot second, is b68b097. Master, however, is c326fcb. Notably, this means that nightly does not have #7829. It turns out that PR fixes not only an assertion but also an "infinite allocation".

Inside this function, the bug on nightly is that sometimes cur > max. That means pct is greater than one, which means that, due to this line, hashes > display_width. Finally, that means this computation overflows, causing an "infinitely sized string" to be allocated. Naturally that fails, but it takes a long time to fail because it first exhausts all memory. Hence the "infinite loop" I was seeing.
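
For illustration, a minimal sketch of that arithmetic with hypothetical names (this is not Cargo's actual progress code): once the hash count exceeds the display width, the unsigned subtraction wraps to a value near usize::MAX, so the string allocation that follows exhausts memory before it can fail.

fn main() {
    let (cur, max, display_width): (usize, usize, usize) = (120, 100, 40);
    let pct = cur as f64 / max as f64;                  // 1.2, i.e. greater than one
    let hashes = (pct * display_width as f64) as usize; // 48, i.e. > display_width
    // In a release build a plain `display_width - hashes` wraps rather than
    // panicking; `wrapping_sub` just makes that behavior explicit here.
    let spaces = display_width.wrapping_sub(hashes);
    println!("would try to build a string of {} spaces", spaces);
    // `" ".repeat(spaces)` at this size would try to allocate ~usize::MAX bytes.
}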

So I can say with confidence what the hang I personally saw was. Now I'd like to clarify.

If it is as I suspect, then there are indeed two hangs, one fixed by #7829 but not deployed to nightlies yet, and another that @ehuss is describing here which looks far more nefarious.

@dtolnay
Member

dtolnay commented Jan 28, 2020

Yes, that's right! Sorry for hijacking the thread. It looks like rust-lang/rust#68579 has already landed the submodule update to pull in #7829, so tomorrow's nightly will fix the one I am hitting. Thanks!

@alexcrichton
Member

No worries @dtolnay, glad it's already fixed for you in any case :)

Oh, it looks like @ehuss already found the infinite-memory loop on Discord yesterday, so I do indeed suspect that this issue is separate.

@ehuss I've been running cargo test in a loop locally for a few hours now and haven't managed to reproduce :(. Given that the test logs show Windows/macOS as failing, I suspect that Linux can probably reproduce it eventually too...

@Mark-Simulacrum
Member

I can try to dig in later today, but if we aren't able to track it down quickly we'll probably want to back the change out (at least from beta Cargo, which I think these patches slipped into?).

Once I have a charger around (and don't mind spinning the fans up), I can try to reproduce on macOS and Linux as well.

@alexcrichton
Member

OK, after leaving it running for quite some time I've found a similar hang in Cargo (on Linux).

There are two threads; one of them is the jobserver thread, which is known to be blocked. The other looks like this:

#0  0x000055d1056ce4f6 in std::sync::mpsc::shared::Packet<T>::drop_port (self=0x55d10858d670)
    at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/sync/mpsc/shared.rs:383
#1  0x000055d1056958b5 in <std::sync::mpsc::Receiver<T> as core::ops::drop::Drop>::drop (self=0x7fff85846998)
    at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/sync/mpsc/mod.rs:1552
#2  0x000055d1054d26d5 in core::ptr::real_drop_in_place () at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libcore/ptr/mod.rs:175
#3  0x000055d1054e48e3 in core::ptr::real_drop_in_place () at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libcore/ptr/mod.rs:175
#4  0x000055d10528967a in cargo::core::compiler::job_queue::DrainState::drain_the_queue (self=..., cx=0x7fff85854898, plan=0x7fff8584bdf8,
    scope=0x7fff85849ad8, jobserver_helper=0x7fff85847220) at src/cargo/core/compiler/job_queue.rs:696
#5  0x000055d105686921 in cargo::core::compiler::job_queue::JobQueue::execute::{{closure}} (scope=0x7fff85849ad8)
    at src/cargo/core/compiler/job_queue.rs:388
#6  0x000055d10565ca15 in crossbeam_utils::thread::scope::{{closure}} ()
    at /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-utils-0.7.0/src/thread.rs:161
#7  0x000055d1056f48c6 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=..., _args=())
    at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panic.rs:317
#8  0x000055d10573b0a3 in std::panicking::try::do_call (data=0x7fff858481f8 "\001")
    at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panicking.rs:287
#9  0x000055d1060e48ba in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:78
#10 0x000055d10573ac0e in std::panicking::try (f=...) at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panicking.rs:265
#11 0x000055d1056f4aae in std::panic::catch_unwind (f=...) at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/panic.rs:396
#12 0x000055d10565c2ab in crossbeam_utils::thread::scope (f=...)
    at /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-utils-0.7.0/src/thread.rs:161
#13 0x000055d105283961 in cargo::core::compiler::job_queue::JobQueue::execute (self=..., cx=0x7fff85854898, plan=0x7fff8584bdf8)
    at src/cargo/core/compiler/job_queue.rs:388
#14 0x000055d105122834 in cargo::core::compiler::context::Context::compile (self=..., units=..., export_dir=..., exec=0x7fff858556a0)
    at src/cargo/core/compiler/context/mod.rs:162
#15 0x000055d104d5da3b in cargo::ops::cargo_compile::compile_ws (ws=0x7fff85857b80, options=0x7fff858588f0, exec=0x7fff858556a0)
    at src/cargo/ops/cargo_compile.rs:469
#16 0x000055d104d5a645 in cargo::ops::cargo_compile::compile_with_exec (ws=0x7fff85857b80, options=0x7fff858588f0, exec=0x7fff858556a0)
    at src/cargo/ops/cargo_compile.rs:259
#17 0x000055d104d5a502 in cargo::ops::cargo_compile::compile (ws=0x7fff85857b80, options=0x7fff858588f0) at src/cargo/ops/cargo_compile.rs:248
#18 0x000055d104bc4afd in cargo::ops::cargo_doc::doc (ws=0x7fff85857b80, options=0x7fff858588f0) at src/cargo/ops/cargo_doc.rs:65
#19 0x000055d1046e2584 in cargo::commands::doc::exec (config=0x7fff8585a828, args=0x55d10847c4a8) at src/bin/cargo/commands/doc.rs:64
#20 0x000055d10472ec21 in cargo::cli::execute_subcommand (config=0x7fff8585a828, cmd=..., subcommand_args=0x55d10847c4a8) at src/bin/cargo/cli.rs:197
#21 0x000055d10472cc27 in cargo::cli::main (config=0x7fff8585a828) at src/bin/cargo/cli.rs:107
#22 0x000055d104737a10 in cargo::main () at src/bin/cargo/main.rs:39

Line numbers are relative to c326fcb

As the original author of the mpsc module in the standard library, I should know what's going on here. I... don't, and I actually suspect that this is just a preexisting bug in the standard library. My guess is that there's a race when tokens are sent while a channel is being dropped: previously Cargo cleaned up everything before dropping tokens, and the new order you pointed out, @ehuss, is probably triggering that preexisting bug.
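
If it helps to picture the suspected shape of that race, here is a heavily hedged sketch (not a reproducer for this bug, and the real channel in Cargo carries jobserver tokens rather than integers): several senders keep sending on a std::sync::mpsc channel while the receiver is dropped concurrently.

use std::sync::mpsc;
use std::thread;

fn main() {
    for _ in 0..1_000 {
        let (tx, rx) = mpsc::channel::<u32>();
        let handles: Vec<_> = (0..2)
            .map(|_| {
                // Cloning the sender upgrades the channel to its shared flavor,
                // which is where the backtrace above points (shared::drop_port).
                let tx = tx.clone();
                thread::spawn(move || {
                    for i in 0..100 {
                        // Errors (receiver already gone) are expected and ignored.
                        let _ = tx.send(i);
                    }
                })
            })
            .collect();
        drop(tx);
        drop(rx); // dropping the port while sends may still be in flight
        for h in handles {
            h.join().unwrap();
        }
    }
}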

I tried to do a bit of debugging in the module itself, but honestly this code is so far out of cache and so complicated that it's probably a lost cause.

I would fix this in one of two ways:

  • One would be to try to reorder destructors to not trigger this bug. I think @ehuss is right that we just need to drop things in the original order.
  • Another would be to switch to another queue altogether. I'd probably choose https://crates.io/crates/crossbeam-queue, but given @ehuss's change to SyncSender, I'm not sure that crossbeam-queue has an analogue.

@Mark-Simulacrum
Member

I believe https://docs.rs/crossbeam-channel/0.4.0/crossbeam_channel/ is the "mpsc" replacement in crossbeam, and it has the same bounded/unbounded API.

I think that's probably the better solution, rather than trying to get destructor ordering "correct" -- particularly since none of us actually knows why std's impl hangs, so beyond restoring the prior behavior we probably wouldn't be able to make much headway in making sure it's actually correct.
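
For reference, a minimal sketch of the crossbeam_channel API being proposed (crossbeam-channel 0.4, as linked above; illustrative code, not Cargo's):

use crossbeam_channel::{bounded, unbounded};

fn main() {
    // Unbounded channel, analogous to std::sync::mpsc::channel().
    let (tx, rx) = unbounded::<u32>();
    tx.send(1).unwrap();
    assert_eq!(rx.recv().unwrap(), 1);

    // Bounded channel, analogous to std::sync::mpsc::sync_channel(cap);
    // send() blocks once the buffer already holds `cap` messages.
    let (btx, brx) = bounded::<u32>(2);
    btx.send(1).unwrap();
    btx.send(2).unwrap();
    assert_eq!(brx.recv().unwrap(), 1);
}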

@ehuss
Contributor Author

ehuss commented Jan 28, 2020

Yeah, sorry for the confusion: there are two (independent) infinite loops. I tried to get the progress-bar fix in before nightly was built, but rust-lang/rust hit this bug, we lost 4 hours, and we missed the window.

Yeah, I think we'll need to back it out on beta. I was hoping to avoid this by delaying the update until after the beta branch, but someone else updated Cargo.

I'm glad you were able to catch it on Linux; I was never able to repro there. It is probably very sensitive to timing and threading behavior.

I don't think crossbeam_channel would be a good choice as-is. I was testing it a few days ago for #7838, and found the performance to be substantially worse.

@sourcefrog
Contributor

Sounds like you have it under control, but in case it helps, here is a sample taken on macOS.
Sample of cargo.txt

@mati865
Contributor

mati865 commented Jan 28, 2020

FWIW Cargo has been updated by rust-lang/rust#68407 in order to include #7826.

Beta promotion has some problems (rust-lang/rust#68595), so maybe there is still time to fix it.

@spastorino
Member

spastorino commented Jan 28, 2020

I did what @Mark-Simulacrum suggested, but I'm unsure whether it fixes the problem you're seeing.

@nagisa
Member

nagisa commented Jan 28, 2020

I can reproduce this in 100% of cases with the current nightly. Once it hangs, it quickly consumes all the memory and gets killed by the OOM killer. I can push up the crate with modifications if y'all still need to reproduce it.

@mati865
Contributor

mati865 commented Jan 28, 2020

It affects the current nightly, but the fix will be included in the next nightly.

@asomers

asomers commented Jan 28, 2020

FWIW I can reliably and quickly reproduce the infinite allocation bug on FreeBSD, just by trying to build almost any crate. Nix, for example, triggers the bug, but libc does not.

@huxi

huxi commented Jan 29, 2020

Latest nightly rustc 1.42.0-nightly (3761dcd34 2020-01-28) fixed the problem for me.

alexcrichton added a commit to alexcrichton/cargo that referenced this issue Jan 29, 2020
We don't need the complexity of most channels since this is not a
performance sensitive part of Cargo, nor is it likely to be so any time
soon. Coupled with recent bugs (rust-lang#7840) that we believe are in `std::sync::mpsc`,
let's just not use that and use a custom queue type locally which should
be amenable to a blocking push soon too.
bors added a commit that referenced this issue Jan 30, 2020
Swap std::sync::mpsc channel with crossbeam_channel

Hoping it closes #7840

r? @Mark-Simulacrum
bors closed this as completed in 20ddff8 on Jan 30, 2020
bors added a commit that referenced this issue Jan 31, 2020
[beta] Revert scalable jobserver.

This reverts #7731, #7829, and #7836.

This should prevent #7840 on beta. I feel more comfortable reverting this than merging #7844, since the impact is still unknown.
ehuss pushed a commit to ehuss/cargo that referenced this issue Mar 8, 2020
We don't need the complexity of most channels since this is not a
performance sensitive part of Cargo, nor is it likely to be so any time
soon. Coupled with recent bugs (rust-lang#7840) that we believe are in `std::sync::mpsc`,
let's just not use that and use a custom queue type locally which should
be amenable to a blocking push soon too.