Cargo is entering an infinite loop #7840
Could we get this issue pinned until a fix is out in nightly?
I appear to be getting different behavior, although it's perhaps somewhat related; I'm not sure. Using @dtolnay's repository I can reproduce with
I'm unable to reproduce with a locally-built binary with debuginfo (or without debuginfo). It's also not clear to me what's going on with the infinite memory allocation. Using
The backtrace always looks like this from what I can tell:
making me think it's stuck here ... and then it turns out I can reproduce this hang if I build the exact commit of nightly locally. If I build
Is it possible there are two "infinite loops", maybe? One on nightly which has been fixed (somehow), which @dtolnay is running into, and another which @ehuss is running into, which isn't fixed and is harder to reproduce?
Ok, I can say for certain now what I am seeing locally. Nightly, as of this red-hot second, is b68b097. Master, however, is c326fcb. Notably, this means that nightly does not have #7829. It turns out that PR fixes not only an assertion but also an "infinite allocation". Inside of this function the bug on nightly is that sometimes
So I can say with confidence what the hang I personally saw was. Now I'd like to clarify.
If it is as I suspect, then there are indeed two hangs: one fixed by #7829 but not deployed to nightlies yet, and another that @ehuss is describing here, which looks far more nefarious.
Yes, that's right! Sorry for hijacking the thread. It looks like rust-lang/rust#68579 has already landed the submodule update to pull in #7829, so tomorrow's nightly will have fixed the one I am hitting. Thanks!
No worries @dtolnay, glad it's already fixed for you in any case :) Oh, it looks like @ehuss already found the infinite loop with memory on Discord yesterday, so I do indeed suspect that this is separate. @ehuss I've been running
I can try to dig in later today, but if we aren't able to track it down quickly we'll probably want to back the change out (at least from beta Cargo, which I think these patches slipped into?). Once I have a charger around I can try to reproduce on macOS and Linux as well (I don't mind spinning the fans up).
Ok, after leaving it running for quite some time I've found a similar hang in Cargo (on Linux). There are two threads; one of them is the jobserver thread, though, which is known to be blocked. The other looks like this:
Line numbers are relative to c326fcb. As the original author of the
I tried to do a bit of debugging in the module itself, but honestly this code is so far out of cache and so complicated that it's probably a lost cause. I would fix this in one of two ways:
I believe https://docs.rs/crossbeam-channel/0.4.0/crossbeam_channel/ is the "mpsc" replacement in crossbeam, and it has the same bounded/unbounded API. I think that's probably the better solution rather than trying to get destructor ordering "correct" -- particularly as I believe none of us actually knows why std's impl hangs, so it seems likely that, other than restoring prior behavior, we probably wouldn't be able to make much headway in making sure it's actually correct.
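For illustration, here is a minimal sketch of what swapping `std::sync::mpsc` for `crossbeam_channel`'s unbounded API could look like; the `Message` enum and the function names are hypothetical stand-ins, not Cargo's actual job-queue types.

```rust
// Hedged sketch: an unbounded crossbeam channel used the same way an
// std::sync::mpsc channel would be. `Message`, `spawn_worker`, and `drain`
// are illustrative names, not Cargo internals.
use crossbeam_channel::{unbounded, Receiver, Sender};

enum Message {
    Stdout(String),
    Finished(u32),
}

fn spawn_worker(tx: Sender<Message>, id: u32) {
    std::thread::spawn(move || {
        // Workers report progress and completion back to the coordinating thread.
        tx.send(Message::Stdout(format!("job {} running", id))).unwrap();
        tx.send(Message::Finished(id)).unwrap();
    });
}

fn drain(rx: Receiver<Message>) {
    // recv() returns Err once every Sender has been dropped, so this loop
    // terminates when all workers are done.
    while let Ok(msg) = rx.recv() {
        match msg {
            Message::Stdout(line) => println!("{}", line),
            Message::Finished(id) => println!("job {} done", id),
        }
    }
}

fn main() {
    let (tx, rx) = unbounded();
    for id in 0..4 {
        spawn_worker(tx.clone(), id);
    }
    drop(tx); // drop the original sender so drain() can observe disconnection
    drain(rx);
}
```

The `bounded(n)` constructor is the other half of the API mentioned above, for cases where backpressure is wanted.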
Yea, sorry for the confusion, there are two (independent) infinite loops. I tried to get the fix in for the progress bar before nightly was built, but rust-lang/rust hit this bug and we lost 4 hours and missed the window. Yea, I think we'll need to back it out on beta. I was hoping to avoid this by delaying the update until after the beta branch, but someone else updated Cargo. I'm glad you were able to catch it on Linux; I was never able to repro on Linux myself. It is probably very sensitive to timing and threading behavior. I don't think crossbeam_channel would be a good choice as-is. I was testing it a few days ago for #7838, and found the performance to be substantially worse.
Sounds like you have it under control, but in case it helps, here is a sample on macOS.
FWIW, Cargo has been updated by rust-lang/rust#68407 in order to include #7826. Beta promotion has some problems (rust-lang/rust#68595), so maybe there is still time to fix it.
Did what @Mark-Simulacrum suggested, but I'm unsure if it fixes the problem you're seeing.
I can reproduce this in 100% of the cases with the current nightly. Once it hangs, it quickly consumes all the memory and gets killed by the OOM killer. I can push up the crate with modifications if y'all still need to reproduce it.
It affects the current nightly, but the fix will be included in the next nightly.
FWIW I can reliably and quickly reproduce the infinite allocation bug on FreeBSD, just by trying to build almost any crate. Nix, for example, triggers the bug, but libc does not.
Latest nightly
We don't need the complexity of most channels since this is not a performance-sensitive part of Cargo, nor is it likely to be so any time soon. Coupled with recent bugs (rust-lang#7840) that we believe are in `std::sync::mpsc`, let's just not use it and instead use a custom queue type locally, which should be amenable to a blocking push soon too.
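As a rough illustration of that direction, here is a minimal sketch of a local queue built on `Mutex` plus `Condvar`; this is an assumption about what "a custom queue type" could look like, not Cargo's actual implementation.

```rust
// Hedged sketch of a minimal local queue: non-blocking push, blocking pop.
// Illustrative only; not Cargo's real Queue type.
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

pub struct Queue<T> {
    items: Mutex<VecDeque<T>>,
    condvar: Condvar,
}

impl<T> Queue<T> {
    pub fn new() -> Queue<T> {
        Queue {
            items: Mutex::new(VecDeque::new()),
            condvar: Condvar::new(),
        }
    }

    /// Push never blocks; the queue is unbounded.
    pub fn push(&self, item: T) {
        self.items.lock().unwrap().push_back(item);
        self.condvar.notify_one();
    }

    /// Pop blocks until an item is available.
    pub fn pop(&self) -> T {
        let mut items = self.items.lock().unwrap();
        loop {
            match items.pop_front() {
                Some(item) => return item,
                // Release the lock and wait for a push to signal us.
                None => items = self.condvar.wait(items).unwrap(),
            }
        }
    }
}
```

The surface area is deliberately tiny, and a bounded or blocking push could later be layered on with the same `Condvar`, which is roughly what the commit message hints at.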
Swap std::sync::mpsc channel with crossbeam_channel. Hoping it closes #7840. r? @Mark-Simulacrum
Cargo is randomly entering an infinite loop in some rare cases. Noticed in CI:
I've been able to reproduce it locally by putting Cargo's own test suite into a loop (`while cargo test; do :; done`; it hits after about 30 minutes on my machine). It seems to randomly affect different tests. Attaching with a debugger, I see it caught in this loop, where dropping an mpsc receiver seems to be confused. I strongly suspect this is caused by some change in #7731. Perhaps there is some odd interaction between crossbeam's scope() and mpsc? The only relevant change that I see is that `DrainState` is now moved into the crossbeam scope, and thus dropped inside the scope, whereas previously the mpsc channels residing inside `JobQueue` were dropped much later. @Mark-Simulacrum Are you very familiar with mpsc or crossbeam's scope? mpsc seems wildly complex, and I don't know where to start trying to figure out what's wrong.
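To make the suspected drop-order difference concrete, here is a minimal sketch (assumed names; `State` stands in for something like `DrainState` holding an mpsc receiver, and this is not Cargo's actual code): a value moved into a crossbeam scope is dropped inside the scope, whereas a value kept outside it lives on until after the scope has joined its threads.

```rust
// Hedged sketch of the drop-order change: moving a receiver-holding value
// into crossbeam's scope means it is dropped before the scope returns.
use std::sync::mpsc;

struct State {
    rx: mpsc::Receiver<u32>,
}

impl Drop for State {
    fn drop(&mut self) {
        println!("State dropped (receiver goes away here)");
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let state = State { rx };

    crossbeam_utils::thread::scope(|scope| {
        scope.spawn(move |_| {
            // The worker may still be sending when the receiver is gone,
            // so ignore the potential SendError.
            let _ = tx.send(1);
        });
        // `state` was moved into this closure, so it is dropped here,
        // inside the scope, before the spawned threads are joined.
        drop(state);
        println!("inside scope, state already gone");
    })
    .unwrap();

    println!("scope joined");
    // Before the change, the receiver lived out here and was dropped only now.
}
```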
I might try removing `FinishOnDrop` and see if I can repro again, as that seems a little fishy.