cargo executed over SSH gets stuck when its connection is dropped #8714
Even with a separate build directory, one cargo process managed to get stuck when its SSH connection was dropped. pstree output:
Command line:
From FD 5, you can read this once:
Other FDs don't emit anything. Parent process backtrace:
The parent process has the same command line; here are its FDs, which are the same as the child process's:
Thanks for the report! I personally primarily work by ssh'ing into a computer somewhere else, and for years my compilations have randomly gotten stuck and I've had to ctrl-c to make any progress. I've always written it off as weirdness, but I wonder if this is related! This issue was first about concurrent Cargo processes, but that seems like it's no longer the case? You're able to get Cargo stuck just by killing an SSH connection? So are you effectively starting a build, killing the SSH connection (e.g. via kill?), and then Cargo hangs and doesn't make any progress?
The SSH connection is started from a CircleCI worker. The relevant config line is at https://github.com/deltachat/deltachat-core-rust/blob/7ddf3ba754068dc80f67c2f36433fb1c825409be/.circleci/config.yml#L145 and the SSH connection from within the bash script started by CircleCI is at https://github.com/deltachat/deltachat-core-rust/blob/7ddf3ba754068dc80f67c2f36433fb1c825409be/ci_scripts/remote_tests_rust.sh#L19. When CircleCI cancels the job, it probably kills bash and ssh on its side; I don't know exactly how it happens. Maybe it's not even a kill, but some virtual network interface being removed from the docker bridge or something like this. But cargo on the remote server keeps running.
Hm, so the last stack trace you posted has two stuck threads. One is the "jobserver helper thread", which just waits for tokens forever (thread 2), so that one is expected to be blocked. The first thread, however, is the main event loop of Cargo as it waits for messages from subprocesses to handle. The lack of any remaining threads, though, means that Cargo's not actually waiting on anything, since everything exited. The only thing I can guess is that Cargo is handling a signal from the shell that nothing else is. Would it be possible to set up some more debugging on your end? Ideally you'd set `CARGO_LOG=trace` and capture the output.
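To make that wait concrete, here is a minimal, hypothetical sketch of the shape of the loop the main thread is stuck in. The `Queue`, `pop`, and `active` names are illustrative stand-ins, not Cargo's real API; the only things taken from the thread are the `Condvar::wait_timeout_while` frame visible in the gdb backtrace later on and the fact that nothing is left to send a message.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// Hypothetical, heavily simplified model of the wait seen in the backtrace
// (cargo::util::queue::Queue::pop inside DrainState::drain_the_queue). The
// names are illustrative stand-ins, not Cargo's real API.
struct Queue<T> {
    items: Mutex<Vec<T>>,
    cond: Condvar,
}

impl<T> Queue<T> {
    fn new() -> Self {
        Queue { items: Mutex::new(Vec::new()), cond: Condvar::new() }
    }

    // Mirrors the Condvar::wait_timeout_while frame: block until a message
    // arrives or the timeout elapses.
    fn pop(&self, timeout: Duration) -> Option<T> {
        let guard = self.items.lock().unwrap();
        let (mut guard, _timed_out) = self
            .cond
            .wait_timeout_while(guard, timeout, |items| items.is_empty())
            .unwrap();
        guard.pop()
    }
}

fn main() {
    let queue: Queue<&str> = Queue::new();

    // A job was recorded as "active", but nothing was ever spawned for it, so
    // no thread will ever push its completion message onto the queue.
    let mut active: HashMap<u32, &str> = HashMap::new();
    active.insert(1, "job that was registered but never started");

    // The drain loop only exits when `active` is empty. With no senders left,
    // every pop wakes up on the timeout and goes straight back to sleep --
    // which is what an endless stream of futex syscalls looks like in strace.
    for _ in 0..3 {
        match queue.pop(Duration::from_millis(200)) {
            Some(msg) => println!("got message: {msg}"),
            None => println!("timed out; {} job(s) still marked active", active.len()),
        }
    }
    println!("(the real loop would keep going forever)");
}
```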
The hang reproduces like this:

$ clear; rm -rf target/ && cargo build 2>&1 | head -n 1
Compiling libc v0.2.86
# hangs here
$ pidof cargo
130492
$ pidof rustc
# only one cargo process and no rustc processes
$ rust-gdb --pid "$(pidof cargo)"
(gdb) thread apply all bt
Thread 2 (LWP 105180 "cargo"):
#0 0x00007fc2ac530aaa in __futex_abstimed_wait_common64 () from /lib64/libpthread.so.0
#1 0x00007fc2ac52a270 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2 0x000055a0c3a62e71 in jobserver::HelperState::for_each_request ()
#3 0x000055a0c3a6318c in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4 0x000055a0c3a63c4f in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5 0x000055a0c3aaa41a in alloc::boxed::{{impl}}::call_once<(),FnOnce<()>,alloc::alloc::Global> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/alloc/src/boxed.rs:1328
#6 alloc::boxed::{{impl}}::call_once<(),alloc::boxed::Box<FnOnce<()>, alloc::alloc::Global>,alloc::alloc::Global> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/alloc/src/boxed.rs:1328
#7 std::sys::unix::thread::{{impl}}::new::thread_start () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/thread.rs:71
#8 0x00007fc2ac524299 in start_thread () from /lib64/libpthread.so.0
#9 0x00007fc2ac300af3 in clone () from /lib64/libc.so.6
Thread 1 (LWP 105166 "cargo"):
#0 0x00007fc2ac530aaa in __futex_abstimed_wait_common64 () from /lib64/libpthread.so.0
#1 0x00007fc2ac52a584 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2 0x000055a0c3aa33c8 in std::sys::unix::condvar::Condvar::wait_timeout () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/condvar.rs:98
#3 0x000055a0c338ea77 in std::sync::condvar::Condvar::wait_timeout_while ()
#4 0x000055a0c321f821 in cargo::util::queue::Queue<T>::pop ()
#5 0x000055a0c33d5903 in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#6 0x000055a0c338ecd3 in std::panic::catch_unwind ()
#7 0x000055a0c32fdafe in crossbeam_utils::thread::scope ()
#8 0x000055a0c33d3856 in cargo::core::compiler::job_queue::JobQueue::execute ()
#9 0x000055a0c3285a15 in cargo::core::compiler::context::Context::compile ()
#10 0x000055a0c350dd01 in cargo::ops::cargo_compile::compile_ws ()
#11 0x000055a0c350da5e in cargo::ops::cargo_compile::compile ()
#12 0x000055a0c317f67d in cargo::commands::build::exec ()
#13 0x000055a0c31219f9 in cargo::cli::main ()
#14 0x000055a0c3189438 in cargo::main ()
#15 0x000055a0c3178333 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#16 0x000055a0c3178359 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::ha5e70e382477cfc4 ()
#17 0x000055a0c3aa1177 in core::ops::function::impls::{{impl}}::call_once<(),Fn<()>> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/ops/function.rs:259
#18 std::panicking::try::do_call<&Fn<()>,i32> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panicking.rs:379
#19 std::panicking::try<i32,&Fn<()>> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panicking.rs:343
#20 std::panic::catch_unwind<&Fn<()>,i32> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panic.rs:396
#21 std::rt::lang_start_internal () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/rt.rs:51
#22 0x000055a0c318bc82 in main ()

This is with:
... on Linux. Before I attached the debugger I thought it might be triggered by cargo being unable to write to stdout. My actual use case is to run cargo with its output piped to another command. But given that none of the threads are blocked on anything related to writing to stdout, I'm not sure that's the reason.
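As a side note on that hypothesis, here is a small standalone experiment (not Cargo code; it assumes a Unix-like system with `head` on PATH) showing why a broken pipe would not leave a thread blocked inside a write: Rust programs ignore SIGPIPE by default, so writing to a pipe whose reader has exited fails with `BrokenPipe` instead of hanging. A hang would therefore have to come from how that error is handled, not from the write itself.

```rust
use std::io::{ErrorKind, Write};
use std::process::{Command, Stdio};

// A small standalone experiment (not Cargo code): once the reader of a pipe
// has exited, a write to that pipe does not block, it fails with BrokenPipe.
fn main() -> std::io::Result<()> {
    // Assumes a Unix-like system with `head` available.
    let mut child = Command::new("head")
        .args(["-n", "1"])
        .stdin(Stdio::piped())
        .stdout(Stdio::null())
        .spawn()?;

    let mut pipe = child.stdin.take().expect("piped stdin");
    writeln!(pipe, "first line")?; // head reads this and exits
    child.wait()?;

    // The read end is now gone, so this write fails immediately.
    match writeln!(pipe, "second line") {
        Err(e) if e.kind() == ErrorKind::BrokenPipe => {
            eprintln!("write failed with BrokenPipe, as expected")
        }
        other => eprintln!("unexpected result: {other:?}"),
    }
    Ok(())
}
```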
I strace'd the thing along with `CARGO_LOG=trace`:

rm -rf target/ && CARGO_LOG=trace strace -fo ~/foo/cargo.log -s 9999 -- ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/cargo build 2>&1 | head -n1

Near the end, this gives:
The futex syscalls then repeat forever. So the traces it wrote there are:
@Arnavion Thanks for the info! I see what is wrong, and I'll post a fix shortly. There is protection for handling errors in this case, and I missed one.
Posted #9201 to resolve the hang error. I'm skeptical it addresses the issue as originally raised in this thread, but it is hard to say without a specific reproduction.
Fix hang on broken stderr. If stderr is closed and cargo tries to print a status message such as "Compiling", cargo can get into a hung state. This is because the `DrainState::run` function would insert the job into the `active` queue, and then return an error without starting the job. Since the job isn't started, there is nothing to remove it from the `active` queue, and thus cargo hangs forever waiting for it to finish. The solution is to move the call to `note_working_on` earlier, before the job is placed into the `active` queue. This addresses the issue noted in #8714 (comment).
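For readers skimming the thread, here is a simplified paraphrase of that bug and fix as a runnable sketch. The types and function bodies are made up for illustration; only the ordering problem (register the job as active, then fail while printing the status line) and the shape of the fix are taken from the description above.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

// Made-up types for illustration; the real code lives in
// cargo::core::compiler::job_queue and is far more involved.
type JobId = u32;

// Printing "Compiling <unit>" is fallible: it returns an error if stderr is a
// broken pipe (for example, `2>&1 | head -n 1` whose reader already exited).
fn note_working_on(unit: &str) -> io::Result<()> {
    writeln!(io::stderr(), "   Compiling {unit}")
}

fn spawn_job(_id: JobId, _unit: &str) {
    // In real Cargo this would start rustc or a build script on another
    // thread, which later reports completion back to the drain loop.
}

// Buggy ordering: the job is registered as active first. If the status line
// fails, the error propagates, the job is never started, and nothing ever
// removes it from `active` -- so the drain loop waits on it forever.
fn run_buggy(active: &mut HashMap<JobId, String>, id: JobId, unit: &str) -> io::Result<()> {
    active.insert(id, unit.to_string());
    note_working_on(unit)?; // an error here leaks the `active` entry
    spawn_job(id, unit);
    Ok(())
}

// Fixed ordering (the shape of the #9201 change): do the fallible status
// message first, so an error leaves no phantom entry behind.
fn run_fixed(active: &mut HashMap<JobId, String>, id: JobId, unit: &str) -> io::Result<()> {
    note_working_on(unit)?;
    active.insert(id, unit.to_string());
    spawn_job(id, unit);
    Ok(())
}

fn main() -> io::Result<()> {
    let mut active = HashMap::new();
    run_fixed(&mut active, 1, "libc v0.2.86")?;
    run_buggy(&mut active, 2, "another-unit v0.1.0")?;
    Ok(())
}
```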
I run cargo in Concourse CI now, and sometimes it gets stuck without using CPU if nobody reads the logs. If I keep watching the log, it does not get stuck. But overall, maybe it makes sense to consistently go through all read/write operations that call read/write syscalls and make sure all of them have some sort of timeout. E.g. if cargo gets stuck trying to write to stdout for 120 seconds, it is better for it to crash with an error. The same goes for the pipes that carry data between processes: if a pipe stays full for too long, it is better to crash and then go fix the size of the buffer.
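For illustration only, here is one way that suggestion could look; this is a sketch, not anything Cargo implements. The blocking write is moved to a helper thread and the caller waits on a channel with a deadline, turning a silent stall into a visible error (the helper thread itself still stays blocked, which is the main limitation of this approach).

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch of "crash instead of hanging": do the potentially blocking write on
// a helper thread and give up if it has not finished within the deadline.
fn write_with_timeout(data: Vec<u8>, timeout: Duration) -> io::Result<()> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If stdout is a full pipe with no reader, this is where a plain
        // write would block forever.
        let result = io::stdout().write_all(&data).and_then(|()| io::stdout().flush());
        let _ = tx.send(result); // the receiver may already have given up
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => result,
        Err(_) => Err(io::Error::new(
            io::ErrorKind::TimedOut,
            "stdout write did not complete in time; is anything reading the logs?",
        )),
    }
}

fn main() {
    if let Err(e) = write_with_timeout(b"   Compiling ...\n".to_vec(), Duration::from_secs(120)) {
        eprintln!("error: {e}");
        std::process::exit(1);
    }
}
```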
This is an old issue; if it is happening again, please open a new issue.
It still looks like the same issue to me: if nobody reads the logs, cargo gets stuck. This issue is not closed, so I don't think there is a need for a new issue like this one.
CircleCI at https://github.com/deltachat/deltachat-core-rust/ is set up to run this process remotely via SSH on another CI server:
with a target directory set via `CARGO_TARGET_DIR` per branch. You can see the script at https://github.com/deltachat/deltachat-core-rust/blob/master/ci_scripts/remote_tests_rust.sh for reference.
When the CI job is cancelled, e.g. because someone force-pushed to the branch, the SSH connection to the remote CI server is closed.
`cargo` keeps running, and holds the lock.

Edit: read the second post; the problem appears even after switching to per-build target directories, so it should be easier to understand what happens. There is only one `cargo` command executed in the target directory, yet it gets stuck.

Then another CI job starts another cargo process 30694, and it deadlocks too, holding two locks:
At this point, the locks look like this:
Then I killed 30694 to release the lock /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/x86_64-unknown-linux-gnu/debug/.cargo-lock
After that, the first cargo process 9697 opened the second lock file:
I connected to 9697 and collected backtraces:
So it looks like two concurrent cargo processes, each trying to take both locks, locked each other out. But even after killing the second process, the first process remains stuck. I have collected backtraces; maybe it is possible to infer from them the state of this process and why it can't proceed.
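As an aside, here is a minimal in-process analogy of that hypothesis, with threads and mutexes standing in for the two cargo processes and the two lock files. It only illustrates the lock-ordering pattern; it is not a claim about how Cargo's file locking is implemented.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// In-process analogy of the two-lock hypothesis. Each side takes one lock and
// then waits for the other's, so neither can make progress. Run it and it
// hangs on purpose.
fn main() {
    let lock_a = Arc::new(Mutex::new(())); // e.g. the target dir's .cargo-lock
    let lock_b = Arc::new(Mutex::new(())); // the other lock mentioned above

    let (a1, b1) = (Arc::clone(&lock_a), Arc::clone(&lock_b));
    let first = thread::spawn(move || {
        let _a = a1.lock().unwrap();
        thread::sleep(Duration::from_millis(100)); // let the other side grab its lock
        println!("first: holding A, waiting for B...");
        let _b = b1.lock().unwrap(); // blocks forever
    });

    let (a2, b2) = (Arc::clone(&lock_a), Arc::clone(&lock_b));
    let second = thread::spawn(move || {
        let _b = b2.lock().unwrap();
        thread::sleep(Duration::from_millis(100));
        println!("second: holding B, waiting for A...");
        let _a = a2.lock().unwrap(); // blocks forever
    });

    // Neither join returns: the classic lock-ordering (ABBA) deadlock. Taking
    // the locks in the same order on both sides avoids it. Note this analogy
    // does not explain why the first process stayed stuck after the second was
    // killed: OS file locks are released when their holder dies, so the
    // surviving process was likely waiting on something else.
    let _ = first.join();
    let _ = second.join();
}
```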
I have found a similar closed issue #7200