Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Unix-like OSes, std::process::Command::spawn() should not use assert! from the child process after fork() #73894

Closed
ghost opened this issue Jun 30, 2020 · 29 comments · Fixed by #77328
Labels
C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@ghost
Copy link

ghost commented Jun 30, 2020

On Unix-like OSes, std::process::Command::spawn() uses an assert! to check if the pipe I/O succeeded. If that assert! fails, it will call panic!, which is not signal-safe. It's better to use if ....is_err() { std::intrinsics::abort() } there.
libc::abort() (std::process::abort()) shouldn't be used as a replacement, as, at least in glibc, it's not signal-safe[1][2], although all of the C standard, the C++ standard, the POSIX standard, and Linux man-pages say it should be.

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/abort.c;h=df98782dd7ea6c1476184a365bd9f3954f481a18;hb=refs/heads/master#l54
[2] https://www.gnu.org/software/libc/manual/html_node/Aborting-a-Program.html#Aborting-a-Program

@jonas-schievink jonas-schievink added C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 30, 2020
@ghost

This comment has been minimized.

@ghost

This comment has been minimized.

@ghost ghost closed this as completed Jul 17, 2020
@mati865
Copy link
Contributor

mati865 commented Jul 17, 2020

cc @joshtriplett

@joshtriplett joshtriplett added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 20, 2020
@joshtriplett
Copy link
Member

This is a good catch. We certainly try to be signal-safe in such contexts, and failing to do so is a bug. Sorry to see that this didn't get a response. I'm reopening this, as it remains a bug.

@joshtriplett joshtriplett reopened this Jul 20, 2020
@ghost
Copy link
Author

ghost commented Jul 20, 2020

I discovered that if libc::abort() is not signal-safe (in glibc), then rtabort!() is not signal-safe, so the signal handlers in the standard library that call rtabort!() is actually not signal-safe:

unsafe extern "C" fn signal_handler(
signum: libc::c_int,
info: *mut libc::siginfo_t,
_data: *mut libc::c_void,
) {
use crate::sys_common::util::report_overflow;
let guard = thread_info::stack_guard().unwrap_or(0..0);
let addr = siginfo_si_addr(info);
// If the faulting address is within the guard page, then we print a
// message saying so and abort.
if guard.start <= addr && addr < guard.end {
report_overflow();
rtabort!("stack overflow");

Maybe I should open a new issue or report the abort(3) bug to glibc...

@joshtriplett
Copy link
Member

joshtriplett commented Jul 21, 2020 via email

@ghost
Copy link
Author

ghost commented Jul 22, 2020

On Mon, Jul 20, 2020 at 01:32:22AM -0700, hyd-dev wrote:
I just realized if libc::abort() is not signal-safe (in glibc), then rtabort!() is not signal-safe, so the signal handlers in the standard library that call rtabort!() is actually not signal-safe:

unsafe extern "C" fn signal_handler(
signum: libc::c_int,
info: *mut libc::siginfo_t,
_data: *mut libc::c_void,
) {
use crate::sys_common::util::report_overflow;
let guard = thread_info::stack_guard().unwrap_or(0..0);
let addr = siginfo_si_addr(info);
// If the faulting address is within the guard page, then we print a
// message saying so and abort.
if guard.start <= addr && addr < guard.end {
report_overflow();
rtabort!("stack overflow");
Maybe I should open a new issue or report the abort(3) bug to glibc.

signal-safety(7) lists abort(3) as async-signal-safe. If it isn't in practice async-signal-safe in glibc, that's a bug and you should report it.

Reported as https://sourceware.org/bugzilla/show_bug.cgi?id=26275.

@ghost
Copy link
Author

ghost commented Jul 22, 2020

So I think libc::abort() can be used, and if ... { something_like_rtabort!() } is suitable as a replacement. WDYT?

@ghost
Copy link
Author

ghost commented Aug 11, 2020

So I think libc::abort() can be used, and if ... { something_like_rtabort!() } is suitable as a replacement. WDYT?

@joshtriplett Should this be fixed like that?

@joshtriplett
Copy link
Member

Given the above research, I don't think libc::abort() is safe to use either.

@joshtriplett
Copy link
Member

Using std::intrinsics::abort() seems like a potentially reasonable approach here. (std::process::abort() just calls libc::abort().)

@Amanieu
Copy link
Member

Amanieu commented Aug 11, 2020

I would prefer just using rtabort! since that is precisely what it was designed for.

@cuviper
Copy link
Member

cuviper commented Aug 11, 2020

That call chain is rtabort! -> sys_common::util::abort() -> sys::abort_internal(), which on unix calls libc::abort().

// On Unix-like platforms, libc::abort will unregister signal handlers
// including the SIGABRT handler, preventing the abort from being blocked, and
// fclose streams, with the side effect of flushing them so libc buffered
// output will be printed. Additionally the shell will generally print a more
// understandable error message like "Abort trap" rather than "Illegal
// instruction" that intrinsics::abort would cause, as intrinsics::abort is
// implemented as an illegal instruction.
pub fn abort_internal() -> ! {
unsafe { libc::abort() }
}

@cuviper
Copy link
Member

cuviper commented Aug 11, 2020

The note about flushing files isn't something we can rely on:

https://man7.org/linux/man-pages/man3/abort.3.html

   Up until glibc 2.26, if the abort() function caused process
   termination, all open streams were closed and flushed (as with
   fclose(3)).  However, in some cases this could result in deadlocks
   and data corruption.  Therefore, starting with glibc 2.27, abort()
   terminates the process without flushing streams.  POSIX.1 permits
   either possible behavior, saying that abort() "may include an attempt
   to effect fclose() on all open streams".

@codonell
Copy link

In upstream glibc the goal is for abort() to be AS-safe as required by the standard and it should be today (after glibc 2.27). In fact doing anything superfluous during the abort() generally constitutes a security hazard because the process can be in an unknown state when corruption is detected and the abort() called. For the sake of reducing restart latency and avoiding executing any compromised parts of the process we shut down as quickly as possible. We strongly recommend out-of-process backtracing to avoid needing to run any of that code in the process with bad or unknown state.

@ghost
Copy link
Author

ghost commented Aug 12, 2020

In upstream glibc the goal is for abort() to be AS-safe as required by the standard and it should be today (after glibc 2.27).

It's not. It tries to hold a lock.

@tmiasko
Copy link
Contributor

tmiasko commented Aug 13, 2020

Example demonstrating a deadlock. After a dozen of so runs leaves deadlocked children behind:

#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

void *lazy_abort(void *arg) {
  struct timespec req = {.tv_sec = 0, .tv_nsec = 1000000 };
  nanosleep(&req, NULL);
  abort();
  return NULL;
}

int main() {
  pthread_t thread;
  signal(SIGABRT, SIG_IGN);
  int r = pthread_create(&thread, NULL, lazy_abort, NULL);
  assert(r == 0);

  while (1) {
    pid_t pid = fork();
    assert(pid != -1);
    if (pid == 0) {
      abort();
    }
  }
}

At least the conditions for deadlock are quite unlikely, since both a child process and a thread in the parent process must abort.

The musl implementation looks correct (it also uses a lock, which by itself is only a weak indicator of a potential issue).

@Amanieu
Copy link
Member

Amanieu commented Aug 13, 2020

@tmiasko You may want to post that on the glibc bug report here: https://sourceware.org/bugzilla/show_bug.cgi?id=26275

@codonell
Copy link

@tmiasko I have a fix for glibc. Locks are basically required to avoid seeing the unmasked signal state in the child. However, you have to take that lock during fork to ensure that you own and can unlock the lock in the child (avoids the deadlock). The regression test was not straight forward to write and on average it only reproduces once every 100-600 seconds on an x86_64 VM with 4vCPU. That's good enough though and I run the test for 20s to probabilistically try to catch if the defect comes back. On average if I ran the regression test just once per dev-test cycle it might take something like 32 years to show the defect :-( Thankfully more people than just the dev-test cycle will be using the API.

@codonell
Copy link

@ghost
Copy link
Author

ghost commented Sep 28, 2020

Unfortunately, avoiding abort() may still be required after the patch lands, to wait for it to be released and included in GNU/Linux distributions.

@ghost
Copy link
Author

ghost commented Sep 28, 2020

I think we can make sys::abort_internal() call intrinsics::abort() when using glibc and then use rtabort! in spawn().

@Amanieu
Copy link
Member

Amanieu commented Sep 28, 2020

Please don't use intrinsics::abort. If you really want then raise SIGABRT instead.

@ghost
Copy link
Author

ghost commented Sep 28, 2020

@Amanieu
Copy link
Member

Amanieu commented Sep 28, 2020

It is: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html

Although you should probably use pthread_kill, pthread_sigmask and sigaction to properly implement an abort replacement (all of which are also signal-safe).

@ghost
Copy link
Author

ghost commented Sep 28, 2020

It is: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html

Although you should probably use pthread_kill, pthread_sigmask and sigaction to properly implement an abort replacement (all of which are also signal-safe).

Sorry. I confused. You're correct.

@ghost
Copy link
Author

ghost commented Sep 28, 2020

Although you should probably use pthread_kill, pthread_sigmask and sigaction to properly implement an abort replacement (all of which are also signal-safe).

@Amanieu https://man7.org/linux/man-pages/man3/raise.3.html#DESCRIPTION says raise(sig) is equivalent to pthread_kill(pthread_self(), sig). I think pthread_sigmask(), sigaction() and just raise() can be used to implement the replacement.

I can submit a PR to do that.
@rustbot claim

@rustbot rustbot assigned ghost Sep 28, 2020
@cuviper
Copy link
Member

cuviper commented Sep 28, 2020

IMO we should just leave it as glibc's acknowledged bug, unless folks are actively hitting this problem. While we do hope to be technically "pure", the actual abort race seems like a very artificial situation that I doubt is a problem in practice.

@ghost
Copy link
Author

ghost commented Sep 28, 2020

IMO we should just leave it as glibc's acknowledged bug, unless folks are actively hitting this problem.

@cuviper @Amanieu I've found that if I have to avoid intrinsics::abort() and use libc::raise(), I almost need to reimplement libc::abort() and port its tests to std. Maybe it's really better to keep using libc::abort() and just replace the assert! with rtassert!.

While we do hope to be technically "pure", the actual abort race seems like a very artificial situation that I doubt is a problem in practice.

Fair. Also, thanks @codonell for fixing the race in glibc.

I can still submit a PR to just change the assert! to rtassert!. That assert! unsafety shouldn't be ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
7 participants