std: print a backtrace on stackoverflow #133170
Since `backtrace` requires locking and memory allocation, it cannot be used from inside a signal handler. Instead, this uses `libunwind` and `dladdr`, even though, strictly speaking, neither of them is guaranteed to be async-signal-safe. However, at least LLVM's libunwind (used by macOS) has a [test] for unwinding in signal handlers, and `dladdr` is used by `backtrace_symbols_fd` in glibc, which it [documents] as async-signal-safe.

In practice, this hack works well enough on GNU/Linux and macOS (and perhaps some other platforms in the future). Realistically, the worst thing that can happen is that the stack overflow occurred inside the dynamic loader while it holds some sort of lock, which could result in a deadlock if that happens at just the right moment. That's unlikely enough and not the *worst* thing to happen considering that a stack overflow is already an unrecoverable error and most likely indicates a bug.

Fixes rust-lang#51405

[test]: https://github.com/llvm/llvm-project/blob/a6385a3fc8a88f092d07672210a1e773481c2919/libunwind/test/signal_unwind.pass.cpp
[documents]: https://www.gnu.org/software/libc/manual/html_node/Backtraces.html#index-backtrace_005fsymbols_005ffd
/// some other platforms). Realistically, the worst thing that can happen is that
/// the stack overflow occurred inside the dynamic loaded while it holds some sort
/// of lock, which could result in a deadlock if that happens in just the right
/// moment. That's unlikely enough and not the *worst* thing to happen considering
A deadlock may cause a service to hang for an extended period of time until a health check considers the service dead and forcefully restarts it rather than getting restarted immediately upon crashing.
And it could prevent a crash reporter from doing its job.
Yeah, I should have worded this differently. What I meant was just... the whole stack-overflow module is horribly unsound. Leaving aside the fact that signal handling itself isn't really specified in Rust anyway right now, this module contains awful crimes like `mprotect`ing part of the current stack and using the very much non-AS-safe TLS system in signal handlers. Unfortunately, these hacks are required to ensure that stack allocation itself can't cause UB and to make any attempt at stack overflow messages at all. So while normally, I'm all for sound code, in this module, "does this work well enough" is what counts, not "is this sound", because it very much is not.
The examples you point out are totally realistic. In fact, anything can occur; this is unsound after all. But because of the unlikelihood of actual misbehaviour, I assume that the crimes I committed here are acceptable. If not, there may be ways to work around the misbehaviour (for instance by doing `sigalarm` and setting a deadline for the backtrace generation to prevent deadlocks). But I'd rather not commit further crimes until we discover any such misbehaviour.
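For illustration, the deadline idea mentioned above could look roughly like the sketch below. It assumes a POSIX target where `alarm` is available and where `SIGALRM` still has its default disposition (terminate the process), so a backtrace that deadlocks gets cut off when the timer fires. The helper names `arm_deadline` and `disarm_deadline` are made up for this example and are not part of the PR; the `unsafe extern` syntax needs Rust 1.82 or later.

```rust
use std::os::raw::c_uint;

unsafe extern "C" {
    // POSIX `alarm`: schedules SIGALRM after `seconds`; `alarm(0)` cancels.
    // It is on the POSIX list of async-signal-safe functions.
    fn alarm(seconds: c_uint) -> c_uint;
}

/// Hypothetical helper: start a deadline before generating the backtrace.
/// Returns the seconds left on any previously scheduled alarm (0 if none).
/// Note that passing 0 would cancel a pending alarm instead of arming one.
pub fn arm_deadline(seconds: u32) -> u32 {
    // SAFETY: `alarm` has no memory-safety preconditions.
    unsafe { alarm(seconds) }
}

/// Hypothetical helper: cancel the deadline once the backtrace is printed.
/// Returns the seconds that were still remaining (0 if nothing was pending).
pub fn disarm_deadline() -> u32 {
    // SAFETY: as above.
    unsafe { alarm(0) }
}
```

A handler would call `arm_deadline(n)` before walking the stack and `disarm_deadline()` afterwards. Whether the resulting exit by `SIGALRM` instead of `SIGSEGV` is acceptable for crash reporters is a separate question that this sketch does not answer.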
Hey T-libs! How do you feel about this? Is it "safe enough" to call
@rustbot label +I-libs-nominated
let mut count = 0usize;
unsafe { unwind::_Unwind_Backtrace(frame, ptr::from_mut(&mut count).cast()) };
if count > 128 {
Count will never be > 129 so the number of omitted frames is going to be incorrect anyways. It's going to be a huge number anyways since we are handling a stack overflow.
> Count will never be > 129
Yes, it will be?!
Oh sorry, I thought the `frame` function would stop searching after hitting the depth limit, but I checked the docs and `_URC_NO_REASON` will cause it to continue unwinding. I think it should stop at this point since it's possible in certain cases to get "infinite" backtraces with incorrect unwind info (we've had LLVM bugs that caused this before). We definitely want to stop after hitting a certain depth limit, and the total number of frames isn't actually important since stack overflows are often caused by infinite recursion.
Yeah, fair enough... I'll change it so it just prints that some frames were omitted.
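A minimal sketch of the depth-limited callback discussed here, under a few assumptions: the target uses an Itanium-style unwinder (libgcc or LLVM libunwind, not ARM EHABI, where `_Unwind_GetIP` is a macro), the declarations are hand-written instead of coming from std's internal `unwind` crate, and `Frames`, `record_frame`, `capture` and `capture_at_depth` are names invented for the example. The point it shows is that returning any code other than `_URC_NO_REASON` from the callback makes `_Unwind_Backtrace` stop walking. The `unsafe extern` syntax needs Rust 1.82 or later.

```rust
use std::ffi::c_void;
use std::os::raw::c_int;

/// Opaque unwinder context; only ever used behind a raw pointer.
#[repr(C)]
pub struct UnwindContext {
    _private: [u8; 0],
}

// Values of `_Unwind_Reason_Code` in the Itanium C++ ABI.
const URC_NO_REASON: c_int = 0; // continue with the next frame
const URC_END_OF_STACK: c_int = 5; // any non-zero code ends the walk

type TraceFn = extern "C" fn(*mut UnwindContext, *mut c_void) -> c_int;

unsafe extern "C" {
    fn _Unwind_Backtrace(trace: TraceFn, arg: *mut c_void) -> c_int;
    fn _Unwind_GetIP(ctx: *mut UnwindContext) -> usize;
}

pub const MAX_FRAMES: usize = 128;

/// Fixed-size buffer, so the callback never allocates.
pub struct Frames {
    pub ips: [usize; MAX_FRAMES],
    pub len: usize,
    pub truncated: bool,
}

impl Frames {
    pub fn new() -> Frames {
        Frames { ips: [0; MAX_FRAMES], len: 0, truncated: false }
    }
}

extern "C" fn record_frame(ctx: *mut UnwindContext, arg: *mut c_void) -> c_int {
    // SAFETY: `arg` is the `&mut Frames` handed to `_Unwind_Backtrace` in `capture`.
    let frames = unsafe { &mut *(arg as *mut Frames) };
    if frames.len == MAX_FRAMES {
        // Depth limit reached: note that frames were omitted and stop unwinding.
        frames.truncated = true;
        return URC_END_OF_STACK;
    }
    // SAFETY: `ctx` is the live context the unwinder passed to this callback.
    frames.ips[frames.len] = unsafe { _Unwind_GetIP(ctx) };
    frames.len += 1;
    URC_NO_REASON
}

/// Records at most `MAX_FRAMES` instruction pointers of the current stack.
#[inline(never)]
pub fn capture(out: &mut Frames) {
    // SAFETY: the callback only touches `out`, which outlives the call.
    unsafe { _Unwind_Backtrace(record_frame, out as *mut Frames as *mut c_void) };
}

/// Demo helper: recurse `depth` times before capturing to build a deep stack.
#[inline(never)]
pub fn capture_at_depth(depth: usize, out: &mut Frames) {
    if depth == 0 {
        capture(out);
    } else {
        capture_at_depth(depth - 1, out);
    }
    // Work after the call keeps the recursion out of tail position.
    std::hint::black_box(depth);
}
```

With this shape the handler can print the recorded frames and, if `truncated` is set, a note that further frames were omitted, without ever needing the total frame count.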
How will this behave with a hostile/broken backtrace like #123733?
As a libc developer (glibc), I want to make three general comments here.

First, this PR lists a test as an example of unwinding from a signal handler, as justification for treating the use of asynchronous-signal-unsafe functions as acceptable. Calling any asynchronous-signal-unsafe (AS-unsafe) function from a signal handler is allowed if the signal is raised synchronously, since doing so does not cause you to enter asynchronous signal context. The test you quote is an example of using

Second, yes,

Lastly, I will note that we removed very similar code from glibc years ago, because the worst thing that can happen is for there to be an attacker-exploitable defect in this code. When you receive a SIGSEGV the best possible option is to get to
Swift seems to perform out-of-process backtraces on crash. This is done by spawning a
Thank you for your input, it feels very good to have someone from glibc looking at this!
Fair enough. What the example also demonstrates, though, is that it is possible to unwind from a signal context in general, which is a necessary property here.
I know, this is firmly in "Hyrum's law"-territory. Not that it really justifies trespassing further into that, but we are already inside it: we assume that TLS accesses are AS-safe, even though C++ clearly states that they are not (see
Note that we only print the backtrace if the SIGSEGV was in all likelihood the result of a stack overflow, so this isn't as exploitable by things like buffer overflows or stuff like that (obviously it doesn't decrease the risk). Our problem as
It's probably already too late for us in that regard: our panic handlers print a backtrace, and removing that would be ... controversial.

The problem with out-of-process error reporters is that most people don't bother to use one. We don't have the luxury that Swift has of shipping one by default, since Rust binaries must not require the installation of a runtime. I agree though that this would be a better solution, as our current error reporting has quite a few problems itself; e.g. more than half of the size of a Rust hello world binary is related to backtracing support. So perhaps there are ways to at least make it easier to switch to other infrastructure, like making the whole error reporting machinery configurable like the panic behaviour. But that still leaves the question of what we can do for in-process error reporting, and backtracing would definitely be nice to have.
What
Agreed.
Please note that first TLS accesses are not AS-safe either, since
That is your choice to make :-)
The general problem is that unwinder bugs now become runtime CVEs if they can be chained in an exploit with a SIGSEGV. Just for reference, from our glibc NEWS: this isn't a hypothetical problem, we have had real CVEs here:
I'm aware that gfortran uses libbacktrace for similar purposes, but libbacktrace has a lot of "restrictions" in this area too (https://github.com/ianlancetaylor/libbacktrace).
And print the known register?
The signal handler runs on a separate stack. It has to, as we don't have any remaining stack space on the main stack of the thread due to the stack overflow. As such, I think we'd have to use code that is both OS- and architecture-dependent to load the stack pointer from the location where the OS saves all registers before calling the signal handler.
Absolutely! Our current stack overflow hack works because we initialize the TLS variables containing the guard page range before we set up the
The

By my reckoning, if an exploitable memory safety bug is bad enough to be able to cause a SIGSEGV, masquerade it as a stack overflow, and trigger a bug in the stack overflow handler, the bug is bad enough that it can be used to do pretty much everything even without exploiting another bug.
The problem isn't getting the
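To make the "only if the SIGSEGV was in all likelihood a stack overflow" check from this exchange concrete, here is a rough sketch of classifying a fault address against a per-thread guard page range. All names (`GUARD`, `set_guard_range`, `is_probable_stack_overflow`) are hypothetical and the real module may be organised quite differently. As noted in the thread, reading TLS from a signal handler is itself not guaranteed to be async-signal-safe, so this only illustrates the logic, not a sound implementation.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread guard page range as a half-open interval
    // [start, end); (0, 0) means "no guard page known for this thread".
    static GUARD: Cell<(usize, usize)> = Cell::new((0, 0));
}

/// Records the guard page range for the current thread. Per the discussion,
/// this has to happen before the signal handler can possibly run.
pub fn set_guard_range(start: usize, end: usize) {
    GUARD.with(|g| g.set((start, end)));
}

/// Returns true if `fault_addr` lies inside the current thread's guard page,
/// i.e. the fault is most likely a stack overflow rather than a stray access.
pub fn is_probable_stack_overflow(fault_addr: usize) -> bool {
    let (start, end) = GUARD.with(|g| g.get());
    start <= fault_addr && fault_addr < end
}
```

A handler following this pattern would print the overflow message (and possibly a backtrace) only when the check succeeds, and otherwise fall back to the default SIGSEGV behaviour.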
We discussed this during today's T-libs meeting. While we do want the stack overflow debugging experience to improve, we think this approach is on too shaky ground. Publishing official documentation on how to get stack traces with debuggers, and changing the static message to include a link to that, was discussed as a simpler alternative. The other alternative, as already mentioned in this thread, would be out-of-process backtraces.
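As a very rough illustration of the out-of-process direction, the sketch below shows only the first building block: a parent process that runs the program as a child and notices that it was killed by a signal. It assumes a Unix target and uses only `std`; `Outcome`, `classify` and `run_supervised` are names invented for the example. It does not produce a backtrace, which would need extra machinery (a debugger, core dumps, or a helper that inspects the child before it dies).

```rust
use std::io;
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, ExitStatus};

/// How the supervised child ended.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Normal exit with the given code.
    Exited(i32),
    /// Killed by the given signal number (11 is SIGSEGV on Linux and macOS).
    Signaled(i32),
}

/// Turns an `ExitStatus` into an `Outcome`.
pub fn classify(status: ExitStatus) -> Outcome {
    match status.signal() {
        Some(sig) => Outcome::Signaled(sig),
        // `code()` is `None` only when a signal ended the process, which the
        // arm above already covers; -1 is an arbitrary fallback.
        None => Outcome::Exited(status.code().unwrap_or(-1)),
    }
}

/// Runs `cmd` to completion and reports how it ended. A crash reporter
/// would hook in wherever `Outcome::Signaled` is observed.
pub fn run_supervised(cmd: &mut Command) -> io::Result<Outcome> {
    let status = cmd.status()?;
    Ok(classify(status))
}
```

For example, supervising `sh -c 'kill -s SEGV $$'` yields `Outcome::Signaled(11)` on Linux and macOS, while `sh -c 'exit 3'` yields `Outcome::Exited(3)`.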
@rfcbot fcp close
Team member @the8472 has proposed to close this. The next step is review by the rest of the tagged team members: No concerns currently listed. Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me.
☔ The latest upstream changes (presumably #135540) made this pull request unmergeable. Please resolve the merge conflicts.