
std: print a backtrace on stackoverflow #133170

Open · wants to merge 2 commits into master
Conversation

@joboet (Member) commented Nov 18, 2024

Since `backtrace` requires locking and memory allocation, it cannot be used from inside a signal handler. Instead, this uses `libunwind` and `dladdr`. Strictly speaking, neither of them is guaranteed to be async-signal-safe, but at least LLVM's libunwind (used by macOS) has a [test](https://github.com/llvm/llvm-project/blob/a6385a3fc8a88f092d07672210a1e773481c2919/libunwind/test/signal_unwind.pass.cpp) for unwinding in signal handlers, and `dladdr` is used by `backtrace_symbols_fd` in glibc, which it [documents](https://www.gnu.org/software/libc/manual/html_node/Backtraces.html#index-backtrace_005fsymbols_005ffd) as async-signal-safe.

In practice, this hack works well enough on GNU/Linux and macOS (and perhaps some other platforms in the future). Realistically, the worst thing that can happen is that the stack overflow occurred inside the dynamic loader while it holds some sort of lock, which could result in a deadlock if that happens at just the wrong moment. That's unlikely enough, and not the *worst* thing to happen, considering that a stack overflow is already an unrecoverable error and most likely indicates a bug.

Fixes #51405

@rustbot (Collaborator) commented Nov 18, 2024

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added labels on Nov 18, 2024: O-unix (Operating system: Unix-like), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.), T-libs (Relevant to the library team, which will review and decide on the PR/issue.)
@joboet force-pushed the stack_overflow_backtrace branch from 09b79d7 to 7c5af8e on November 18, 2024
/// some other platforms). Realistically, the worst thing that can happen is that
/// the stack overflow occurred inside the dynamic loader while it holds some sort
/// of lock, which could result in a deadlock if that happens in just the right
/// moment. That's unlikely enough and not the *worst* thing to happen considering
Member

A deadlock may cause a service to hang for an extended period of time until a health check considers the service dead and forcefully restarts it rather than getting restarted immediately upon crashing.

@bjorn3 (Member) commented Nov 19, 2024

And it could prevent a crash reporter from doing its job.

Member Author

Yeah, I should have worded this differently. What I meant was just... the whole stack-overflow module is horribly unsound. Leaving aside the fact that signal handling itself isn't really specified in Rust anyway right now, this module contains awful crimes like mprotecting part of the current stack and using the very much non-AS-safe TLS system in signal handlers. Unfortunately, these hacks are required to ensure that stack allocation itself can't cause UB and to make any attempt at stack overflow messages at all. So while normally, I'm all for sound code, in this module, "does this work well enough" is what counts, not "is this sound", because it very much is not.

The examples you point out are totally realistic. In fact, anything can occur; this is unsound after all. But because of the unlikelihood of actual misbehaviour, I assume that the crimes I committed here are acceptable. If not, there may be ways to work around the misbehaviour (for instance by using SIGALRM to set a deadline for the backtrace generation, preventing deadlocks). But I'd rather not commit further crimes until we discover any such misbehaviour.

@joboet (Member, Author) commented Nov 25, 2024

Hey T-libs! How do you feel about this? Is it "safe enough" to call dladdr and _Unwind_Backtrace in signal handlers?

@rustbot label +I-libs-nominated

@rustbot rustbot added the I-libs-nominated Nominated for discussion during a libs team meeting. label Nov 25, 2024

let mut count = 0usize;
unsafe { unwind::_Unwind_Backtrace(frame, ptr::from_mut(&mut count).cast()) };
if count > 128 {
Member

Count will never be > 129 so the number of omitted frames is going to be incorrect anyways. It's going to be a huge number anyways since we are handling a stack overflow.

Member Author

> Count will never be > 129

Yes, it will be?!

Member

Oh sorry, I thought the frame function would stop searching after hitting the depth limit, but I checked the docs and _URC_NO_REASON will cause it to continue unwinding. I think it should stop at this point, since it's possible in certain cases to get "infinite" backtraces with incorrect unwind info (we've had LLVM bugs that caused this before). We definitely want to stop after hitting a certain depth limit, and the total number of frames isn't actually important, since stack overflows are often caused by infinite recursion.

Member Author

Yeah, fair enough... I'll change it so it just prints that some frames were omitted.

@cuviper (Member) commented Nov 27, 2024

How will this behave with a hostile/broken backtrace like #123733?

@codonell

As a libc developer (glibc), I want to make three general comments here.

First, this PR lists a test as an example of unwinding from a signal handler, as justification that using asynchronous-signal-unsafe functions is acceptable. Calling any asynchronous-signal-unsafe (AS-unsafe) function from a signal handler is allowed if the signal is raised synchronously, since doing so does not cause you to enter asynchronous signal context. The test you quote is an example of using kill to deliver a signal to yourself in a synchronous fashion. This signal frame can be unwound safely because nothing else was running. Thus the justification for the use is not correct and probably needs a different example.

Second, yes, dladdr is used by backtrace_symbols_fd, but presuming that this makes dladdr AS-safe is not prudent, since we don't document dladdr as AS-safe because we don't want to restrict future implementations. Practically speaking, though, we do little more than take the recursive loader lock (to avoid concurrent dlopen) and walk some data structures. I'm just warning that dladdr is not marked AS-safe and so it may break this code in the future. In general this use is not robust unless the caller assures you AS-safety, and I expect they will not.

Lastly, I will note that we removed very similar code from glibc years ago, because the worst thing that can happen is for there to be an attacker-exploitable defect in this code. When you receive a SIGSEGV, the best possible option is to get to exit as quickly as possible, for two reasons: (a) it reduces service restart latency and downtime, and (b) it reduces the number of instructions that could be exploited by an attacker. Out-of-process backtraces are always the recommended solution today in the industry. Safe-by-default is not having these options present for in-process backtraces in the core libraries.

@Amanieu (Member) commented Nov 28, 2024

Swift seems to perform out-of-process backtraces on crash. This is done by spawning a swift-backtrace helper program from the signal handler, which will then ptrace the original program and remotely extract a backtrace.

@joboet (Member, Author) commented Nov 28, 2024

> As a libc developer (glibc), I want to make three general comments here.

Thank you for your input, it feels very good to have someone from glibc looking at this!

> First, this PR lists a test as an example of unwinding from a signal handler, as justification that using asynchronous-signal-unsafe functions is acceptable. Calling any asynchronous-signal-unsafe (AS-unsafe) function from a signal handler is allowed if the signal is raised synchronously, since doing so does not cause you to enter asynchronous signal context. The test you quote is an example of using kill to deliver a signal to yourself in a synchronous fashion. This signal frame can be unwound safely because nothing else was running. Thus the justification for the use is not correct and probably needs a different example.

Fair enough. What the example also demonstrates, though, is that it is possible to unwind from a signal context in general, which is a necessary property here.

> Second, yes, dladdr is used by backtrace_symbols_fd, but presuming that this makes dladdr AS-safe is not prudent, since we don't document dladdr as AS-safe because we don't want to restrict future implementations. Practically speaking, though, we do little more than take the recursive loader lock (to avoid concurrent dlopen) and walk some data structures. I'm just warning that dladdr is not marked AS-safe and so it may break this code in the future. In general this use is not robust unless the caller assures you AS-safety, and I expect they will not.

I know, this is firmly in "Hyrum's law"-territory. Not that it really justifies trespassing further into that, but we are already inside it: we assume that TLS accesses are AS-safe, even though C++ clearly states that they are not (see support.signal in the spec).

> Lastly, I will note that we removed very similar code from glibc years ago, because the worst thing that can happen is for there to be an attacker-exploitable defect in this code. When you receive a SIGSEGV, the best possible option is to get to exit as quickly as possible, for two reasons: (a) it reduces service restart latency and downtime, and (b) it reduces the number of instructions that could be exploited by an attacker.

Note that we only print the backtrace if the SIGSEGV was in all likelihood the result of a stack overflow, so this isn't as easily reachable from things like buffer overflows (though obviously it doesn't decrease the risk). Our problem as std maintainers is that SIGSEGVs are very rare in Rust as a result of its memory safety, so users (particularly less-experienced ones) would be very surprised to encounter one as a result of a stack overflow. They'd probably incorrectly assume that they had managed to hit a memory safety issue, when in reality stack overflows are a totally memory-safe thing to happen, even if undesired. That's why we bother with all this signal handling stuff in the first place, and commit terrible crimes (e.g. using TLS to get the address of the stack guard page [1]) to provide at least some indication of what happened.

> Out-of-process backtraces are always the recommended solution today in the industry. Safe-by-default is not having these options present for in-process backtraces in the core libraries.

> Swift seems to perform out-of-process backtraces on crash. This is done by spawning a swift-backtrace helper program from the signal handler, which will then ptrace the original program and remotely extract a backtrace.

It's probably already too late for us in that regard, our panic handlers print a backtrace, and removing that would be ... controversial. The problem with out-of-process error reporters is that most people don't bother to use one. We don't have the luxury that Swift has of shipping one by default, since Rust binaries must not require the installation of a runtime. I agree though that this would be a better solution as our current error reporting has quite a few problems itself – e.g. more than half of the size of a Rust hello world binary is related to backtracing support. So perhaps there are ways to at least make it easier to switch to other infrastructure, like making the whole error reporting machinery configurable like the panic behaviour. But that still leaves the question of what we can do for in-process error reporting, and backtracing would definitely be nice to have.

Footnotes

  1. There might be a way around that, because IIRC LLVM currently ensures that the memory fault on stack overflow is less than one page away from the current stack pointer, so we could retrieve that instead. Unfortunately, accessing the stack pointer is done differently on nearly every single platform, so this would be awful to maintain.

@joboet (Member, Author) commented Nov 28, 2024

What swift-backtrace shows, however, is that you can (at least on Linux) spawn threads from a signal handler (using clone). So if the hack here is deemed too risky, we could spawn a thread and do the unwinding/symbolication there instead. The worst thing that can happen in another thread is a deadlock (assuming we don't call into libc; I don't think it interacts nicely with threads not spawned via pthread).

@codonell

> Fair enough. What the example also demonstrates, though, is that it is possible to unwind from a signal context in general, which is a necessary property here.

Agreed.

> Second, yes, dladdr is used by backtrace_symbols_fd, but presuming that this makes dladdr AS-safe is not prudent, since we don't document dladdr as AS-safe because we don't want to restrict future implementations. Practically speaking, though, we do little more than take the recursive loader lock (to avoid concurrent dlopen) and walk some data structures. I'm just warning that dladdr is not marked AS-safe and so it may break this code in the future. In general this use is not robust unless the caller assures you AS-safety, and I expect they will not.

> I know, this is firmly in "Hyrum's law"-territory. Not that it really justifies trespassing further into that, but we are already inside it: we assume that TLS accesses are AS-safe, even though C++ clearly states that they are not (see support.signal in the spec).

Please note that first TLS accesses are not AS-safe either, since __tls_get_addr et al. allocate memory on first access, so we actually have AS-unsafe first accesses of TLS variables in glibc today. This is an implementation defect in glibc, though, and should be fixed; I call it out in all fairness to highlight the difficulties faced by the implementation. The difficulty in fixing this quirk is that you would have to allocate all TLS variables up front for all threads at dlopen time, and in the early implementation this was an unacceptable amount of memory to allocate for robustness with respect to first access in an asynchronous signal handler. Today I would say this was a mistake we need to correct by having dlopen carry out the upfront allocations.

> Note that we only print the backtrace if the SIGSEGV was in all likelihood the result of a stack overflow, so this isn't as easily reachable from things like buffer overflows (though obviously it doesn't decrease the risk). Our problem as std maintainers is that SIGSEGVs are very rare in Rust as a result of its memory safety, so users (particularly less-experienced ones) would be very surprised to encounter one as a result of a stack overflow. They'd probably incorrectly assume that they had managed to hit a memory safety issue, when in reality stack overflows are a totally memory-safe thing to happen, even if undesired. That's why we bother with all this signal handling stuff in the first place, and commit terrible crimes (e.g. using TLS to get the address of the stack guard page [1]) to provide at least some indication of what happened.

That is your choice to make :-)

> It's probably already too late for us in that regard, our panic handlers print a backtrace, and removing that would be ... controversial. The problem with out-of-process error reporters is that most people don't bother to use one. We don't have the luxury that Swift has of shipping one by default, since Rust binaries must not require the installation of a runtime. I agree though that this would be a better solution as our current error reporting has quite a few problems itself – e.g. more than half of the size of a Rust hello world binary is related to backtracing support. So perhaps there are ways to at least make it easier to switch to other infrastructure, like making the whole error reporting machinery configurable like the panic behaviour. But that still leaves the question of what we can do for in-process error reporting, and backtracing would definitely be nice to have.

The general problem is that unwinder bugs now become runtime CVEs if they can be chained in an exploit with a SIGSEGV.

Just for reference, from our glibc NEWS: this isn't a hypothetical problem; we have had real CVEs here:

  CVE-2020-1751: A defect in the PowerPC backtrace function could cause an
  out-of-bounds write when executed in a signal frame context.

* Avoid printing a backtrace from the __stack_chk_fail function since it is
  called on a corrupt stack and a backtrace is unreliable on a corrupt stack
  (CVE-2010-3192).

* On ARM EABI (32-bit), generating a backtrace for execution contexts which
  have been created with makecontext could fail to terminate due to a
  missing .cantunwind annotation.  This has been observed to lead to a hang
  (denial of service) in some Go applications compiled with gccgo.  Reported
  by Andreas Schwab.  (CVE-2016-6323)

I'm aware that gfortran uses libbacktrace for similar purposes, but libbacktrace has a lot of "restrictions" in this area too (https://github.com/ianlancetaylor/libbacktrace).

> 1. There _might_ be a way around that, because IIRC LLVM currently ensures that the memory fault on stack overflow is less than one page away from the current stack pointer, so we could retrieve that instead. Unfortunately, accessing the stack pointer is done differently on nearly _every single_ platform, so this would be _awful_ to maintain.
jmp_buf jmp;
setjmp(jmp);

And print the known register?

@bjorn3 (Member) commented Nov 28, 2024

> jmp_buf jmp;
> setjmp(jmp);
>
> And print the known register?

The signal handler runs on a separate stack. It has to, as we don't have any remaining stack space on the main stack of the thread due to the stack overflow. As such, I think we would have to use code that is both OS- and architecture-dependent to load the stack pointer from the location where the OS saves all registers before calling the signal handler.

@joboet (Member, Author) commented Nov 28, 2024

> Please note that first TLS accesses are not AS-safe either

Absolutely! Our current stack overflow hack works because we initialize the TLS variables containing the guard page range before we set up the sigaltstack, so when the signal handler is called (it is only called successfully when an alternate signal stack was installed, since the default stack is already full), it won't be the first access. That's assuming that no user code does a sigaltstack without also changing the signal handler; that's one of our implicit runtime assumptions. [1] Still, we rely on undocumented behaviour, as AS-safety isn't guaranteed for subsequent TLS accesses either.

> The general problem is that unwinder bugs now become runtime CVEs if they can be chained in an exploit with a SIGSEGV.
>
> Just for reference, from our glibc NEWS: this isn't a hypothetical problem; we have had real CVEs here:

The __stack_chk_fail one is not relevant here; that function is used for stack corruption, which can only result from unsoundness. Stack overflows are not a memory safety issue, however, and nothing is corrupted: the thread is just frozen in time until we return from the signal handler. So once we categorize the SIGSEGV as a stack overflow, we are not in a situation where we particularly need to worry about corrupted state.

By my reckoning, [2] if an exploitable memory safety bug is bad enough to be able to cause a SIGSEGV, masquerade it as a stack overflow, and trigger a bug in the stack overflow handler, then the bug is bad enough that it can be used to do pretty much anything even without exploiting another bug.

> jmp_buf jmp;
> setjmp(jmp);

The problem isn't getting the mcontext_t (that's passed to us by the kernel); it's retrieving the stack pointer from it. Almost every platform does that in a different way.

Footnotes

  1. Oh man, we should really document those somewhere...

  2. Based on very limited experience with exploitable memory safety issues – it's just something we rarely get in Rustland. So please feel free to correct me!

@the8472 (Member) commented Jan 15, 2025

We discussed this during today's T-libs meeting. While we do want the stack overflow debugging experience to improve, we think this approach is on too shaky ground.

Publishing official documentation on how to get stack traces with debuggers, and changing the static message to include a link to it, was discussed as a simpler alternative. The other alternative, as already mentioned in this thread, would be out-of-process backtraces.

@the8472 (Member) commented Jan 15, 2025

@rfcbot fcp close

@rfcbot commented Jan 15, 2025

Team member @the8472 has proposed to close this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot added labels on Jan 15, 2025: proposed-final-comment-period (Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off.), disposition-close (This PR / issue is in PFCP or FCP with a disposition to close it.)
@bors (Contributor) commented Jan 15, 2025

☔ The latest upstream changes (presumably #135540) made this pull request unmergeable. Please resolve the merge conflicts.

@Amanieu Amanieu removed the I-libs-nominated Nominated for discussion during a libs team meeting. label Jan 22, 2025
Successfully merging this pull request may close these issues.

Great stack overflow error messages