
std: print a backtrace on stackoverflow #133170

Open · wants to merge 2 commits into master
Conversation

@joboet (Member) commented Nov 18, 2024

Since `backtrace` requires locking and memory allocation, it cannot be used from inside a signal handler. Instead, this uses `libunwind` and `dladdr`. Strictly speaking, neither of them is guaranteed to be async-signal-safe, but at least LLVM's libunwind (used by macOS) has a [test](https://github.com/llvm/llvm-project/blob/a6385a3fc8a88f092d07672210a1e773481c2919/libunwind/test/signal_unwind.pass.cpp) for unwinding in signal handlers, and `dladdr` is used by `backtrace_symbols_fd` in glibc, which it [documents](https://www.gnu.org/software/libc/manual/html_node/Backtraces.html#index-backtrace_005fsymbols_005ffd) as async-signal-safe.

In practice, this hack works well enough on GNU/Linux and macOS (and perhaps some other platforms in the future). Realistically, the worst thing that can happen is that the stack overflow occurred inside the dynamic loader while it holds some sort of lock, which could result in a deadlock if that happens at just the wrong moment. That's unlikely enough, and not the *worst* thing to happen, considering that a stack overflow is already an unrecoverable error and most likely indicates a bug.

Fixes #51405

@rustbot (Collaborator) commented Nov 18, 2024

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added labels on Nov 18, 2024: O-unix (Operating system: Unix-like), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.), T-libs (Relevant to the library team, which will review and decide on the PR/issue.)
@joboet force-pushed the stack_overflow_backtrace branch from 09b79d7 to 7c5af8e on November 18, 2024
/// some other platforms). Realistically, the worst thing that can happen is that
/// the stack overflow occurred inside the dynamic loader while it holds some sort
/// of lock, which could result in a deadlock if that happens in just the right
/// moment. That's unlikely enough and not the *worst* thing to happen considering
Member

A deadlock may cause a service to hang for an extended period of time until a health check considers the service dead and forcefully restarts it rather than getting restarted immediately upon crashing.

@bjorn3 (Member) commented Nov 19, 2024

And it could prevent a crash reporter from doing its job.

Member Author

Yeah, I should have worded this differently. What I meant was just... the whole stack-overflow module is horribly unsound. Leaving aside the fact that signal handling itself isn't really specified in Rust anyway right now, this module contains awful crimes like mprotecting part of the current stack and using the very much non-AS-safe TLS system in signal handlers. Unfortunately, these hacks are required to ensure that stack allocation itself can't cause UB and to make any attempt at stack overflow messages at all. So while normally, I'm all for sound code, in this module, "does this work well enough" is what counts, not "is this sound", because it very much is not.

The examples you point out are totally realistic. In fact, anything can occur; this is unsound after all. But because of the unlikelihood of actual misbehaviour, I assume that the crimes I committed here are acceptable. If not, there may be ways to work around the misbehaviour (for instance by using SIGALRM to set a deadline for the backtrace generation, preventing deadlocks). But I'd rather not commit further crimes until we discover any such misbehaviour.

@joboet (Member, Author) commented Nov 25, 2024

Hey T-libs! How do you feel about this? Is it "safe enough" to call dladdr and _Unwind_Backtrace in signal handlers?

@rustbot label +I-libs-nominated

@rustbot rustbot added the I-libs-nominated Nominated for discussion during a libs team meeting. label Nov 25, 2024

let mut count = 0usize;
unsafe { unwind::_Unwind_Backtrace(frame, ptr::from_mut(&mut count).cast()) };
if count > 128 {
Member

Count will never be > 129 so the number of omitted frames is going to be incorrect anyways. It's going to be a huge number anyways since we are handling a stack overflow.

Member Author

> Count will never be > 129

Yes, it will be?!

Member

Oh sorry, I thought the frame function would stop searching after hitting the depth limit, but I checked the docs and _URC_NO_REASON will cause it to continue unwinding. I think it should stop at this point, since it's possible in certain cases to get "infinite" backtraces with incorrect unwind info (we've had LLVM bugs that caused this before). We definitely want to stop after hitting a certain depth limit, and the total number of frames isn't actually important, since stack overflows are often caused by infinite recursion.

Member Author

Yeah, fair enough... I'll change it so it just prints that some frames were omitted.

@cuviper (Member) commented Nov 27, 2024

How will this behave with a hostile/broken backtrace like #123733?

@codonell

As a libc developer (glibc), I want to make three general comments here.

First, this PR lists a test as an example of unwinding from a signal handler, as justification that using asynchronous-signal-unsafe functions is acceptable. Calling any asynchronous-signal-unsafe (AS-unsafe) function from a signal handler is allowed if the signal is raised synchronously, since doing so does not cause you to enter asynchronous signal context. The test you quote is an example of using kill to deliver a signal to yourself in a synchronous fashion. This signal frame can be unwound safely because nothing else was running. Thus the justification for the use is not correct and probably needs a different example.

Second, yes, dladdr is used by backtrace_symbols_fd, but presuming that this makes dladdr AS-safe is not prudent, since we don't document dladdr as AS-safe because we don't want to restrict future implementations. Practically speaking, though, we do little more than take the recursive loader lock (to avoid concurrent dlopen) and walk some data structures. I'm just warning that dladdr is not marked AS-safe and so it may break this code in the future. In general this use is not robust unless the caller assures you AS-safety, and I expect they will not.

Lastly, I will note that we removed very similar code from glibc years ago, because the worst thing that can happen is for there to be an attacker-exploitable defect in this code. When you receive a SIGSEGV, the best possible option is to get to exit as quickly as possible, for two reasons: (a) it reduces service restart latency and downtime, and (b) it reduces the number of instructions that could be exploited by an attacker. Out-of-process backtraces are always the recommended solution today in the industry. Safe-by-default is not having these options present for in-process backtraces in the core libraries.

@Amanieu (Member) commented Nov 28, 2024

Swift seems to perform out-of-process backtraces on crash. This is done by spawning a swift-backtrace helper program from the signal handler, which will then ptrace the original program and remotely extract a backtrace.

@joboet (Member, Author) commented Nov 28, 2024

> As a libc developer (glibc), I want to make three general comments here.

Thank you for your input, it feels very good to have someone from glibc looking at this!

> First, this PR lists a test as an example of unwinding from a signal handler, as justification that using asynchronous-signal-unsafe functions is acceptable. Calling any asynchronous-signal-unsafe (AS-unsafe) function from a signal handler is allowed if the signal is raised synchronously, since doing so does not cause you to enter asynchronous signal context. The test you quote is an example of using kill to deliver a signal to yourself in a synchronous fashion. This signal frame can be unwound safely because nothing else was running. Thus the justification for the use is not correct and probably needs a different example.

Fair enough. What the example also demonstrates, though, is that it is possible to unwind from a signal context in general, which is a necessary property here.

> Second, yes, dladdr is used by backtrace_symbols_fd, but presuming that this makes dladdr AS-safe is not prudent, since we don't document dladdr as AS-safe because we don't want to restrict future implementations. Practically speaking, though, we do little more than take the recursive loader lock (to avoid concurrent dlopen) and walk some data structures. I'm just warning that dladdr is not marked AS-safe and so it may break this code in the future. In general this use is not robust unless the caller assures you AS-safety, and I expect they will not.

I know, this is firmly in "Hyrum's law"-territory. Not that it really justifies trespassing further into that, but we are already inside it: we assume that TLS accesses are AS-safe, even though C++ clearly states that they are not (see support.signal in the spec).

> Lastly, I will note that we removed very similar code from glibc years ago, because the worst thing that can happen is for there to be an attacker-exploitable defect in this code. When you receive a SIGSEGV, the best possible option is to get to exit as quickly as possible, for two reasons: (a) it reduces service restart latency and downtime, and (b) it reduces the number of instructions that could be exploited by an attacker.

Note that we only print the backtrace if the SIGSEGV was in all likelihood the result of a stack overflow, so this isn't as easily reachable from things like buffer overflows (though obviously it doesn't decrease the risk). Our problem as std maintainers is that SIGSEGVs are very rare in Rust as a result of its memory safety, so users (particularly less-experienced ones) would be very surprised to encounter one as a result of a stack overflow. They'd probably incorrectly assume that they had managed to hit a memory safety issue, when in reality stack overflows are a totally memory-safe thing to happen, even if undesired. That's why we bother with all this signal handling stuff in the first place, and commit terrible crimes (e.g. using TLS to get the address of the stack guard page [1]) to provide at least some indication of what happened.

> Out-of-process backtraces are always the recommended solution today in the industry. Safe-by-default is not having these options present for in-process backtraces in the core libraries.

> Swift seems to perform out-of-process backtraces on crash. This is done by spawning a swift-backtrace helper program from the signal handler, which will then ptrace the original program and remotely extract a backtrace.

It's probably already too late for us in that regard, our panic handlers print a backtrace, and removing that would be ... controversial. The problem with out-of-process error reporters is that most people don't bother to use one. We don't have the luxury that Swift has of shipping one by default, since Rust binaries must not require the installation of a runtime. I agree though that this would be a better solution as our current error reporting has quite a few problems itself – e.g. more than half of the size of a Rust hello world binary is related to backtracing support. So perhaps there are ways to at least make it easier to switch to other infrastructure, like making the whole error reporting machinery configurable like the panic behaviour. But that still leaves the question of what we can do for in-process error reporting, and backtracing would definitely be nice to have.

Footnotes

  1. There might be a way around that, because IIRC LLVM currently ensures that the memory fault on stack overflow is less than one page away from the current stack pointer, so we could retrieve that instead. Unfortunately, accessing the stack pointer is done differently on nearly every single platform, so this would be awful to maintain.

@joboet (Member, Author) commented Nov 28, 2024

What swift-backtrace shows, however, is that you can (at least on Linux) spawn threads from a signal handler (using clone). So if the hack here is deemed too risky, we could spawn a thread and do the unwinding/symbolication there instead. The worst thing that can happen in another thread is a deadlock (assuming we don't call into libc; I don't think it interacts nicely with threads not spawned via pthread).

@codonell

> Fair enough. What the example also demonstrates, though, is that it is possible to unwind from a signal context in general, which is a necessary property here.

Agreed.

> Second, yes, dladdr is used by backtrace_symbols_fd, but presuming that this makes dladdr AS-safe is not prudent, since we don't document dladdr as AS-safe because we don't want to restrict future implementations. Practically speaking, though, we do little more than take the recursive loader lock (to avoid concurrent dlopen) and walk some data structures. I'm just warning that dladdr is not marked AS-safe and so it may break this code in the future. In general this use is not robust unless the caller assures you AS-safety, and I expect they will not.

> I know, this is firmly in "Hyrum's law"-territory. Not that it really justifies trespassing further into that, but we are already inside it: we assume that TLS accesses are AS-safe, even though C++ clearly states that they are not (see support.signal in the spec).

Please note that first TLS accesses are not AS-safe either, since __tls_get_addr et al. allocate memory on first access, so we actually have AS-unsafe first accesses of TLS variables in glibc today. This is an implementation defect in glibc, though, and should be fixed; I call it out in all fairness to highlight the difficulties faced by the implementation. The difficulty in fixing this quirk is that you would have to allocate all TLS variables up front for all threads at dlopen time, and in the early implementation this was an unacceptable amount of memory to allocate for robustness with respect to first access in an asynchronous signal handler. Today I would say this was a mistake we need to correct by having dlopen carry out the upfront allocations.

> Note that we only print the backtrace if the SIGSEGV was in all likelihood the result of a stack overflow, so this isn't as easily reachable from things like buffer overflows (though obviously it doesn't decrease the risk). Our problem as std maintainers is that SIGSEGVs are very rare in Rust as a result of its memory safety, so users (particularly less-experienced ones) would be very surprised to encounter one as a result of a stack overflow. They'd probably incorrectly assume that they had managed to hit a memory safety issue, when in reality stack overflows are a totally memory-safe thing to happen, even if undesired. That's why we bother with all this signal handling stuff in the first place, and commit terrible crimes (e.g. using TLS to get the address of the stack guard page [1]) to provide at least some indication of what happened.

That is your choice to make :-)

> It's probably already too late for us in that regard, our panic handlers print a backtrace, and removing that would be ... controversial. The problem with out-of-process error reporters is that most people don't bother to use one. We don't have the luxury that Swift has of shipping one by default, since Rust binaries must not require the installation of a runtime. I agree though that this would be a better solution as our current error reporting has quite a few problems itself – e.g. more than half of the size of a Rust hello world binary is related to backtracing support. So perhaps there are ways to at least make it easier to switch to other infrastructure, like making the whole error reporting machinery configurable like the panic behaviour. But that still leaves the question of what we can do for in-process error reporting, and backtracing would definitely be nice to have.

The general problem is that unwinder bugs now become runtime CVEs if they can be chained in an exploit with a SIGSEGV.

Just for reference, from our glibc NEWS: this isn't a hypothetical problem; we have had real CVEs here:

  CVE-2020-1751: A defect in the PowerPC backtrace function could cause an
  out-of-bounds write when executed in a signal frame context.

* Avoid printing a backtrace from the __stack_chk_fail function since it is
  called on a corrupt stack and a backtrace is unreliable on a corrupt stack
  (CVE-2010-3192).

* On ARM EABI (32-bit), generating a backtrace for execution contexts which
  have been created with makecontext could fail to terminate due to a
  missing .cantunwind annotation.  This has been observed to lead to a hang
  (denial of service) in some Go applications compiled with gccgo.  Reported
  by Andreas Schwab.  (CVE-2016-6323)

I'm aware that gfortran uses libbacktrace for similar purposes, but libbacktrace has a lot of "restrictions" in this area too (https://github.com/ianlancetaylor/libbacktrace).

> 1. There _might_ be a way around that, because IIRC LLVM currently ensures that the memory fault on stack overflow is less than one page away from the current stack pointer, so we could retrieve that instead. Unfortunately, accessing the stack pointer is done differently on nearly _every single_ platform, so this would be _awful_ to maintain.
jmp_buf jmp;
setjmp(jmp);

And print the known register?

@bjorn3 (Member) commented Nov 28, 2024

> jmp_buf jmp;
> setjmp(jmp);
>
> And print the known register?

The signal handler runs on a separate stack. It has to, as we don't have any remaining stack space on the main stack of the thread due to the stack overflow. As such, I think we would have to use code that is both OS- and architecture-dependent to load the stack pointer from the location where the OS saves all registers before calling the signal handler.

@joboet (Member, Author) commented Nov 28, 2024

> Please note that first TLS accesses are not AS-safe either

Absolutely! Our current stack overflow hack works because we initialize the TLS variables containing the guard page range before we set up the sigaltstack, so when the signal handler is called (it is only called successfully when an alternate signal stack was installed, since the default stack is already full), it won't be the first access. That's assuming that no user code does a sigaltstack without also changing the signal handler; that's one of our implicit runtime assumptions. [1] Still, we rely on undocumented behaviour, as AS-safety isn't guaranteed for subsequent TLS accesses either.

> The general problem is that unwinder bugs now become runtime CVEs if they can be chained in an exploit with a SIGSEGV.
>
> Just for reference, from our glibc NEWS: this isn't a hypothetical problem; we have had real CVEs here:

The __stack_chk_fail one is not relevant here; that function is used for stack corruption, which can only result from unsoundness. Stack overflows are not a memory safety issue, however, and nothing is corrupted: the thread is just frozen in time until we return from the signal handler. So once we categorize the SIGSEGV as a stack overflow, we are not in a situation where we particularly need to worry about corrupted state.

By my reckoning, [2] if an exploitable memory safety bug is bad enough to be able to cause a SIGSEGV, masquerade it as a stack overflow, and trigger a bug in the stack overflow handler, then the bug is bad enough that it can be used to do pretty much anything even without exploiting another bug.

> jmp_buf jmp;
> setjmp(jmp);

The problem isn't getting the mcontext_t (that's passed to us by the kernel); it's retrieving the stack pointer from it. Almost every platform does that in a different way.

Footnotes

  1. Oh man, we should really document those somewhere...

  2. Based on very limited experience with exploitable memory safety issues – it's just something we rarely get in Rustland. So please feel free to correct me!

@the8472 (Member) commented Jan 15, 2025

We discussed this during today's T-libs meeting. While we do want the stack overflow debugging experience to improve, we think this approach is on too shaky ground.

Publishing official documentation on how to get stack traces with debuggers, and changing the static message to include a link to it, was discussed as a simpler alternative. The other alternative, as already mentioned in this thread, would be out-of-process backtraces.

@the8472 (Member) commented Jan 15, 2025

@rfcbot fcp close

@rfcbot commented Jan 15, 2025

Team member @the8472 has proposed to close this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot added labels on Jan 15, 2025: proposed-final-comment-period (Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off.), disposition-close (This PR / issue is in PFCP or FCP with a disposition to close it.)
@bors (Contributor) commented Jan 15, 2025

☔ The latest upstream changes (presumably #135540) made this pull request unmergeable. Please resolve the merge conflicts.

@Amanieu Amanieu removed the I-libs-nominated Nominated for discussion during a libs team meeting. label Jan 22, 2025
Successfully merging this pull request may close these issues.

Great stack overflow error messages