-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Debuginfo-based panic locations #2154
base: master
Are you sure you want to change the base?
Conversation
I am very strongly in favor of removing those panic messages from the binary because they waste space, cannot be stripped, and it is just a poor man's version of actual debuginfo. |
# Guide-level explanation | ||
[guide-level-explanation]: #guide-level-explanation | ||
|
||
* The Rust compiler simply continues to generate platform-compatible debug information. It gains a new switch that causes it to generate *external* debug info. What this means is that the binary itself is stripped - all debug info is written to a separate **symbol file** (in a platform-specific format). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this switch do on platforms that don't support external debug info?
What does this switch do on platforms where the debug info is already external anyway?
Should there be a negated version of this switch to force debug info to be internal?
What would the negated switch do on platforms that don't support internal debug info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As specified in the referencel-level explanation, this is not a flag but an additional value for the existing debuginfo
switch. The idea is that fine-grained control over what kind of debuginfo you want doesn't really make sense when it's external: Just take it or leave it, size shouldn't be a concern. I'm not really happy with this design, so bikeshed to your heart's content.
AFAIK all supported platforms are either DWARF or PDB. As DWARF can do both but PDB is always external, this basically means that debuginfo
levels 2 and 3 do the same thing on windows-msvc
.
[guide-level-explanation]: #guide-level-explanation | ||
|
||
* The Rust compiler simply continues to generate platform-compatible debug information. It gains a new switch that causes it to generate *external* debug info. What this means is that the binary itself is stripped - all debug info is written to a separate **symbol file** (in a platform-specific format). | ||
* Cargo's release profile enables external debug info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in fact one of the most common roadblocks people have on windows-msvc
. They have a release build and they get a panic backtrace which has symbols from std but not their own crate, because the linker always emits .pdb debug info, but rustc itself is not putting debug info in their crate. The solution is to add [profile.release] debug = true
to their Cargo.toml
and having this be the default would remove a major source of friction.
* Panic messages now always include a backtrace (instead of the old location string). | ||
* This means that the root cause of panics from `unwrap()` and similar interfaces are now much clearer. | ||
* It's simple and obvious as most modern languages print backtraces for unhandled errors. | ||
* The backtrace is generated from debug information. Without debug symbols, memory addresses are printed instead. The standard `addr2line` utility can be used to manually obtain location information once debug info is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would the equivalent utility be for windows-msvc
? This is already an issue in Rust today, so it's not like the RFC is causing a need for such a utility, but it sure would be nice to have a solution at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cross-platform equivalent of addr2line
appears to be llvm-symbolizer
(perhaps the RFC should refer to that instead?). It gained windows support in 2015. Its output is very similar to addr2line
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do addr2line and llvm-symbolizer work on position independent executables? I'm not sure how it would be able to based off of just a the binary and an address but I could be missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, perhaps I should have specified this: Naturally, these symbolization tools always work with addresses into the binary, not runtime memory addresses. You simply query the base/load address of each loaded object and subtract (my proof of concept assumes one single statically-linked binaries, for anything more complex you would of course have multiple baseptrs and figure out which one to subtract) that from your addresses before printing them.
You're right, without this step randomized addresses would be meaningless.
Can we actually put column info into debuginfo yet? There seem to be some bugs at least on some platforms connected to columns. I wouldn't want to regress on this. Edit: see also rust-lang/rust#42921 |
@est31 the DWARF format itself support columns. Not sure about PDB. |
Last I checked, I did not see an API in dbghelp to get the column number. That said, the only time lack of column numbers would actually be a regression is when you have multiple panics on the same line. Not multiple function calls which could panic, but multiple panics directly inline. |
The PDB format does support column numbers. However, in rust-lang/rust#42921 (comment) @est31 details why it's disabled right now:
LLVM uses the newer DIA debuginfo API which does provide column information. LLVM even forwards this in their own PDB API even though it's not used anywhere. Patching LLVM to write column end info as well can't be that hard (although the poking I just did suggests that it may be a lot harder than it seems, at least) and I don't see why they would reject that patch. Updating the panic runtime to use DIA instead of dbghelp may or may not be worth the effort, but all I'm trying to say is that getting column numbers is certainly possible. |
That's when the regression would matter, right. Having multiple panics on the same line is more often than you think tho. E.g. take code like |
@est31 Multiple panics on the same line in that case would only apply when the |
I'm in favor of building debug information by default and always generating stack traces instead of requiring setting I also think we should support multiple options of how much panic information to embed in binaries:
|
@Zoxc According to #2070 (comment) it's possible to optimize messages out if the handler doesn't use them, at least in the Your "partial backtraces for blamed functions" proposal's biggest drawback is that annotating a function with
It's a valid concern but I'm starting to think that it seems a lot worse than it really is. My experience with embedded platforms or other problematic environments is very limited, so I would very much like to hear other opinions on this but from what I can see basically everything that isn't MSVC is DWARF. And if your embedded device has any way to report errors at all why can't you just use addr2line from your cross toolchain to decode the backtrace addresses? |
Well it avoids problems when who to blame is unclear. It's also consistent with how other backtraces work. Also you won't get dropped into a some function without the path to the actual panic being clear, which is a very important point. Functions unknown to you might have
Manually typing in addresses in My toolchain may not even support generating debug information. Even if it does, it's probably buggy and I may have to write a linker script which correctly generates said debug information (so I'd need to know how DWARF works). This situation is not made better by the fact you may have to mix code for different modes on x86 (x86-16, x86-32, x86-64). Forking |
Meta question: does the RFC process include contributions to LLVM? As in, can the RFC state "change LLVM to include column/line end info"? And @main-- could you include something like that? From a rustc side, this should be manageable. rustc manages span end positions internally, it should be easy to pass them to LLVM. Edit: pointing out that if we include end column info, we should also include end line info, because if the span ends on a new line, it might highlight a wrong span. |
@Zoxc I imagine your workflow is already quite automated if copy-pasting addresses is a significant inconvenience. Extending that to automatically symbolize stack traces shouldn't be too hard, I hope?
Why not? Is rustc/LLVM unable to generate DWARF for some targets? If yes, why?
What do you mean? Of course if you're writing a kernel or something like that you always need a linker script to make sure the binary looks like it should. But that's a |
After doing some digging, it turns out that dbghelp on Windows does allow getting the column. Both The other big issue is inlined functions. Because walking the stack inherently does not know about inlined functions, the only way to display inlined functions in a backtrace is to ask the debug info for any inlined functions at that address. As it turns out, it seems dbghelp does have a way to get this information, by calling Of course the real issue now is whether LLVM is capable of putting column info and inlined frames into the debuginfo. It certainly can't hurt to have |
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
The single major drawback here is implementation complexity. Walking the stack is usually quite simple but parsing debug info to obtain symbols is not. On the other hand, backtraces already work on most platforms and are even essential to debugging for many users, so most of this is either already solved or worth the effort on its own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debuginfo is commonly 3-5x the size of the main binary. Relying on that much stuff to get line information seems like a drawback to me.
For reference, we build a service at work with level 1 debuginfo to have symbolicated stack traces. The binary is 53 MiB, of which 41 MiB is debuginfo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand debuginfo can easily be left out, whereas line information hardcoded into the binary is always there no matter what you do wasting space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much space do those hardcoded lines take up? Has anyone proposed a flag to turn off line!()
&co?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This RFC proposes to replace the file!()
+line!()
panic locations with backtraces so it's always "turned off" for panics.
I'm aware that debuginfo can be huge, I just don't see why that's a problem:
- This RFC doesn't change anything for debug builds.
- In release builds:
- Compile time is a lot less important
- Size of the binary itself doesn't change (debug info is external)
- Sure, the symbol file consumes disk space. But when is that a problem? You don't usually ship it to customers/production.
- In cases where you do want debuginfo in production for some reason (like the example you mentioned), you can still just ask the compiler to give you what you want.
So what does change?
- By default, you no longer print panic locations in production / to customers. The underlying rationale of both this RFC and RFC: Implicit caller location (third try to the unwrap/expect line info problem) #2091 (that, I think, everyone agrees with at this point) is that today's panic locations are rarely useful ("unwrap failed in option.rs") so I don't think this makes any difference.
- However (the big win): When you see a (raw, unsymbolicated) stack trace in a log file or a bug report from a customer you can now symbolicate it to get a full stack trace. If you have a core dump, you can just fire up a debugger and you see everything!
# Reference-level explanation | ||
[reference-level-explanation]: #reference-level-explanation | ||
|
||
**Symbol files** already exist (and work) on Windows and MacOS. On Linux, it's as simple as writing the main binary to one ELF image and all the debug sections to another. The existing `-C debuginfo` compiler options gains a third level "3 = full debug info in an external symbol file". The cargo release profile sets `debuginfo=3` (instead of `debuginfo=0`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The location of the debuginfo is orthogonal to its detail. Why couldn't I have out-of-binary level 1 debuginfo? It seems like it would make more sense for this to be a separate flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rationale here: #2154 (comment)
I'm not very happy with this either, I don't care about it. The only reason this isn't a new compiler flag is that I was too lazy to write up the whole "new flag, -Z/-C, stabilization, Cargo.toml" boilerplate.
|
||
The main purpose of panic location information as well as backtraces and debuggers is **debugging**: Finding and fixing the bug. Debugging code written in a compiled systems language is a hard problem but Rust is not unique here - the C ecosystem was facing exactly the same problem and debug info (in a platform-specific format like DWARF or PDB) exists to solve it. Of course, Rust is already leveraging this to support both backtraces and debuggers. But from this perspective it certainly seems very redundant to compile dedicated panic location strings when this information already exists in a binary's debug sections. | ||
|
||
Furthermore, there is no automated solution right now that could remove panic location strings from binaries. This is a problem for developers of closed-source software as it can, in some circumstances, faciliate reverse engineering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this RFC deal with this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is that if the standard panic handler just discards file!()
and line!()
, I would expect the optimizer to remove them (see #2070 (comment)). This is nice because it avoids having to deal with reworking all of those panic internals (although doing that is probably a cleaner and better solution).
Just like the design of the compiler flag, this is a detail that I wanted to gloss over until later rounds of feedback (it already took me a while to find the time to write up this first draft, no point in delaying it further for things that don't really matter when I'm just looking for feedback on the basic idea).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is std::panic::set_hook
going to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting from PanicInfo::location()
:
This method will currently always return
Some
, but this may change in future versions.
Payload still exists of course. And location can be resolved from debuginfo or just return None
when that fails. Adding a method to PanicInfo
that returns the entire stack trace seems like a good idea too.
I'm not so sure I like this unless we also create some very good symbol infrastructure. Both at Rust and in my past job, I often had to debug crashes at machines I don't have access to. If there are multiple versions of my executable (and perhaps, multiple versions of dynamic libraries) then tracking down the exact symbol file for the version that's installed on the machine is a non-trivial job, and assembling them correctly at my configuration so that I'll have correct line-numbers is a very non-trivial job. I hope that for static cargo builds this would be easier - you only need to track down the debuginfo for the static executable. |
Do we have panic backtrace support on the wasm platforms? It would be pretty sad to regress here. |
@est31 I guess no. In panic=abort mode, in theory, embedder might print call stack with source locations, provided that there is source maps in binary. I've not seen such support yet but I've not checked for a while. |
Thanks @pepyakin. IIRC there is source maps support for emscripten, but not for the unknown target, I've filed a bug for the unknown target. All this together makes me believe that this RFC should be postponed as there is serious implementation work still needing to be done. This includes:
Panic location info might be "poor man's debuginfo" but at least it is working consistently across platforms. We can extend them with support to strip them or put them into a separate file. This is much simpler than implementing any of the four points above. Even with this RFC merged, I support merging of #2091 because it would allow for debuginfo based panic emitters to display the first "semantic" place where you panic. With the current panic backtraces (and debuginfo based panic locations are no different here) you got a long list of functions, beginning with panic machinery. |
I don't really understand what you mean. The RFC process always was about discussing ideas before doing the implementation work.
Interestingly, this is the first time I've seen anyone question the core rationale behind both this RFC and #2091. What good is something that works when it just doesn't help?
I think something like the |
Correct me if I'm wrong, but the "implementation work" for this RFC consists of removing panic location info, updating docs, and nothing else, right? If that is being done, without fixing the bugs, we would have the regressions I described here. In the past RFCs have been postponed because of implementation work still needing to be done. This is nothing new. example 1, example 2. The support for debuginfo that we have is too shaky right now, so IMO it should improve before we switch to it. In the end, a change needs to be implementable in the real world.
I think so, all it needs is getting all the wasm targets emit sourcemaps. I'm an user of the unknown wasm target and don't have any line numbers in my backtraces. If the compiler switched towards debuginfo based panic locations, it would mean that I don't have any info at all. #2091 would greatly help me. When a panic occurs, the developer console does show you a "backtrace" consisting of a bunch of
Often panic loc info does help. E.g. when you encounter unimplemented!() or panic!() invocations inside your code. Sure, sometimes the file name and line/col numbers are useless, but often they actually point to the correct location. It's not black vs white. I don't question the core rationale behind #2091, it would certainly improve panic loc info. |
No, the panic runtime also needs changes to always print backtraces by default. And to me that includes fixing the remaining issues, probably by moving away from libbacktrace. My reasoning here is:
It is indeeed worse than I'd like it to be, but I feel like a big reason for that is that not a lot of things depend on it right now.
I understand but that's not what I'm proposing. Of course debuginfo has to work before the switch can happen. |
I don't say that we shouldn't fix the issues. We should! And once we've done it, we should have a separate look at this RFC.
DWARF might work on GNU/Linux, but backtraces are still not functional on OSX nor do we have column numbers on any target. |
I don't understand. At this point I'm just asking "Do we want this?". None of what I'm proposing / needs to happen is very hypothetical or written in the stars, all of this can and does work. When you suggest to "look at this RFC at a later point" what is your idea? The only reason I initially took the time to write up and post this RFC was that #2091 was approaching merge-FCP and since I'm sure that's the wrong approach I felt the need to present a counterproposal. All of this got stalled out by the impl period, I guess. But nevertheless I just don't see how the fact that a big part of the implementation work for this solution consists of fixing issues that should be fixed anyways is a reason to delay discussion about what's best.
Correct my if I'm wrong, but AFAICS backtraces were fully functional (rust-lang/rust#44251) on OSX at one point, but crashed due to that other bug where the compiler was emitting invalid DWARF (rust-lang/rust#45511) which ultimately led to a full revert (rust-lang/rust#45760) since @alexcrichton realized that due to rust-lang/rust#21889, libbacktrace cannot be used securely on OSX. Note how most problems (this, missing inliened frames, etc) actually tend to circle back to libbacktrace. All of that goes away if we switch to a different implementation, for example gimli addr2line (I contribute to that one). Don't forget that OSX is still just DWARF. Just like basically the entire Unix world is DWARF. Or any platform that builds with a GNU-based toolchain, which should cover most embedded devices. The world is DWARF. The only exceptions to this are Windows (PDB, but a non-issue since it already brings its own parser) and Web Stackmaps (arguably a very special and young case, as I already said I'm not familiar with that).
But ... it's not that hard. Again, all of the underlying formats support columns. By far the hardest part here is just a small patch to LLVM to take care of the Windows issue. |
It can work, but it doesn't. You said that the "implementation work" of this RFC
Once we've switched to that and we have column info everywhere, I believe we can merge this RFC. If the RFC listed all the blockers by name I'd be more comfortable. |
Regarding the |
We discussed this in a T-compiler meeting, though we did not have significant attendance. We decided that we would like to see this proposal as a MCP, specifically from someone (or a group of people) with the intent to see the implementation and design work through. We do not think that it requires a full RFC, though there may need to be an FCP for stabilization if we end up with an implementation that needs to be opt-in rather than just on by default. |
I'm locking this thread while I figure out what's going on. |
Discussed during a T-compiler steering meeting. The key three proposals the RFC seems to make are:
Of these, item 1 shipped in Rust 1.65 and the other two items are under the purview of T-cargo and T-libs. We suggest interested individuals reach out to those teams to determine the best way to begin pursuing the remaining items. |
Given that a lot of time has passed, I'd like to clarify the original intent of this RFC. The "heart" of this RFC was really more about shifting the direction where the ecosystem is going, rather than the hard technical specifics. At the time, I saw an opportunity to use the sorry state of panic messages ("panic in option.rs line 335") as inspiration/motivation for the Rust ecosystem as a whole to embrace unwinding and stack traces for debugging. Today we have things like
And to be completely honest with you here, this just makes me sad. It makes me feel like the ship has sailed on this RFC. The whole point of this proposal was to prove that if done right, backtraces are fast and cheap! Obviously not run-this-in-a-tight-loop-without-noticing-cheap, but more than cheap enough for handling any type of application error. The problem is that to this day, people still conflate the process of capturing a stack trace (i.e. walking the stack to grab a couple of addresses) with the process of symbolicating that stack trace (which first of all requires having the full debuginfo, then parsing it and finding the symbols, potentially demangling those, etc). The second part is the really expensive one. The first part is not. Your list is arguably correct, but it almost makes it sound like this RFC is concerned with implementation details, when the exact opposite is true! This really aims to solve two distinct, but related problems:
|
Rendered
For a quick mockup of the basic idea, have a look at this.