Support debug info in Binaryen #2400

Closed
dschuff opened this issue Oct 20, 2019 · 19 comments

@dschuff
Member

dschuff commented Oct 20, 2019

We need to do this if binaryen-optimized binaries are to be debuggable. This currently includes any emscripten output at all (since everything gets run through emscripten-finalize). Currently we support source positions but nothing else. Since Binaryen can make arbitrary transformations, we will probably have to approach the problem similarly to how LLVM and other compilers do it.

@yurydelendik
Contributor

There is also similar discussion (limited to DWARF) in emscripten: emscripten-core/emscripten#8934 (comment)

As a first step, and so as not to over-complicate binaryen's logic, it might be beneficial to just track the change in instruction locations and wasm locations (memory, locals, globals, etc). Additional tools that understand a debug format can later apply this transform/delta to the original debug information, e.g. to custom DWARF sections.

@kripken
Member

kripken commented Oct 23, 2019

Some questions: Does DWARF basically get emitted as a bunch of custom sections in the wasm (or a separate file) where they refer to instructions by offset? Are those offsets absolute in the wasm, or relative to the code section? I think wasm supports multiple code sections (that's how gc-sections works?), if so, are the DWARF offsets basically "code section index, binary offset in that section"?

@yurydelendik
Contributor

Does DWARF basically get emitted as a bunch of custom sections in the wasm (or a separate file) where they refer to instructions by offset?

At this moment it is a bunch of custom sections that refer to instructions by offset. It's possible to move that to an external file.

Are those offsets absolute in the wasm, or relative to the code section?

Relative to the code section at this moment. (But it can be changed to be file-wide.)

I think wasm supports multiple code sections (that's how gc-sections works?), if so, are the DWARF offsets basically "code section index, binary offset in that section"?

At the moment of design, wasm supports only one code section. Object files contain relocation entries, which can originate from DWARF sections, and in a one-code-section wasm they will be adjusted by the linker. AFAIK binaryen ignores the relocation sections, though.

@kripken
Member

kripken commented Oct 24, 2019

Thanks @yurydelendik! I think I get it.

What do you think of the following idea: Binaryen in general decreases wasm code size, so by padding with nops we could make almost every original instruction be at the same offset it was before (by noting its original position when reading, then when writing add padding before each instruction, which is trivial; we may also want to disable some optimization passes that reorder things).

This means binaryen wouldn't decrease code size in builds with full debug info, but that doesn't sound so bad. (edit: it would still decrease gzip size ;)

If the DWARF were say 95% accurate, with only a few things in the wrong place, would that be useful enough?
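
A minimal sketch of the nop-padding idea described above, in Python rather than Binaryen's C++. It assumes each instruction carries the code-section offset it was originally read from. `write_padded` and the `(original_offset, encoded_bytes)` tuples are hypothetical names for illustration, not Binaryen APIs. An instruction that cannot be kept at its original offset (because earlier code grew or was reordered) is reported rather than fixed up, which is where the "only a few things in the wrong place" would come from.

```python
NOP = 0x01  # the wasm `nop` opcode


def write_padded(instructions):
    """Emit instructions, padding with nops to keep original offsets.

    instructions: list of (original_offset, encoded_bytes) in output order
                  (hypothetical representation, assumed for this sketch).
    Returns (output_bytes, misplaced), where misplaced lists the original
    offsets that could not be preserved.
    """
    out = bytearray()
    misplaced = []
    for original_offset, encoded in instructions:
        if len(out) <= original_offset:
            # Optimized code is shorter: fill the gap with nops.
            out.extend([NOP] * (original_offset - len(out)))
        else:
            # The output is already past the original offset.
            misplaced.append(original_offset)
        out.extend(encoded)
    return bytes(out), misplaced


# Three instructions that used to live at offsets 0, 5 and 6.
padded, misplaced = write_padded([(0, b"\x41\x00"), (5, b"\x1a"), (6, b"\x0b")])
```

Here `padded` keeps the second and third instructions at offsets 5 and 6 by inserting three nops, so DWARF that refers to those offsets would still point at the same instructions.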

@yurydelendik
Contributor

writing add padding before each instruction, which is trivial; we may also want to disable some optimization passes that reorder things

Correct, a non-reordering and non-growing code transformation is trivial, and DWARF can be preserved by the described idea. In the general case, though, users will not need this narrow use case. The promise of DWARF information is that it will be preserved and usable even with highly optimized code, e.g. to break and inspect the stack trace in optimized code, or to help analyze a stack dump. The bottom line is that debug information has to be created (complete or partially complete) if the user asks us to do that, regardless of the debug/release or optimization level choice.

@kripken
Member

kripken commented Oct 24, 2019

Do you think it's not possible to get 95% of debug info this way in an optimized build? Or are you saying even 95% is not enough?

@yurydelendik
Contributor

Do you think it's not possible to get 95% of debug info this way in an optimized build?

Not at all. I'm just saying that users would like to have debug info for the binary they will redistribute (instead of a special binary for debugging). I'm afraid that, in the end, the "padding" solution may turn out to be more complicated and less desirable for users than just tracking address and locals transforms and applying them to the pre-optimized DWARF.

@kripken
Member

kripken commented Oct 24, 2019

For redistribution I agree the extra size is not great. It would not affect gzip size, at least, but yeah, if shipping binaries built with full debug info is common, then this is not optimal.

This does sound much simpler than other options, though? It would take just a few hours to write. Are you saying the other options are also very quick to implement?

@yurydelendik
Contributor

It would take just a few hours to write.

It is okay to accept it as a temporary solution if it requires this amount of effort, IMO.

Are you saying the other options are also very quick to implement?

https://github.com/yurydelendik/wtmaps-utils is in a working state, though not production quality (no tests or documentation). It is designed to fit in emscripten's pipeline without any change in binaryen. It will also require more effort to be user friendly.

@dschuff
Member Author

dschuff commented Oct 25, 2019

It's definitely a common use case to build a completely-optimized shippable binary with debug info and then strip it out, shipping the stripped binary and archiving the binary with debug info. This gets you the best of both worlds: shipping the smallest thing, while allowing symbolization or debugging of stack traces and memory dumps in the wild.

I'm not sure the nop-padding would work well anyway. We'd either have to be very careful about which optimizations we run (no inlining, reordering, coalescing, etc) or do enough work to keep things working, which might be as hard as either modeling the debug info in the IR or tracking the deltas from all the transformations.

If I'm interpreting it right, wdwarf-cp records all the code addresses in the input and builds a map of input address -> output address, and then rewrites the dwarf with the different addresses? And it relies on Binaryen to track all the changes from source to destination?

I don't see anything in there about locals and other non-address things. For addresses this should mostly work already because Binaryen supports location data in the IR and input/output source maps. So we don't have to do anything when moving nodes around, and when creating/replacing nodes, we just need to copy source locations from the input. This is straightforward because an address for input source map info is always an address in the output as well.

But in order to make other info work, I think it would be harder. At minimum we'd need to track e.g. input and output local variables. But what if we do something like local coalescing; several locals each turn into a live range of a single local. There won't be one single correct mapping; different DIEs will get rewritten differently. Or any optimization that moves a local to a stack slot or vice versa. I don't know of any way to reason about stack slots other than locally.
It sounds like this approach could work, in that you'd get all of the debug info (even debug info you don't understand yet) transferred to something in the output (even if it's mangled somehow). If we assume that this is O0 output, then every variable lives on the stack and has well-defined values at sequence points, so maybe we don't need locals at all. But it would be limited.
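
To illustrate the local-coalescing case mentioned above, here is a small Python sketch. It assumes that we know, for each source variable, the address range in which its original local was live, and which output local each original local was coalesced into. All names (`build_location_lists`, `variables_in_local`, the tuple layouts) are hypothetical and do not reflect Binaryen's or DWARF's actual data structures. The point is only that the mapping becomes per address range rather than one global "old local -> new local" table.

```python
from collections import defaultdict


def build_location_lists(live_ranges, coalesced_to):
    """Build a per-variable list of (start, end, new_local) entries.

    live_ranges:  list of (variable, old_local, start, end), end exclusive.
    coalesced_to: dict mapping old local index -> new local index.
    """
    location_lists = defaultdict(list)
    for variable, old_local, start, end in live_ranges:
        new_local = coalesced_to[old_local]
        location_lists[variable].append((start, end, new_local))
    return dict(location_lists)


def variables_in_local(location_lists, local, address):
    """Return the variables held by `local` at `address`, if any."""
    return sorted(
        variable
        for variable, ranges in location_lists.items()
        for start, end, new_local in ranges
        if new_local == local and start <= address < end
    )


# Locals 0 and 1 (variables x and y) are coalesced into output local 0.
lists = build_location_lists(
    [("x", 0, 0, 10), ("y", 1, 10, 20), ("x", 0, 30, 40)],
    {0: 0, 1: 0},
)
```

With this input, output local 0 holds `x` at address 5, `y` at address 15, and no tracked variable at address 25, so each variable's debug entry would need its own range-based rewrite.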

Of course the other option is building up some debug info model in the IR, and then translating more and more of the input info into that as we understand more. That has the downside that you don't emit the info at all until you understand it.

@yurydelendik
Contributor

If I'm interpreting it right, wdwarf-cp records all the code addresses in the input and builds a map of input address -> output address, and then rewrites the dwarf with the different addresses? And it relies on Binaryen to track all the changes from source to destination?

Somewhat correct. The first step is to generate (using wtmaps) an "identity" source map that refers to the wasm file itself. Binaryen, which already understands the source map format, will then generate an updated source map after the wasm-opt run. As a third step, wdwarf-cp will use the pre-opt DWARF and that source map to generate the final DWARF.
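
The last step of that pipeline can be pictured with a short Python sketch. This is not wdwarf-cp's actual code. It assumes the updated source map has already been reduced to a plain dict from pre-opt address to post-opt address, and that the DWARF line table has been decoded into `(address, file, line)` rows. `rewrite_line_rows` and both data layouts are hypothetical.

```python
def rewrite_line_rows(line_rows, addr_map):
    """Rewrite decoded line-table rows using an old -> new address map.

    line_rows: list of (old_address, source_file, line) tuples.
    addr_map:  dict mapping pre-opt address -> post-opt address; an
               address missing from the map means the instruction was
               removed by the optimizer.
    """
    rewritten = []
    for old_address, source_file, line in line_rows:
        new_address = addr_map.get(old_address)
        if new_address is None:
            continue  # nothing left in the output to attach this row to
        rewritten.append((new_address, source_file, line))
    # Optimizations may reorder code, so restore address order.
    rewritten.sort(key=lambda row: row[0])
    return rewritten


rows = [(0x10, "a.c", 3), (0x14, "a.c", 4), (0x18, "a.c", 5), (0x20, "a.c", 9)]
moved = {0x10: 0x08, 0x14: 0x0A, 0x20: 0x04}
new_rows = rewrite_line_rows(rows, moved)
```

In this example the row for address `0x18` is dropped because its instruction no longer exists, and the remaining rows are re-sorted by their new addresses.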

I'm not sure the nop-padding would work well anyway

Since the identity map was based on the exact addresses of the instructions, the debug information can be restored with a great degree of success even after advanced optimizations.

I don't see anything in there about locals and other non-address things.

Correct. There is nothing in binaryen to track changes to locals, and it will be tricky (but possible) to embed that transform into source maps. That work remains to be done.

@kripken
Member

kripken commented Oct 29, 2019

For locals, sounds like we'd need to add tracking in binaryen. That's probably not too hard as only 3 or so passes do major changes to them (simplify-locals, coalesce-locals, merge-locals).

For instruction addresses, I'm not sure I've understood the plan here. Is it that Binaryen emits metadata of "the instruction that began at address 1234 in the binary is now at 5678 in the new binary", for each instruction? I don't quite understand how that relates to source maps support?

@yurydelendik
Contributor

My plan was not to touch binaryen's logic and to use source maps as a container for the information. I am planning to document the approach and algorithm at https://github.com/yurydelendik/wtmaps-utils .

"the instruction that began at address 1234 in the binary is now at 5678 in the new binary".. I don't quite understand how that relates to source maps support?

It does not have to be a source map format (though even tracking the movement of locals can be done with source maps). In my case a source map was selected just to store the information mentioned above, so it can be freely passed between multiple tools (wasm-opt, wasm-dis, wasm-as, etc.). It can be done with a different format or via memory/API.

@kripken
Member

kripken commented Oct 29, 2019

I think I see now, thanks @yurydelendik !

@kripken
Member

kripken commented Nov 7, 2019

Some investigation:

  • LLVM from a few months ago cannot properly read new LLVM's debug info, llvm-dwarfdump shows errors. That by itself makes me think the best option here is to just use LLVM, and not any other dwarf implementation. However, otherwise I didn't see any worrying things in other tests, so maybe that's overly pessimistic? (e.g. I got pyelfutils to scan the dwarf sections in wasm files, and it seems to show the right output)
  • Looking at non-LLVM codebases libdwarf seems the best in C, but is LGPL, which may be an issue for us in Binaryen as some of our builds are static (binaryen.js).
  • Looking at integrating wtmaps-utils in emscripten, I filed some issues, but it's probable that I'm doing something wrong...

@kripken kripken self-assigned this Dec 10, 2019
kripken added a commit that referenced this issue Dec 19, 2019
This imports LLVM code for DWARF handling. That code has the
Apache 2 license like us. It's also the same code used to
emit DWARF in the common toolchain, so it seems like a safe choice.

This adds two passes: --dwarfdump which runs the same code LLVM
runs for llvm-dwarfdump. This shows we can parse it ok, and will
be useful for debugging. And --dwarfupdate writes out the DWARF
sections (unchanged from what we read, so it just roundtrips - for
updating we need #2515).

This puts LLVM in thirdparty which is added here.

All the LLVM code is behind USE_LLVM_DWARF, which is on
by default, but off in JS for now, as it increases code size by 20%.

This current approach imports the LLVM files directly. This is not
how they are intended to be used, so it required a bunch of
local changes - more than I expected actually, for the platform-specific
stuff. For now this seems to work, so it may be good enough, but
in the long term we may want to switch to linking against libllvm.
A downside to doing that is that binaryen users would need to
have an LLVM build, and even in the waterfall builds we'd have a
problem - while we ship LLVM there anyhow, we constantly update
it, which means that binaryen would need to be on latest llvm all
the time too (which otherwise, given DWARF is quite stable, we
might not need to constantly update).

An even larger issue is that as I did this work I learned about how
DWARF works in LLVM, and while the reading code is easy to
reuse, the writing code is trickier. The main code path is heavily
integrated with the MC layer, which we don't have - we might want
to create a "fake MC layer" for that, but it sounds hard. Instead,
there is the YAML path which is used mostly for testing, and which
can convert DWARF to and from YAML and from binary. Using
the non-YAML parts there, we can convert binary DWARF to
the YAML layer's nice Info data, then convert that to binary. This
works, however, this is not the path LLVM uses normally, and it
supports only some basic DWARF sections - I had to add ranges
support, in fact. So if we need more complex things, we may end
up needing to use the MC layer approach, or consider some other
DWARF library. However, hopefully that should not affect the core
binaryen code which just calls a library for DWARF stuff.

Helps #2400
kripken added a commit that referenced this issue Dec 19, 2019
Optionally track the binary format code section offsets,
that is, when loading a binary, remember where each IR
node was read from. This is necessary for DWARF
debug info, as these are the offsets DWARF refers to.

(Note that eventually we may want to do something
else, like first read the DWARF and only then add
debug info annotations into the IR in a more LLVM-like
manner, but this is more straightforward and should be
enough to update debug lines and ranges).

This tracking adds noticeable overhead - every single
IR node adds an entry in a map - so avoid it unless
actually necessary. Specifically, if the user passes in
-g and there are actually DWARF sections in the
binary, and we are not about to remove those sections,
then we need it.
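
As a rough Python sketch of the condition just described (not the actual Binaryen code): the function and parameter names are hypothetical, and DWARF sections are assumed to be the custom sections whose names start with `.debug_`.

```python
def needs_offset_tracking(debug_info_requested, section_names, will_strip_dwarf):
    """Decide whether to record code section offsets while reading.

    debug_info_requested: True if the user passed -g.
    section_names:        names of the custom sections in the binary.
    will_strip_dwarf:     True if the DWARF sections are about to be removed.
    """
    has_dwarf = any(name.startswith(".debug_") for name in section_names)
    return debug_info_requested and has_dwarf and not will_strip_dwarf
```
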

Print binary format code section offsets in text, when
printing with -g. This will help debug and test dwarf
support. It looks like

;; code offset: 0x7

as an annotation right before each node.
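
For testing, such annotations could be collected from the printed text with a few lines of Python. This sketch assumes only the `;; code offset: 0x…` comment format shown above. The sample text and the `code_offsets` helper are made up for illustration and are not taken from real tool output.

```python
import re

# Matches a line consisting solely of a code offset annotation.
CODE_OFFSET = re.compile(r"^\s*;; code offset: (0x[0-9a-fA-F]+)\s*$")


def code_offsets(wat_text):
    """Return the annotated code offsets, in order of appearance."""
    offsets = []
    for line in wat_text.splitlines():
        match = CODE_OFFSET.match(line)
        if match:
            offsets.append(int(match.group(1), 16))
    return offsets


# Hypothetical snippet of text printed with -g.
sample = """(func $f
 ;; code offset: 0x7
 (nop)
 ;; code offset: 0x1a
 (drop (i32.const 0))
)"""
found = code_offsets(sample)
```
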

Also add support for -g in wasm-opt tests (unlike
a pass, it has just one - as a prefix).

Helps #2400
@kripken
Member

kripken commented Dec 20, 2019

An update: with #2545 (not yet merged at this time) we can update DWARF debug line info in binaryen.

@kripken
Member

kripken commented Dec 21, 2019

Ok, with #2545 + emscripten-core/emscripten#10092 I can emit a wasm binary from emscripten that looks like it has valid DWARF debug line info (using -gforce_dwarf, the current temporary option for it). Reading the dwarfdump info, it looks correct to me, even after the binaryen tools in the middle!

I'd like to test this more seriously. I tried to load it in chrome and firefox, but none of the devtools appear to load it. I tried bare clang as well, trying to reproduce this blogpost from @RReverser but I must be doing something wrong - it says "source map detected" but nothing else happens? (that's on stable and dev, 79 and 81).

@kripken
Member

kripken commented Mar 18, 2020

Closing this as in recent versions we have basically complete support for this: we read and write DWARF and it is valid as far as we know, even with optimizations.

A few optimization passes are currently disabled (things that mess with locals, mostly), and we can look into updating them depending on how important that is. But they only make a few % difference in code size, and since this is in debug builds, for most (but not all) use cases we should be good enough.

@kripken kripken closed this as completed Mar 18, 2020
@RReverser
Member

I tried bare clang as well, trying to reproduce this blogpost from @RReverser but I must be doing something wrong - it says "source map detected" but nothing else happens? (that's on stable and dev, 79 and 81).

Ah yeah, forgot to post an update here, so in case someone else comes across this thread: AFAIK we have figured it out, and found and fixed some issues.
