Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ThinLTO generates bad DWARF for libtest #45511

Closed
philipc opened this issue Oct 25, 2017 · 34 comments · Fixed by #46008
Closed

ThinLTO generates bad DWARF for libtest #45511

philipc opened this issue Oct 25, 2017 · 34 comments · Fixed by #46008
Labels
A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) C-bug Category: This is a bug. P-medium Medium priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly.

Comments

@philipc
Copy link
Contributor

philipc commented Oct 25, 2017

nightly-2017-10-24 contains a libtest with bad DWARF:

$ /usr/bin/dwarfdump -i ~/.rustup/toolchains/nightly-2017-10-24-x86_64-unknown-linux-gnu/lib/libtest-f00235123717627d.so > /dev/null

/usr/bin/dwarfdump ERROR:  reference form with no valid local ref?!, offset=<0x00003085>:  DW_DLE_ATTR_FORM_OFFSET_BAD (119)

I can reproduce this if I recompile libtest locally:

$ cd src/libtest
$ RUSTFLAGS="-g -Ccodegen-units=16 -Zthinlto" cargo +nightly-2017-10-24 build --release
   Compiling term v0.0.0
   Compiling getopts v0.2.15
   Compiling test v0.0.0
    Finished release [optimized] target(s) in 8.77 secs
$ /usr/bin/dwarfdump -i ../target/release/libtest.so > /dev/null

/usr/bin/dwarfdump ERROR:  reference form with no valid local ref?!, offset=<0x0001b758>:  DW_DLE_ATTR_FORM_OFFSET_BAD (119)

The problem can be fixed by disabling thinlto (and keeping the codegen units is ok).

I noticed this with nightly-2017-10-24 because it is the first nightly compiled with thinlto, but the problem also occurs if I compile libtest with thinlto enabled using earlier compilers (eg nightly-2017-10-17).

Inspecting the DWARF, a DIE at that local ref offset does exist in another codegen unit, so the problem seems to be that we should be using a global ref instead.

@michaelwoerister
Copy link
Member

cc'ing @tromey since he's been poking around in LLVM's DWARF generation code recently.

@kennytm kennytm added A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) C-bug Category: This is a bug. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Oct 30, 2017
@nikomatsakis
Copy link
Contributor

Sorta smells like an LLVM bug -- but hard to say. Who should investigate?

@nikomatsakis
Copy link
Contributor

triage: P-medium

@alexcrichton if you are able to investigate as part of ThinLTO work would be great.

@alexcrichton
Copy link
Member

It's likely that this is the cause of #45731, which even if the underlying segfault were fixed would likely prevent backtraces from happening on OSX.

@alexcrichton
Copy link
Member

This is unfortunately not "automagically fixed" if we upgrade to LLVM 5

@alexcrichton
Copy link
Member

Making progress.

First up, minimized Rust source code via creduce to get:

$ cat foo.rs
pub fn foo(a: &mut Vec<String>) {
    a.clone();
}
$ rustc +nightly --version
rustc 1.23.0-nightly (f0fe716db 2017-10-30)
$ rustc +nightly --crate-name test -g -Z thinlto --crate-type dylib foo.rs -O -C codegen-units=16
$ dwarfdump -i libtest.so > /dev/null
dwarfdump ERROR:  reference form with no valid local ref?!, offset=<0x0000131c>:  DW_DLE_ATTR_FORM_OFFSET_BAD (119)

Next up let's see if we can reduce it to one object file:

$ rustc +nightly --crate-name test -g -Z thinlto --crate-type dylib foo.rs -O -C codegen-units=16 --emit obj
$ for x in *.o; do echo $x; dwarfdump -i $x > /dev/null; done
test.test0-8787f43e282added376259c1adb08b80.rs.rust-cgu.o

dwarfdump ERROR:  reference form with no valid local ref?!, offset=<0x0000131c>:  DW_DLE_ATTR_FORM_OFFSET_BAD (119)
test.test1-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test2-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test3-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test4-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test5-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test6-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test7-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test8-8787f43e282added376259c1adb08b80.rs.rust-cgu.o
test.test9-8787f43e282added376259c1adb08b80.rs.rust-cgu.o

Ok looks like only one object file is invalid!

Let's take a look the LLVM IR:

$ rustc +nightly --crate-name test -g -Z thinlto --crate-type dylib foo.rs -O -C codegen-units=16 --emit llvm-ir
$ llc test.test0-*.ll -filetype=obj -O0 -o foo.o
$ dwarfdump -i foo.o > /dev/null

dwarfdump ERROR:  reference form with no valid local ref?!, offset=<0x00001496>:  DW_DLE_ATTR_FORM_OFFSET_BAD (119)

Ok let's see what's happening by minimizing the LLVM IR:

$ cat test.sh
#!/bin/bash

set -ex
llc \
  -filetype=obj \
   -O0 \
   $1 \
   -o foo.o

set +ex

dwarfdump -i foo.o &> foo.tmp

set -ex

if grep -q 'dwarfdump ERROR:  reference form with no valid local ref' foo.tmp; then
  echo interesting
  exit 1
fi
exit 0
$ chmod +x test.sh
$ bugpoint -compile-custom -compile-command=./test.sh ./test.test0-*.ll
...
$ llvm-dis bugpoint-reduced-simplified.bc

yielding this IR which I cleaned up a bit.

I checked out LLVM trunk, frobbed the IR a bit, and looks like the error is still on LLVM trunk.

@alexcrichton
Copy link
Member

@rust-lang/compiler I'm not sure at this point if this is an LLVM bug, a rustc bug, or a ThinLTO bug. I've managed to extract what I believe is a "minimal LLVM IR" for this which reproduces the error. I'm hoping fixing that IR will fix all other related issues to this at this point in time.

For those of you more familiar with debug information (probably everyone other than me!) mind taking a look at that IR? Does it look "obviously wrong"? If it's "obviously wrong" I can try to work backwards to figure out what piece of the pipeline introduced such a wrong piece, but if it looks correct I can open an LLVM bug saying that we think it may be broken (and maybe they'll say it's "obviously wrong"!)

cc @michaelwoerister

@alexcrichton
Copy link
Member

Er, I went ahead and opened an upstream bug to go ahead and broaden the search.

@philipc
Copy link
Contributor Author

philipc commented Nov 6, 2017

The DWARF is definitely badly formed: it's using a unit-relative offset to reference the parameter type, when it should be using a section-relative offset because the type is in a different unit. The bad offset is occurring for a parameter to subprogram !155 in bugpoint-reduced-simplified.ll.

The problem is that subprograms !128 and !155 share the same scope: !50, but have different units (!11 and !0). Based on some print debugging, it appears that LLVM doesn't handle this case. So !50 and !128 are constructed in unit !11, but when !155 is constructed, it reuses the existing !50 DIE as its parent, and so it also ends up in unit !11. So my guess is that when the parameter is constructed, it uses the unit that was specified to determine if the parameter and type are in the same unit, and thus gets it wrong.

@alexcrichton
Copy link
Member

@philipc thanks for the info! I know basically nothing about DWARF though and malformed whatnots, so the IR here is heavily edited by both machines (creduce/bugpoint) and me (to make it more readable). My assumption is that if LLVM accepts IR and generates an object file it shouldn't yield such a dwarfdump error, but are you thinking that given possibly valid LLVM IR the dwarfdump error is possible? If so that may make reduction far more difficult...

Or do you think that this looks like upstream LLVM needs to get updated?

@philipc
Copy link
Contributor Author

philipc commented Nov 6, 2017

I don't think the editing of the IR caused the problem. I'll check later, but I think it's quite likely that the same problem occurs for the unedited IR.

I don't know if the LLVM developers intend for this to be valid IR, or if they are missing a validation check somewhere. I think they may have never encountered this possibility. clang divides code into units based on files, so I don't think it ever splits code in the same scope across multiple units, whereas I think rust splits code into units quite differently. That's a bit of conjecture on my part though, it might be better to ask someone who knows more (I only know about DWARF, not LLVM or rustc).

Reading the LLVM code a bit more, I think it will be possible to fix there by stopping it from sharing the scope (the scope creation is initiated here and the sharing check is here).

It's also worth reconsidering how rust splits code into units though. It might make sense to ensure all code for a given type is kept in the same unit.

@alexcrichton
Copy link
Member

Ah ok that all makes sense. Note though that rustc is very much like clang when it comes to debuginfo/codegen units. The major difference here is that LLVM is doing inlining across codegen units via ThinLTO, and that's likely where these new situations are arising from.

@philipc
Copy link
Contributor Author

philipc commented Nov 7, 2017

I see, yep my understanding of how rust partitions units was wrong. So when using ThinLTO, the additional units added by ThinLTO only contain information needed for the additional inlined functions? Do these inlined functions always share scopes/types with unit !0, or can we determine when they do? If so, it'd be nice to add them to unit !0 instead of creating additonal units. I think this would get us DWARF that looks the same as that generated for #[inline]. eg dwarfdump is happy if I run:

sed -i 's/unit: ![0-9]*/unit: !0/g' test.test0-*.ll

@alexcrichton
Copy link
Member

@philipc interesting! I'd have to look into the internals of ThinLTO to know for sure though. It seems like LLVM isn't ready (maybe?) for multiple codegen units in debug info in one LLVM module (a codegen unit), but it makes sense to me what you're thinking about having all inlined debug info share the same codegen unit information. Now where that change would go in LLVM I'm not entirely sure...

alexcrichton added a commit to alexcrichton/rust that referenced this issue Nov 10, 2017
Right now ThinLTO is generating bad dwarf which is tracked by rust-lang#45511, but this
is causing issues on OSX (rust-lang#45768) where `dsymutil` is segfaulting and failing to
produce output.

Closes rust-lang#45768
@philipc
Copy link
Contributor Author

philipc commented Nov 10, 2017

Does firefox use fat LTO? There's been a couple of bugs that have some similarities to this (#44412 and maybe #41775).

Also note that absence of dwarfdump errors doesn't mean the offsets are all okay. dwarfdump is only complaining if the offset is larger than the unit length, but even if it's less it may still incorrectly be a local ref. So there may be more instances that are unnoticed.

@michaelwoerister
Copy link
Member

cc @luser ^

@philipc
Copy link
Contributor Author

philipc commented Nov 11, 2017

I can reproduce using normal LTO. Library crate is:

pub struct Bar;

impl Bar {
    #[inline]
    pub fn baz(i: u32) {
        Self::qux(i);
    }

    pub fn qux(i: u32) {
        println!("{}", i);
    }
}

Binary crate is:

extern crate bar;

static I: u32 = 1;

fn main() {
    bar::Bar::baz(I);
}

Also here's a thin LTO version that gives smaller IR than the previous Vec test case:

pub static I: u32 = 1;

pub fn foo(i: u32) -> u32 {
    bar::Bar::baz(i)
}

mod bar {
    pub struct Bar;

    impl Bar {
        #[inline]
        pub fn baz(i: u32) -> u32 {
            Self::qux(i)
        }

        pub fn qux(i: u32) -> u32 {
            i * i
        }
    }
}

@alexcrichton
Copy link
Member

Thanks @philipc!

It definitely sounds like at this point this is an existing bug we've just never hit before. My inclination would be to paper over this if we can. Unfortunately though I don't know how to fix this at the LLVM IR layer. My naive idea would be to just point everything at one codegen unit for each LLVM module (debug-info-wise) but I don't know how to change existing debuginfo of a module.

@philipc
Copy link
Contributor Author

philipc commented Nov 12, 2017

I'm not able to access llvm.org currently, so not sure exactly what Adrian's suggestions were, but here's a patch that fixes LLVM to use a global ref when needed. It's not as simple as comparing parent DIEs, because the variable is initially created without a parent, so the check defaults to the current CU. I don't have a lot of confidence in the patch, but maybe it's a useful starting point for someone who knows more about LLVM.

@luser
Copy link
Contributor

luser commented Nov 13, 2017

We do use normal rustc LTO for Firefox builds, yes.

@tromey
Copy link
Contributor

tromey commented Nov 15, 2017

Would this be something you could do, @tromey? You can evaluate the changes to the generated DWARF best, I think

Sure, I will do this. I think the issue is more an LLVM thing than a DWARF thing, though; not certain I know enough there to say whether the patch is good, but I will at least do the shepherding. Also I wanted to point out that @philipc knows quite a bit about DWARF; he's one of the principals on https://github.com/gimli-rs/gimli

@tromey
Copy link
Contributor

tromey commented Nov 15, 2017

Actually maybe there's nothing to do here except merge this into rust-llvm? https://reviews.llvm.org/D39981

@tromey
Copy link
Contributor

tromey commented Nov 15, 2017

Yes, I think so. I'll do that.

@michaelwoerister
Copy link
Member

Cool, thanks!

@alexcrichton
Copy link
Member

@tromey oh no worries! I've tested it locally now and will submit a PR soon (with another change or two)

@tromey
Copy link
Contributor

tromey commented Nov 15, 2017

Ok, thanks @alexcrichton

@alexcrichton
Copy link
Member

I've opened up #46008 which I think should close this out (and also re-enables ThinLTO in a few locations)

alexcrichton added a commit to alexcrichton/rust that referenced this issue Nov 20, 2017
This commit updates LLVM to fix rust-lang#45511 (https://reviews.llvm.org/D39981) and
also reenables ThinLTO for libtest now that we shouldn't hit rust-lang#45768. This also
opportunistically enables ThinLTO for libstd which was previously blocked
(rust-lang#45661) on test failures related to debuginfo with a presumed cause of rust-lang#45511.

Closes rust-lang#45511
bors added a commit that referenced this issue Nov 22, 2017
rustbuild: Update LLVM and enable ThinLTO

This commit updates LLVM to fix #45511 (https://reviews.llvm.org/D39981) and
also reenables ThinLTO for libtest now that we shouldn't hit #45768. This also
opportunistically enables ThinLTO for libstd which was previously blocked
(#45661) on test failures related to debuginfo with a presumed cause of #45511.

Closes #45511
alexcrichton added a commit to alexcrichton/rust that referenced this issue Nov 24, 2017
This commit updates LLVM to fix rust-lang#45511 (https://reviews.llvm.org/D39981) and
also reenables ThinLTO for libtest now that we shouldn't hit rust-lang#45768. This also
opportunistically enables ThinLTO for libstd which was previously blocked
(rust-lang#45661) on test failures related to debuginfo with a presumed cause of rust-lang#45511.

Closes rust-lang#45511
bors added a commit that referenced this issue Nov 25, 2017
rustbuild: Update LLVM and enable ThinLTO

This commit updates LLVM to fix #45511 (https://reviews.llvm.org/D39981) and
also reenables ThinLTO for libtest now that we shouldn't hit #45768. This also
opportunistically enables ThinLTO for libstd which was previously blocked
(#45661) on test failures related to debuginfo with a presumed cause of #45511.

Closes #45511
ritiek added a commit to ritiek/rust that referenced this issue Nov 26, 2017
MIR: adopt borrowck test

Fix trailing whitespace

span_bug! on unexpected action

Make RegionVid use newtype_index!

Closes rust-lang#45843

Check rvalue aggregates during check_stmt in tycheck, add initial, (not passing) test

Fix failing test

Remove attributes and test comments accidentally left behind, add in span_mirbugs

Normalize LvalueTy for ops and format code to satisfy tidy check

only normalize operand types when in an ADT constructor

avoid early return

handle the active field index in unions

normalize types in ADT constructor

Fixes rust-lang#45940

Fix borrowck compiler errors for upvars contain "spurious" dereferences

Fixes rust-lang#46003

added associated function Box::leak

Box::leak - improve documentation

Box::leak - fixed bug in documentation

Box::leak - relaxed constraints wrt. lifetimes

Box::leak - updated documentation

Box::leak - made an oops, fixed now =)

Box::leak: update unstable issue number (46179).

Add test for rust-lang#44953

Add missing Debug impls to std_unicode

Also adds #![deny(missing_debug_implementations)] so they don't get
missed again.

Amend RELEASES for 1.22.1

and fix the date for 1.22.0

Rename param in `[T]::swap_with_slice` from `src` to `other`.

The idea of ‘source’ and ‘destination’ aren’t very applicable for this
operation since both slices can both be considered sources and
destinations.

Clarify stdin behavior of `Command::output`.

Fixes rust-lang#44929.

Add hints for the case of confusing enum with its variants

Add failing testcases

Add module population and case of enum in place of expression

Use for_each_child_stable in find_module

Use multiline text for crate conflict diagnostics

Make float::from_bits transmute (and update the documentation to reflect this).

The current implementation/documentation was made to avoid sNaN because of
potential safety issues implied by old/bad LLVM documentation. These issues
aren't real, so we can just make the implementation transmute (as permitted
by the existing documentation of this method).

Also the documentation didn't actually match the behaviour: it said we may
change sNaNs, but in fact we canonicalized *all* NaNs.

Also an example in the documentation was wrong: it said we *always* change
sNaNs, when the documentation was explicitly written to indicate it was
implementation-defined.

This makes to_bits and from_bits perfectly roundtrip cross-platform, except
for one caveat: although the 2008 edition of IEEE-754 specifies how to
interpet the signaling bit, earlier editions didn't. This lead to some platforms
picking the opposite interpretation, so all signaling NaNs on x86/ARM are quiet
on MIPS, and vice-versa.

NaN-boxing is a fairly important optimization, while we don't even guarantee
that float operations properly preserve signalingness. As such, this seems like
the more natural strategy to take (as opposed to trying to mangle the signaling
bit on a per-platform basis).

This implementation is also, of course, faster.

Simplify an Iterator::fold to Iterator::any

This method of once-diagnostics doesn't allow nesting

UI tests extract the regular output from the 'rendered' field in json

Merge cfail and ui tests into ui tests

Add a MIR pass to lower 128-bit operators to lang item calls

Runs only with `-Z lower_128bit_ops` since it's not hooked into targets yet.

Include tuple projections in MIR tests

Add type checking for the lang item

As part of doing so, add more lang items instead of passing u128 to the i128 ones where it doesn't matter in twos-complement.

Handle shifts properly

* The overflow-checking shift items need to take a full 128-bit type, since they need to be able to detect idiocy like `1i128 << (1u128 << 127)`
* The unchecked ones just take u32, like the `*_sh?` methods in core
* Because shift-by-anything is allowed, cast into a new local for every shift

incr.comp.: Make sure we don't lose unused green results from the query cache.

rustbuild: Update LLVM and enable ThinLTO

This commit updates LLVM to fix rust-lang#45511 (https://reviews.llvm.org/D39981) and
also reenables ThinLTO for libtest now that we shouldn't hit rust-lang#45768. This also
opportunistically enables ThinLTO for libstd which was previously blocked
(rust-lang#45661) on test failures related to debuginfo with a presumed cause of rust-lang#45511.

Closes rust-lang#45511

std: Flag Windows TLS dtor symbol as #[used]

Turns out ThinLTO was internalizing this symbol and eliminating it. Worse yet if
you compiled with LTO turns out no TLS destructors would run on Windows! The
`#[used]` annotation should be a more bulletproof implementation (in the face of
LTO) of preserving this symbol all the way through in LLVM and ensuring it makes
it all the way to the linker which will take care of it.

Add enum InitializationRequiringAction

Fix tidy tests
@da-x
Copy link
Member

da-x commented Jan 18, 2018

Hi,

I get a Dwarf Error with GDB 8.0.1 on one of the .so files of Rust 1.23.0 that was built was Rust 1.23.0 on Fedora 27, in a build enabled with debuginfo.

Is it related to this issue or should I open a new one?

[ 16:48 13 jupiter:1 ]$ gdb  /home/dan/var/rust/installed/bin/../lib/../lib/libsyntax-0faf2434b877299a.so
GNU gdb (GDB) Fedora 8.0.1-33.fc27
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/dan/var/rust/installed/bin/../lib/../lib/libsyntax-0faf2434b877299a.so...done.
(gdb) break  syntax::ext::base::ExtCtxt::monotonic_expander
Dwarf Error: Cannot find DIE at 0x61bae9 referenced from DIE at 0x638ed6 [in module /home/dan/var/rust/installed/lib/libsyntax-0faf2434b877299a.so]

Relevant build flags:

optimize = false
targets = "X86"
docs = false
sysconfdir = "etc"
debuginfo = true
debuginfo-lines = true

@alexcrichton
Copy link
Member

@da-x yeah I believe that's this issue, it should be fixed in 1.24 though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) C-bug Category: This is a bug. P-medium Medium priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants