
rustc fails to remove dead code in exactly one instance of multiple similar code #48627

Closed
glandium opened this issue Mar 1, 2018 · 16 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@glandium
Contributor

glandium commented Mar 1, 2018

I made a mistake in some code, which essentially made everything statically panic. Which is fine. The problem is that in exactly one instance of the same kind of code, rustc actually failed to remove the non-panicking branch. Note this /might/ be a duplicate of #48253. At least, similarly to #48253, it appears to be a regression in 1.24.0, according to godbolt.

A copy/pastable-to-godbolt and reduced version of the mistaken code is:

#[allow(non_camel_case_types)] 
pub enum c_void {
    // Two dummy variants so the #[repr] attribute can be used.
    #[doc(hidden)]
    __variant1,
    #[doc(hidden)]
    __variant2,
}

pub extern "C" fn malloc(size: usize) -> *mut c_void {
    FUNCS.malloc.unwrap()(size)
}
pub extern "C" fn calloc(nmemb: usize, size: usize) -> *mut c_void {
    FUNCS.calloc.unwrap()(nmemb, size)
}
pub extern "C" fn realloc(ptr: *mut c_void, size: usize) -> *mut c_void {
    FUNCS.realloc.unwrap()(ptr, size)
}

pub struct MallocTable {
    malloc: Option<extern "C" fn(usize) -> *mut c_void>,
    calloc: Option<extern "C" fn(usize, usize) -> *mut c_void>,
    realloc: Option<extern "C" fn(*mut c_void, usize) -> *mut c_void>,
}

pub static FUNCS: MallocTable =
    MallocTable{malloc: None,
                calloc: None,
                realloc: None};

This generates the following assembly (skipping the data fields):

example::malloc:
  push rbp
  mov rbp, rsp
  mov rax, qword ptr [rip + example::FUNCS@GOTPCREL]
  mov rax, qword ptr [rax]
  test rax, rax
  je .LBB0_1
  call rax
  pop rbp
  ret
.LBB0_1:
  lea rdi, [rip + .Lref.2]
  call core::panicking::panic@PLT
  ud2
  ud2
  ud2

example::calloc:
  push rbp
  mov rbp, rsp
  lea rdi, [rip + .Lref.2]
  call core::panicking::panic@PLT
  ud2
  ud2
  ud2

example::realloc:
  push rbp
  mov rbp, rsp
  lea rdi, [rip + .Lref.2]
  call core::panicking::panic@PLT
  ud2
  ud2
  ud2

Note how calloc and realloc are correctly reduced to a direct panic, because FUNCS is immutable and initialized with None function pointers, so unwrap always panics, but the same doesn't happen for malloc. Interestingly, removing realloc makes malloc get optimized the same way as the others.
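The expected behavior can be checked at the source level with a minimal analogue (names here are hypothetical, not the original code): unwrapping a function pointer from a static initialized to None must always panic, so the optimizer should be able to reduce every such wrapper to a direct panic call, as it does for calloc and realloc above.

```rust
// Minimal analogue of FUNCS.malloc.unwrap()(size): a static Option<fn>
// that is always None, so unwrap() must always panic.
static FUNC: Option<fn(usize) -> usize> = None;

fn call(size: usize) -> usize {
    // With constant propagation, this should compile down to an
    // unconditional call into the panic machinery.
    FUNC.unwrap()(size)
}

fn main() {
    // Semantically, every call panics, regardless of how many similar
    // wrappers exist alongside it.
    let result = std::panic::catch_unwind(|| call(16));
    assert!(result.is_err());
    println!("panicked as expected");
}
```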

The code generated by 1.23.0 was:

example::malloc:
  push rbp
  mov rbp, rsp
  lea rdi, [rip + ref.2]
  call core::panicking::panic@PLT
  ud2

example::calloc:
  push rbp
  mov rbp, rsp
  lea rdi, [rip + ref.2]
  call core::panicking::panic@PLT
  ud2

example::realloc:
  push rbp
  mov rbp, rsp
  lea rdi, [rip + ref.2]
  call core::panicking::panic@PLT
  ud2
@pietroalbini pietroalbini added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Mar 6, 2018
@nikomatsakis
Contributor

triage: P-medium

Seems curious and worthy of fixing, but not a P-high problem. @glandium do you believe this optimization (or lack of it) is affecting other things?

@rust-highfive rust-highfive added the P-medium Medium priority label Mar 22, 2018
@glandium
Contributor Author

@glandium do you believe this optimization (or lack of it) is affecting other things?

I think it could be the same underlying regression as #48253 but it's just a gut feeling, not backed by evidence. I guess I could take a stab at finding a regression range for both.

@glandium
Contributor Author

Okay, so I confirmed both this and #48253 have the same regression range, which is the last days before 1.23 made it to beta: 1.23.0-beta.2 exhibits both regressions, and stable 1.23 doesn't.
Going through the list of changes between the beginning of December, when beta.2 was released, and the 1.23.0 tag in the git repository points to a very likely cause for the regressions having been fixed in 1.23: 4979bcc, which reverted the layout refactor that f50fd07 had apparently landed right before 1.23 merged to beta.

So this would be a regression from #45225, and #48253 is a dupe.

@glandium
Contributor Author

So, to see whether this had an impact on actual code, I tried with some work-in-progress code to replace https://dxr.mozilla.org/mozilla-central/source/memory/replace/logalloc/ which happened to be much slower when compiled with 1.24 compared to 1.23 (670ms vs 550ms to process 1.3M lines).

So I tried the nightly before and after #45225, and ... both were fast. Which is good news of a sort: despite the change demonstrably affecting codegen, it didn't seem to have a real impact. At least not on my code.

However, tracking it down, I identified the regression came from #46623, which is related. That one seems to have made std::fmt::Write functions slower, as well as things related to Option and Result, and probably more. I'm not sure whether I should open a separate issue.

@nikomatsakis
Contributor

@glandium wow, nice detective work!

@eddyb any clue on why #45225 or #46623 might have this effect?

@eddyb
Member

eddyb commented Mar 23, 2018

My initial guess is that LLVM is very finicky and sensitive to our exact enum representation.
However, in the examples, this doesn't look like a dead code elimination bug, but rather failing to fold a load from a constant, which seems even more suspicious?

Can we get LLVM IR for the relevant example & version combinations?
Assembly doesn't tell much of a story, while the LLVM IR probably has an obviously foldable load.

@glandium
Contributor Author

glandium commented Mar 23, 2018

LLVM IR for both the testcase here and the one in #48253, with nightly-2017-11-20:
https://gist.github.com/glandium/a73d38ae357154b194ec5eb9a4354730

with nightly-2017-11-21:
https://gist.github.com/glandium/7f0212770208d3a4105a7e275a10df2d

Those testcases don't seem to have been affected by #46623, so I'll try to reduce something from my code that got 20% slower.

At least, the LLVM IR is clear: it's not LLVM's doing.

@glandium
Contributor Author

Here's a reduced example that is not affected by #45225 but is by #46623:

#[inline(never)]
fn parse_hex<'a>(input: &'a [u8]) -> Option<usize> {
    let mut s = input;
    let mut result: Option<usize> = None;
    loop {
        if let Some((&c, remainder)) = s.split_first() {
            let d = match c {
                d @ b'0'...b'9' => d - b'0',
                d @ b'a'...b'f' => d + 10 - b'a',
                d @ b'A'...b'F' => d + 10 - b'A',
                _ => break,
            };
            result = Some(match result {
                Some(r) => r.checked_mul(16usize)
                    .and_then(|r| r.checked_add(d as usize))?,
                None => d as usize,
            });
            s = remainder;
        } else {
            break;
        }
    }
    result
}

fn main() {
    for _ in 1..10000000 {
        match parse_hex("85af342b1".as_bytes()) {
            Some(_) => continue,
            _ => break,
        }
    }
}

150ms with nightly-2017-12-15 vs 260ms with nightly-2017-12-16. Note that the inline(never) and the otherwise-pointless loop in main are there to keep the compiler from optimizing the parser away entirely, and to keep the IR small. Using a loop that parses random or sequential numbers, with no inline(never), shows a perf difference too.
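For readers trying this repro on a current toolchain, here is a sketch of the same parser using modern inclusive-range pattern syntax (`..=` instead of the long-deprecated `...`); the logic is unchanged.

```rust
// Same hex parser with modern range patterns; behavior is identical.
#[inline(never)]
fn parse_hex(input: &[u8]) -> Option<usize> {
    let mut result: Option<usize> = None;
    for &c in input {
        let d = match c {
            b'0'..=b'9' => c - b'0',
            b'a'..=b'f' => c + 10 - b'a',
            b'A'..=b'F' => c + 10 - b'A',
            _ => break, // stop at the first non-hex byte
        };
        // Accumulate one digit, bailing out of the whole function on overflow.
        result = Some(match result {
            Some(r) => r.checked_mul(16)?.checked_add(d as usize)?,
            None => d as usize,
        });
    }
    result
}

fn main() {
    assert_eq!(parse_hex(b"85af342b1"), Some(0x85af342b1));
    assert_eq!(parse_hex(b"ff"), Some(255));
    assert_eq!(parse_hex(b"xyz"), None);
    println!("ok");
}
```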

LLVM IR with nightly-2017-12-15:
https://gist.github.com/glandium/69008d47f7b2e3ecec06a082cd3b2dc3

with nightly-2017-12-16:
https://gist.github.com/glandium/b7bd384139b6a58fb68c03bc420a566f

@glandium
Contributor Author

And just in case, the LLVM IR with rustc built from the commit just before the merge of #46623, which is only trivially different from that of nightly-2017-12-15:
https://gist.github.com/glandium/45b90fe55e546edec1b5a9a452aba0aa
AFAICT, nightly-2017-12-16 was built from the merge of #46623, not a subsequent commit.

@eddyb
Member

eddyb commented Apr 10, 2018

LLVM IR for both the testcase here and the one ...

Thank you! My suspicion was correct:

@_ZN4test5FUNCS17h4349eb3b1be595e8E = local_unnamed_addr constant { [0 x i8], {}*, [0 x i8], {}*, [0 x i8], {}*, [0 x i8] } { [0 x i8] undef, {}* null, [0 x i8] undef, {}* null, [0 x i8] undef, {}* null, [0 x i8] undef }, align 8

; ...

  %0 = load i8*, i8** bitcast ({ [0 x i8], {}*, [0 x i8], {}*, [0 x i8], {}*, [0 x i8] }* @_ZN4test5FUNCS17h4349eb3b1be595e8E to i8**), align 8
  %1 = icmp eq i8* %0, null

LLVM should constant-fold these two and know that %0 is i8* null and %1 is i1 true.
I don't really know how it could fail such a trivial task (then again, their representation of constant data is a much poorer fit for this than, say, miri's). cc @rkruppe

@hanna-kruppe
Contributor

hanna-kruppe commented Apr 11, 2018

After a quick skim of the constant folding helper functions I can't see what causes this, but I have minimized the issue: the problem appears to be the [0 x i8] fields and/or the non-trivial initializer. Either dropping the [0 x i8] before the field being loaded or using zeroinitializer makes -constprop effective.

@jkordish jkordish added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Apr 18, 2018
@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 1, 2018
@nikic
Contributor

nikic commented Dec 1, 2018

I've submitted https://reviews.llvm.org/D55169 to fix this issue.

llvm-git-migration pushed a commit to llvm-git-prototype/llvm that referenced this issue Dec 11, 2018
Struct types may have leading zero-size elements like [0 x i32], in
which case the "real" element at offset 0 will not necessarily coincide
with the 0th element of the aggregate. ConstantFoldLoadThroughBitcast()
wants to drill down the element at offset 0, but currently always picks
the 0th aggregate element to do so. This patch changes the code to find
the first non-zero-size element instead, for the struct case.

The motivation behind this change is rust-lang/rust#48627.
Rust is fond of emitting [0 x iN] separators between struct elements to
enforce alignment, which prevents constant folding in this particular case.

The additional tests with [4294967295 x [0 x i32]] check that we don't
end up unnecessarily looping over a large number of zero-size elements
of a zero-size array.

Differential Revision: https://reviews.llvm.org/D55169

llvm-svn=348895
earl pushed a commit to earl/llvm-mirror that referenced this issue Dec 11, 2018
@nikic
Contributor

nikic commented Dec 12, 2018

Patch landed upstream.

@nikic nikic self-assigned this Dec 13, 2018
@mati865
Contributor

mati865 commented Feb 25, 2019

The patch has landed in Rust's LLVM.

@nikic
Contributor

nikic commented Jul 18, 2019

The original issue here is fixed, but LLVM is still doing something slightly weird here with the four consecutive ud2 instructions. IR for reference: https://rust.godbolt.org/z/5vRhI1

@nikic
Contributor

nikic commented May 23, 2020

This is long since resolved, and the redundant ud2s are no longer present either.

@nikic nikic closed this as completed May 23, 2020