Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#![feature(unchecked_math)] emits slow asm #91449

Closed
JakobDegen opened this issue Dec 2, 2021 · 8 comments
Closed

#![feature(unchecked_math)] emits slow asm #91449

JakobDegen opened this issue Dec 2, 2021 · 8 comments
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@JakobDegen
Copy link
Contributor

This code

#![feature(unchecked_math)]

pub fn sub(x: &u8) -> usize {
    unsafe { x.unchecked_sub(10) as usize }
}

currently emits

example::sub:
        mov     al, byte ptr [rdi]
        add     al, -10
        movzx   eax, al
        ret

which should instead be

example::sub2:
        movzx   eax, byte ptr [rdi]
        add     rax, -10
        ret

I unfortunately do not have the background to check whether its rustc or LLVM that's the cause of the issue; I'd be happy to file a bug against LLVM if it is on that side, but would need some pointers on how best to produce a useful bug report.

cc #85122

@inquisitivecrystal inquisitivecrystal added A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 2, 2021
@workingjubilee
Copy link
Member

#include <stdint.h>

int sub(int8_t x) {
    return x - 10;
}

Godbolt says clang can turn this into

sub:                                    # @sub
        lea     eax, [rdi - 10]
        ret

so I don't think it's an LLVM problem.

@JakobDegen
Copy link
Contributor Author

@workingjubilee

pub fn sub(x: u8) -> u8 {
    x - 10
}

also compiles into that same code, the deref seems to be relevant. Its also rather difficult to write a C equivalent, because unsigned overflow is always well-defined there and also you can't actually do math with 8 bit integers. The closest equivalent is probably

#include <stdint.h>

int sub(int8_t * x) {
    int8_t val = (int8_t) ((*x) - 10);
    if(val < 0){
        __builtin_unreachable();
    }
    return val;
}

which also compiles to the same thing as the Rust version; that being said, the C version should be more difficult for LLVM to reason about, since the subtraction is happening on a wider integer. Really someone should look at the LLVM IR that Rust produces and check if its a valid optimization on the basis of that.

@Patrick-Poitras
Copy link
Contributor

Patrick-Poitras commented Dec 5, 2021

@JakobDegen
I tried a version that derefs and converts to usize, then one that just derefs. It seems that

pub fn sub2(x :&u8) -> usize {
    unsafe {(*x as usize).unchecked_sub(10)}
}

gives the correct, shorter answer, but

pub fn sub3(x :&u8) -> usize {
    unsafe {(*x).unchecked_sub(10) as usize}
}

gives the slower version.

See this comparison on Godbolt.

@JakobDegen
Copy link
Contributor Author

@Patrick-Poitras that's mostly unsurprising, since

pub fn sub2(x: &u8) -> usize {
    (*x as usize) - 10
}

also gives the fast version

@YangKeao
Copy link

It seems that the LLVM doesn't do this optimization. For example, the following two c codes will also have the different result:

#include <stdint.h>

int64_t sub(int8_t* x) {
    int8_t some = *x - 10;
    return some;
}
#include <stdint.h>

int64_t sub(int8_t* x) {
    return *x - 10;
}

The first one will generate llvm-ir:

define dso_local i64 @sub(i8* nocapture readonly %0) local_unnamed_addr #0 {
  %2 = load i8, i8* %0, align 1, !tbaa !3
  %3 = add i8 %2, -10
  %4 = sext i8 %3 to i64
  ret i64 %4
}

While the second one will generate llvm-ir:

define dso_local i64 @sub(i8* nocapture readonly %0) local_unnamed_addr #0 {
  %2 = load i8, i8* %0, align 1, !tbaa !3
  %3 = sext i8 %2 to i64
  %4 = add nsw i64 %3, -10
  ret i64 %4
}

And the second one will get a shorter assembly code, which is unsurprising, as the load and sext will become sextload, and result in MOVSX64rm8 during the instruction selection.

@JakobDegen
Copy link
Contributor Author

@YangKeao I'm not sure I understand the example. Doing the optimization on the first example would be incorrect in C. That computation is actually allowed to overflow, because in C the operands get promoted to ints first before the subtraction happens, and then the result is truncated back when you do the assignment. Note that there's no nsw on the add in the first example.

@JakobDegen
Copy link
Contributor Author

Having more experience with LLVM now than when I wrote this, I've minimized the problem in LLVM and will file a bug report with them

@JakobDegen
Copy link
Contributor Author

As per advice on Zulip I'm closing this, since there's not much Rust can do in any case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants