Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

128-bit integer division with remainder is not combined to a single operation #44545

Open
henninglive opened this issue Sep 13, 2017 · 7 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@henninglive
Copy link

henninglive commented Sep 13, 2017

Rustc fails to combine div() and mod() operations with u128 to a single division with remainder on x64. The following code compiles down to separate calls to __udivti3() and __umodti3() when we would prefer a single call to __udivmodti4().

#![feature(i128_type, i128, test)]
extern crate test;

#[inline(never)]
fn divmod_u128(a: u128, b: u128) -> (u128, u128) {
    (a / b, a % b)
}

fn main() {
    println!("{:?}", divmod_u128(std::u128::MAX, test::black_box(1000)));
}

Related to #39078.
Relevant in the Display implementation for i128 and u128.

@kennytm
Copy link
Member

kennytm commented Sep 13, 2017

Note that the equivalent program in C:

typedef struct {
  unsigned __int128 a;
  unsigned __int128 b;
} UInt128Pair;

UInt128Pair divmod(unsigned __int128 a, unsigned __int128 b) {
  UInt128Pair result;
  result.a = a / b;
  result.b = a % b;
  return result;
}

Does not generate __udivmodti4 either with clang 5:

x86_64 assembly
divmod:                                 # @divmod
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        push    rax
        mov     r14, r8
        mov     r15, rcx
        mov     r12, rdx
        mov     r13, rsi
        mov     rbx, rdi

        mov     rdi, r13
        mov     rsi, r12
        mov     rdx, r15
        mov     rcx, r14
        call    __udivti3
        mov     qword ptr [rsp], rax    # 8-byte Spill
        mov     rbp, rdx

        mov     rdi, r13
        mov     rsi, r12
        mov     rdx, r15
        mov     rcx, r14
        call    __umodti3

        mov     qword ptr [rbx + 8], rbp
        mov     rcx, qword ptr [rsp]    # 8-byte Reload
        mov     qword ptr [rbx], rcx
        mov     qword ptr [rbx + 24], rdx
        mov     qword ptr [rbx + 16], rax

        mov     rax, rbx
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret

 

clang (trunk) does not call __umodti3, but computes it using a % b == a - (a / b) * b 😰, which behavior we will likely pick up when upgrading to LLVM 5.0.1 or 6.0.0.

x86_64 assembly
divmod:                                 # @divmod
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        mov     r14, r8
        mov     rbx, rcx
        mov     r15, rdx
        mov     r12, rsi
        mov     r13, rdi

        mov     rdi, r12
        mov     rsi, r15
        mov     rdx, rbx
        mov     rcx, r14
        call    __udivti3
        mov     rcx, rax
        mov     rsi, rdx

        imul    r14, rcx
        mov     rax, rcx
        mul     rbx
        add     rdx, r14
        imul    rbx, rsi
        add     rbx, rdx
        sub     r12, rax
        sbb     r15, rbx

        mov     qword ptr [r13 + 8], rsi
        mov     qword ptr [r13], rcx
        mov     qword ptr [r13 + 16], r12
        mov     qword ptr [r13 + 24], r15

        mov     rax, r13
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret

 

ICC 17 also won't combine the calls. GCC 7 does generate __udivmodti4. (GCC 6 still result in two function calls.)

@est31
Copy link
Member

est31 commented Sep 13, 2017

The saddest part of the story is probably the implementations of __udivti3 and __umodti3 both call __udivmodti4 internally...

I guess its more up to LLVM to fix the issue than to Rust, as from what I saw in the LLVM IR specification there is no combined "divrem" LLVM instruction to give modulo and the quotient.

@henninglive
Copy link
Author

That’s disappointing. Is it possible to manually call __udivmodti4()?

@kennytm
Copy link
Member

kennytm commented Sep 13, 2017

That would be rust-lang/rfcs#914.

@est31
Copy link
Member

est31 commented Sep 13, 2017

@henninglive sort of you could. My first impression: it would be more of a hack... Idk, what do others think about this?

@Mark-Simulacrum Mark-Simulacrum added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Sep 17, 2017
@nikic
Copy link
Contributor

nikic commented Oct 31, 2019

Still not generating udivmodti4: https://rust.godbolt.org/z/p-CxYF This is doing udivti3 + mul + sub.

@AaronKutch
Copy link
Contributor

danlark1 opened an LLVM issue for this.

@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants