
Possible optimization opportunity of integer multiplication by constant #75119

Closed
MineCake147E opened this issue Sep 6, 2022 · 20 comments
@MineCake147E
Contributor

Description

Unlike LLVM, RyuJIT is reluctant to optimize integer multiplication by a constant.

Example: Multiplying by 7

C#

using System;
public static class C
{
    public static ulong M7(ulong value) => value * 7;
}
C.M7(UInt64)
    L0000: imul rax, rcx, 7
    L0004: ret

C++ with Clang 16.0.0

#include <cstdint>

uint64_t M7(uint64_t value) { return value * 7; }
M7(unsigned long):                                 # @M7(unsigned long)
        lea     rax, [8*rdi]
        sub     rax, rdi
        ret
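The two sequences compute the same value; a quick C++ check of the shift-and-subtract identity Clang uses (function names here are my own, for illustration):

```cpp
#include <cstdint>

// Clang's sequence for * 7: multiply by 8 via the lea address scale,
// then subtract the original value once.
//   lea rax, [8*rdi]  ->  value * 8
//   sub rax, rdi      ->  value * 8 - value
uint64_t m7_lea(uint64_t value) { return (value << 3) - value; }

// Reference implementation using a plain multiplication (the imul form).
uint64_t m7_imul(uint64_t value) { return value * 7; }
```

Unsigned arithmetic wraps modulo 2^64 in both forms, so the identity holds for every input.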

Configuration

For unsigned integer constants less than 128 that LLVM optimizes without an imul instruction:
SharpLab
Compiler Explorer

Regression?

No

Data

Analysis

imul instruction with 8bit/32bit immediate value

This is what we are trying to avoid. According to uops.info, on typical x86 CPUs from both Intel and AMD, imul takes at least 3 clock cycles to produce its result, and while executing it occupies its execution port for about 1 cycle, delaying other instructions that can only run on that same port.

Ways of multiplying an integer by a constant without the imul instruction

Advanced lea sequence

lea stands for 'Load Effective Address'. The x86 lea instruction is intended for calculating memory addresses, but beyond that, it lets us compute some simple constant integer multiplications, and in some cases even an integer fused multiply-add.
Formulas for Effective Addresses can be one of:

  • $Displacement$
  • $Base + Displacement$
  • $(Index * Scale) + Displacement$
  • $Base + (Index * Scale) + Displacement$

The "Scale" can be 1, 2, 4, or 8, and the "Displacement" can be any constant integer immediate value, including 0.
The "Base" can be any general-purpose register, and the "Index" can be any general-purpose register except RSP.
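As an aside, here is a small sketch of my own (not compiler code) enumerating which multipliers a single lea of this shape can reach when the base and index are the same register:

```cpp
#include <cstdint>
#include <set>

// Multipliers m such that `lea rax, [reg*scale]` or `lea rax, [reg + reg*scale]`
// computes reg * m in one instruction (displacement left at 0, base == index).
std::set<uint64_t> single_lea_multipliers() {
    std::set<uint64_t> out;
    for (uint64_t scale : {1, 2, 4, 8}) {
        out.insert(scale);      // (Index * Scale) form
        out.insert(scale + 1);  // Base + (Index * Scale) form
    }
    return out;                 // {1, 2, 3, 4, 5, 8, 9}
}
```

The resulting set {1, 2, 3, 4, 5, 8, 9} is exactly why 3, 5, and 9 show up as the single-lea multipliers below.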

Using a single lea instruction, we can multiply an integer by certain constants such as 3, 5, and 9.
Current RyuJIT already handles these cases correctly:

using System;
public static class C
{
    public static ulong M3(ulong value) => value * 3;
    public static ulong M5(ulong value) => value * 5;
    public static ulong M9(ulong value) => value * 9;
}
; Core CLR 6.0.722.32202 on amd64
; value is in rcx
C.M3(UInt64)
    L0000: lea rax, [rcx+rcx*2]
    L0004: ret

C.M5(UInt64)
    L0000: lea rax, [rcx+rcx*4]
    L0004: ret

C.M9(UInt64)
    L0000: lea rax, [rcx+rcx*8]
    L0004: ret

But we can do even more.
For instance, let's see how two compilers optimize multiplication by 11, 15, and 19:

#include <cstdint>

uint64_t M11(uint64_t value) { return value * 11; }
uint64_t M15(uint64_t value) { return value * 15; }
uint64_t M19(uint64_t value) { return value * 19; }
# value is in rdi
M11(unsigned long):                                # @M11(unsigned long)
        lea     rax, [rdi + 4*rdi]                 # rax = 5 * rdi
        lea     rax, [rdi + 2*rax]                 # rax = rdi + rax * 2
        ret                                        # now rax is rdi + (5 * rdi) * 2 = rdi * 11
M15(unsigned long):                                # @M15(unsigned long)
        lea     rax, [rdi + 4*rdi]                 # rax = 5 * rdi
        lea     rax, [rax + 2*rax]                 # rax = rax * 3
        ret                                        # now rax is (5 * rdi) * 3 = rdi * 15
M19(unsigned long):                                # @M19(unsigned long)
        lea     rax, [rdi + 8*rdi]                 # rax = 9 * rdi
        lea     rax, [rdi + 2*rax]                 # rax = rdi + rax * 2
        ret                                        # now rax is rdi + (9 * rdi) * 2 = rdi * 19
using System;
public static class C
{
    public static ulong M11(ulong value) => value * 11;
    public static ulong M15(ulong value) => value * 15;
    public static ulong M19(ulong value) => value * 19;
}
; Core CLR 6.0.722.32202 on amd64

C.M11(UInt64)
    L0000: imul rax, rcx, 0xb   # It blindly executes `imul` instruction...
    L0004: ret

C.M15(UInt64)
    L0000: imul rax, rcx, 0xf   # It blindly executes `imul` instruction...
    L0004: ret

C.M19(UInt64)
    L0000: imul rax, rcx, 0x13  # It blindly executes `imul` instruction...
    L0004: ret

Here LLVM is doing something new: the $Base + (Index * Scale) + Displacement$ pattern is used twice, with Displacement set to 0, and with Base and Index sometimes being different registers. In other words, each lea multiplies an integer by 1, 2, 4, or 8 and then adds another integer to it.
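A brute-force sketch (my own illustration, not LLVM's instruction-selection algorithm) of what two chained lea instructions can reach. Each step maps the accumulator v to scale*v, to v + scale*v, or to original + scale*v, with scale in {1, 2, 4, 8}:

```cpp
#include <cstdint>
#include <set>

// Multipliers reachable with at most two lea instructions, modeling the three
// useful operand shapes: [s*acc], [acc + s*acc], and [orig + s*acc].
std::set<uint64_t> two_lea_multipliers() {
    const uint64_t scales[] = {1, 2, 4, 8};
    std::set<uint64_t> step1;
    for (uint64_t s : scales) {
        step1.insert(s);        // lea rax, [s*rdi]
        step1.insert(s + 1);    // lea rax, [rdi + s*rdi]
    }
    std::set<uint64_t> out = step1;
    for (uint64_t m : step1) {
        for (uint64_t s : scales) {
            out.insert(m * s);        // lea rax, [s*rax]
            out.insert(m * (s + 1));  // lea rax, [rax + s*rax]
            out.insert(m * s + 1);    // lea rax, [rdi + s*rax]
        }
    }
    return out;
}
```

Running this shows 11 (= 2*5 + 1), 15 (= 5*3), and 19 (= 2*9 + 1) are all reachable in two leas, matching the Clang output above, while constants such as 23 need a subtraction as well.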

shl followed by lea

Following a shl, the lea instruction can also multiply an integer by a power of 2 plus 1, 2, 4, or 8, though LLVM mysteriously neglects to do this with scale 8.
Let's see how two compilers optimize multiplication by 66, 68, and 70:

using System;
public static class C
{
    public static ulong M66(ulong value) => value * 66;
    public static ulong M68(ulong value) => value * 68;
    public static ulong M70(ulong value) => value * 70;
}
; Core CLR 6.0.722.32202 on amd64

C.M66(UInt64)
    L0000: imul rax, rcx, 0x42      # Needless to say, it blindly executes `imul` instruction.
    L0004: ret

C.M68(UInt64)
    L0000: imul rax, rcx, 0x44
    L0004: ret

C.M70(UInt64)
    L0000: imul rax, rcx, 0x46
    L0004: ret
#include <cstdint>

uint64_t M66(uint64_t value) { return value * 66; }
uint64_t M68(uint64_t value) { return value * 68; }
uint64_t M70(uint64_t value) { return value * 70; }
M66(unsigned long):                                # @M66(unsigned long)
        mov     rax, rdi
        shl     rax, 6                             # rax = rdi * 64
        lea     rax, [rax + 2*rdi]                 # rax = rax + rdi * 2
        ret                                        # now rax is rdi * 64 + rdi * 2 = rdi * 66
M68(unsigned long):                                # @M68(unsigned long)
        mov     rax, rdi
        shl     rax, 6                             # rax = rdi * 64
        lea     rax, [rax + 4*rdi]                 # rax = rax + rdi * 4
        ret                                        # now rax is rdi * 64 + rdi * 4 = rdi * 68
M70(unsigned long):                                # @M70(unsigned long)
        imul    rax, rdi, 70                       # IT EXECUTES `imul` INSTRUCTION!
        ret

shl followed by sub

A shl followed by a sub multiplies an integer by a power of 2 minus 1.
Executing sub twice instead multiplies by a power of 2 minus 2.
Let's see how two compilers optimize multiplication by 62 and 63:

using System;
public static class C
{
    public static ulong M62(ulong value) => value * 62;
    public static ulong M63(ulong value) => value * 63;
}
; Core CLR 6.0.722.32202 on amd64

C.M62(UInt64)
    L0000: imul rax, rcx, 0x3e
    L0004: ret

C.M63(UInt64)
    L0000: imul rax, rcx, 0x3f
    L0004: ret
#include <cstdint>

uint64_t M62(uint64_t value) { return value * 62; }
uint64_t M63(uint64_t value) { return value * 63; }
M62(unsigned long):                                # @M62(unsigned long)
        mov     rax, rdi
        shl     rax, 6
        sub     rax, rdi
        sub     rax, rdi
        ret
M63(unsigned long):                                # @M63(unsigned long)
        mov     rax, rdi
        shl     rax, 6
        sub     rax, rdi
        ret
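The identities behind Clang's two sequences, written out as a small C++ sketch (hypothetical helper names):

```cpp
#include <cstdint>

// Clang's M63: shl by 6, then subtract once.
// (value << 6) - value == value * 63  (power of 2 minus 1)
uint64_t m63_shl_sub(uint64_t value) { return (value << 6) - value; }

// Clang's M62: shl by 6, then subtract twice.
// (value << 6) - value - value == value * 62  (power of 2 minus 2)
uint64_t m62_shl_sub(uint64_t value) { return (value << 6) - value - value; }
```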

Multiplying by power of 2

When a composite number has one or more factors of 2 in its prime factorization, both compilers can optimize such a multiplication.
For instance, let's see how two compilers optimize multiplication by 6, 20, and 72:

using System;
public static class C
{
    public static ulong M6(ulong value) => value * 6;
    public static ulong M20(ulong value) => value * 20;
    public static ulong M72(ulong value) => value * 72;
}
; Core CLR 6.0.722.32202 on amd64

C.M6(UInt64)
    L0000: lea rax, [rcx+rcx*2]     # first multiply by 3,
    L0004: add rax, rax             # then `add rax, rax`
    L0007: ret

C.M20(UInt64)
    L0000: lea rax, [rcx+rcx*4]
    L0004: shl rax, 2
    L0008: ret

C.M72(UInt64)
    L0000: lea rax, [rcx+rcx*8]
    L0004: shl rax, 3
    L0008: ret
#include <cstdint>

uint64_t M6(uint64_t value) { return value * 6; }
uint64_t M20(uint64_t value) { return value * 20; }
uint64_t M72(uint64_t value) { return value * 72; }
M6(unsigned long):                                 # @M6(unsigned long)
        add     rdi, rdi                           # multiply by 2 first,
        lea     rax, [rdi + 2*rdi]                 # then multiply by 3
        ret
M20(unsigned long):                                # @M20(unsigned long)
        shl     rdi, 2
        lea     rax, [rdi + 4*rdi]
        ret
M72(unsigned long):                                # @M72(unsigned long)
        shl     rdi, 3
        lea     rax, [rdi + 8*rdi]
        ret

Interestingly, the two compilers have different ordering preferences: RyuJIT prefers to do the multiplication first, while LLVM prefers to do the shift first.
Both compilers do manage to optimize multiplication by 2 into add rdi, rdi (or whatever adds a register to itself), though.
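Either ordering implements the same factorization n = odd * 2^k; a sketch of the decomposition step (illustrative helper, not compiler code):

```cpp
#include <cstdint>
#include <utility>

// Split a multiplier n into (odd, k) with n == odd << k, mirroring how both
// compilers peel powers of two off the constant before picking a lea pattern.
std::pair<uint64_t, unsigned> split_pow2(uint64_t n) {
    unsigned k = 0;
    while (n != 0 && (n & 1) == 0) {
        n >>= 1;  // divide out one factor of 2
        ++k;
    }
    return {n, k};
}
```

For the examples above: 6 splits into (3, 1), 20 into (5, 2), and 72 into (9, 3), with the odd part handled by a single lea and the power of 2 by a shift (or add for k = 1).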

Combining these techniques

By combining the techniques described above, we can optimize integer multiplication for even more constant values.
Here is a list of formulas for integer multiplication by constant values less than 128. Only values that LLVM optimizes are included in this list.

| Number | RyuJIT | LLVM |
| --- | --- | --- |
| 3 | $(rdi + 2 * rdi)$ | $(rdi + 2 * rdi)$ |
| 5 | $(rdi + 4 * rdi)$ | $(rdi + 4 * rdi)$ |
| 7 | (imul) | $(8 * rdi) - rdi$ |
| 9 | $(rdi + 8 * rdi)$ | $(rdi + 8 * rdi)$ |
| 11 | (imul) | $rdi + 2 * (5 * rdi)$ |
| 13 | (imul) | $rdi + 4 * (3 * rdi)$ |
| 14 | (imul) | $(16 * rdi) - rdi - rdi$ |
| 15 | (imul) | $(5 * rdi) * 3$ |
| 17 | (imul) | $rdi + 16 * rdi$ |
| 19 | (imul) | $rdi + 2 * (9 * rdi)$ |
| 21 | (imul) | $rdi + 4 * (5 * rdi)$ |
| 22 | (imul) | $(rdi + 4 * (5 * rdi)) + rdi$ |
| 23 | (imul) | $(8 * (3 * rdi)) - rdi$ |
| 25 | (imul) | $5 * (5 * rdi)$ |
| 26 | (imul) | $rdi + 5 * (5 * rdi)$ |
| 27 | (imul) | $3 * (9 * rdi)$ |
| 28 | (imul) | $rdi + 3 * (9 * rdi)$ |
| 29 | (imul) | $rdi + rdi + 3 * (9 * rdi)$ |
| 30 | (imul) | $(32 * rdi) - rdi - rdi$ |
| 31 | (imul) | $(32 * rdi) - rdi$ |
| 33 | (imul) | $(32 * rdi) + rdi$ |
| 34 | (imul) | $(32 * rdi) + 2 * rdi$ |
| 37 | (imul) | $rdi + 4 * (9 * rdi)$ |
| 41 | (imul) | $rdi + 8 * (5 * rdi)$ |
| 45 | (imul) | $5 * (9 * rdi)$ |
| 48 | (imul) | $3 * (16 * rdi)$ |
| 62 | (imul) | $(64 * rdi) - rdi - rdi$ |
| 63 | (imul) | $(64 * rdi) - rdi$ |
| 65 | (imul) | $(64 * rdi) + rdi$ |
| 66 | (imul) | $(64 * rdi) + 2 * rdi$ |
| 68 | (imul) | $(64 * rdi) + 4 * rdi$ |
| 73 | (imul) | $rdi + 8 * (9 * rdi)$ |
| 80 | (imul) | $5 * (16 * rdi)$ |
| 81 | (imul) | $9 * (9 * rdi)$ |
| 96 | (imul) | $3 * (32 * rdi)$ |
| 126 | (imul) | $(128 * rdi) - rdi - rdi$ |
| 127 | (imul) | $(128 * rdi) - rdi$ |

Any multiplication by a power of two can be performed by a single shl instruction.
Multiplication by 3, 5, or 9 can be performed by a single lea instruction.
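One way to reproduce such a table is a small breadth-first search over the operations discussed above. This is my own brute force, not LLVM's instruction selection; it models each lea/shl/add/sub as one operation applied to an accumulator that starts at the original value:

```cpp
#include <cstdint>
#include <map>
#include <queue>

// Minimum instruction count (lea/shl/add/sub each counted as one op) needed to
// multiply by each constant up to `limit`; multiplying by 1 costs 0 ops.
std::map<uint64_t, int> min_ops(uint64_t limit) {
    std::map<uint64_t, int> cost{{1, 0}};
    std::queue<uint64_t> work;
    work.push(1);
    auto visit = [&](uint64_t m, int c) {
        if (m == 0 || m > limit || cost.count(m)) return;
        cost[m] = c;
        work.push(m);
    };
    while (!work.empty()) {
        uint64_t m = work.front();
        work.pop();
        int c = cost[m] + 1;
        for (uint64_t s : {2ull, 4ull, 8ull}) {
            visit(m * s, c);        // lea rax, [s*rax]
            visit(m * (s + 1), c);  // lea rax, [rax + s*rax]
            visit(m * s + 1, c);    // lea rax, [rdi + s*rax]
        }
        for (unsigned k = 1; k < 8; ++k)
            visit(m << k, c);       // shl rax, k
        visit(m + 1, c);            // add rax, rdi
        visit(m - 1, c);            // sub rax, rdi
    }
    return cost;
}
```

For instance this finds 7 in two ops (shift to 8, subtract once), 11 in two ops (lea to 5, then rdi + 2*rax), and 62 in three ops (shift to 64, subtract twice), matching the LLVM sequences above.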

@MineCake147E MineCake147E added the tenet-performance Performance related issue label Sep 6, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 6, 2022
@AntonLapounov AntonLapounov added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 6, 2022
@ghost

ghost commented Sep 6, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo EgorBo added good first issue Issue should be easy to implement, good for first-time contributors help wanted [up-for-grabs] Good issue for external contributors and removed untriaged New issue has not been triaged by the area owner labels Sep 6, 2022
@EgorBo EgorBo added this to the Future milestone Sep 6, 2022
@EgorBo
Member

EgorBo commented Sep 6, 2022

Thanks, it'd be nice to know the exact performance difference
Anyway, sounds like a good and relatively simple task for new jit contributors

@hopperpl

hopperpl commented Sep 6, 2022

This could be very tricky for performance analysis.

CPUs nowadays usually have multiple ALU "types", with one focusing on shifting and another on multiplication, while both allow addition. On top of that is the memory address calculation "ALU" used by lea. The best performance is achieved by distributing instructions across as many ALU subunits as possible.

  • imul nowadays has a 2 cycle latency with 1 cycle throughput

So, replacing any imul with more than 2 instructions should be slower (in theory). The JIT must then also account for other lea or shift instructions "around" that imul. For example, if the result of that imul instruction is not needed by the next instruction, and there are other independent shift instructions, then replacing imul with shl adds more pressure on the shifter ALU subunit while the multiply ALU subunit idles. Similar reasoning applies to memory read/write operations.

In my experience (which is 100% subjective), replacing imul with lea or shl only gives benefits when a single replacement instruction suffices and there are no other similar instructions nearby. The CPU opcode dispatcher is really good at handling this. I would only do such optimizations when the CPU cannot, like division by inverse multiplication, or when multiple instructions can be combined into a single one (e.g. C = A << 2 | B becomes lea C, [A * 4 + B] with B ∈ [0..3]), and otherwise let the CPU reorder and rearrange instructions to fit best.
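The combining example above checks out: when B fits in the low two bits, A << 2 | B and 4*A + B coincide, so a single lea can replace a shift plus an or. A quick sketch verifying the identity:

```cpp
#include <cstdint>

// C = (A << 2) | B  equals  lea C, [B + 4*A]  whenever B is in [0, 3],
// because the OR cannot carry into the shifted bits.
uint64_t shift_or(uint64_t a, uint64_t b) { return (a << 2) | b; }
uint64_t lea_form(uint64_t a, uint64_t b)  { return 4 * a + b; }
```

Outside that range the equivalence breaks; e.g. with a = 1, b = 4 the OR overlaps a set bit while the addition carries.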

The same btw applies to ARM and x64. I did cycle performance analysis on ARM 2 years ago and was able to show that ARM has 2 ALUs with a shifter and a multiplier, one in each, that can work in parallel.

@danmoseley
Member

Cc @tannergooding

@tannergooding
Member

I think hopperpl's analysis is pretty spot on and while this might provide improvements in some cases, it might also make it "worse" in others.

The TL;DR is I do expect there are a couple more scenarios where we could filter this down to "two instructions", much like we do for * 6 today. I don't expect any scenario where we'd change to 3 or more instructions would be better as that means more space, more decoding, and longer dependency chains that prevent other ops from also dispatching for something which takes the same or more time than imul.


imul reg, reg, imm is a decently fast instruction. Even going back to Haswell (~2013), which is the AVX2 baseline, we're looking at a 3 cycle, 1 micro-op, 1 reciprocal throughput instruction that can dispatch on port 1. Newer hardware may be even better.

lea can be faster, but comes with costs/considerations, such as how many "parts" of the addressing-mode capabilities are used. They have the option of encoding base, index/scale, and offset. Only the variants that take "one/two source operands" are actually "better" and can execute in 1 cycle, 1 micro-op, 0.5 reciprocal throughput with dispatch on port 1 or 5. The variants that take all three source operands are the same as imul and are 3 cycle, 1 micro-op, 1 reciprocal throughput instructions that can only dispatch on port 1. This also impacts certain scenarios on a micro-architectural basis, such as when base or index is EBP/RBP or R13, when using 16-bit addressing mode, or when using RIP-relative addressing.

You then have to consider the impact of these in relation to pipelining, code size, etc where LLVM has a lot of metadata and heuristics around "instruction selection" and will often go and generate the "idealized" sequence for any given scenario, even if that scenario will only be hit in less than 1% of code in practice.

What this often means is that LLVM codegen "looks good" even for leaf methods (methods which don't call anything else). This really makes small inlinable methods doing simple operations with some constant operands or small types look great. However, in practice, these functions will be inlined into the callsite and the codegen will often look completely different because you start having to take into account the surrounding code, the fact that small types incur partial mutation penalties, the fact that various instructions may have to compete with other instructions for ports, etc.

The JIT, in comparison, doesn't have the time to do all this analysis in depth and most of the time it won't provide a substantially measured benefit as compared to other higher level optimizations. This makes it harder to know when making the switch is "better".

@TIHan
Contributor

TIHan commented Oct 12, 2022

Here is the LLVM implementation of these optimizations:
https://raw.githubusercontent.com/llvm/llvm-project/main/llvm/lib/Target/X86/X86ISelLowering.cpp
Look for combineMul* functions.

@TIHan
Contributor

TIHan commented Oct 12, 2022

Here is the rationale from the AMD docs regarding imul and its replacement sequences:

A 32-bit integer multiplied by a constant has a latency of 3 cycles; a 64-bit integer multiplied by a 
constant has a latency of 4 cycles. For certain constant multipliers, instruction sequences can be 
devised that accomplish the multiplication with lower latency. Because AMD Family 15h processors 
contain only one integer multiplier but up to four integer execution units, the replacement code can 
provide better throughput as well.

Most replacement sequences require the use of an additional temporary register, thus increasing 
register pressure. If register pressure in a piece of code that performs integer multiplication with a 
constant is already high, it could be better for the overall performance of that code to use the IMUL 
instruction instead of the replacement code. Similarly, replacement sequences with low latency but 
containing many instructions may negatively influence decode bandwidth as compared to the IMUL 
instruction. In general, replacement sequences containing more than four instructions are not 
recommended.

The following code samples are designed for the original source to receive the final result. Other 
sequences are possible if the result is in a different register. Sequences that do not require a temporary 
register are favored over those requiring a temporary register, even if the latency is higher. To keep 
code size small, arithmetic-logic-unit operations are preferred over shifts. Similarly, both arithmetic-logic-unit operations and shifts are favored over the LEA instruction.

There are improvements in the AMD Family 15h processors’ multiplier over that of previous x86 
processors. For this reason, when doing 32-bit multiplication, only use the alternative sequence if the 
alternative sequence has a latency that is less than or equal to 2 cycles. For 64-bit multiplication, only 
use the alternative sequence if the alternative sequence has a latency that is less than or equal to 
3 cycles.

@hopperpl

hopperpl commented Oct 12, 2022

This describes the AMD Family 15h from 2011, among them Bulldozer, Piledriver, and Excavator, which all performed poorly compared to competing architectures. That design was eventually fully overhauled and replaced by the Zen architecture, now commonly known through the Ryzen and Epyc processors.

Based on the latest AMD recommendation:

The integer multiply unit can handle multiplies of up to 64 bits × 64 bits with 3 cycle latency, fully pipelined.

Fully pipelined here means the instruction scheduler can issue an integer multiplication every cycle, but each multiplication result becomes available only after the 3rd cycle. That matters only if the result is required immediately by the next instruction; meanwhile the CPU will execute other independent instructions out of order, or, with SMT, execute instructions from the co-thread. Furthermore, because of out-of-order execution, the multiplication may already have been started ahead of time if the two prior instructions were not also multiplications.

AMD has removed all recommendations for 64-bit constant multiplication replacement sequences from the latest optimization guides and only notes about the possibility to use lea/shift instead. Similarly, so has Intel in their optimization guides.

Throughput allows 1 multiplication per cycle; only latency can be an issue. But latency analysis is very, very hard to do, as it requires a full analysis of all execution units available in the running processor, of all surrounding instructions to account for out-of-order execution, and on top of that, of what the co-thread is doing when the CPU core is SMT enabled. (With SMT, every CPU core runs 2 execution threads, cycle-alternating; when one thread stalls on an instruction dependency or a memory wait, the other thread continues executing instructions instead.)

One final quote from the old optimization guides for 15h or lower architectures

For 64-bit multiplication, only use the alternative sequence if the alternative sequence has a latency that is less
than or equal to 3 cycles.

... which, based on the improved multiplication latency in the Zen architectures (down from 4-5 cycles to 3), now effectively means less than or equal to 2 cycles.


Try it yourself:

  • use Compiler Explorer at https://gcc.godbolt.org/
  • choose x86-64 gcc (trunk)
  • set parameter -O9 -march=znver1
    • znver1 is Zen 1
  • use the following code
#include <stdint.h>

int64_t multiply(int64_t num) {
    return num * 62;
}

The result is:

multiply(long):
        imul    rax, rdi, 62
        ret

Now remove the -march=znver1 and you get:

multiply(long):
        mov     rax, rdi
        sal     rax, 5
        sub     rax, rdi
        add     rax, rax
        ret

Let's make this more realistic with many multiplications:

#include <stdint.h>

int multiply(int64_t num) {
    int64_t a = num * 30;
    int64_t b = num * 62;
    int64_t c = num * 126;
    int64_t d = num * 254;
    int64_t e = num * 510;

    return a ^ b ^ c ^ d ^ e; // prevents inter-line optimization
}

-O9

multiply(long):
        mov     rdx, rdi
        mov     rax, rdi
        sal     rdx, 4                  #1
        sal     rax, 5                  #2
        sub     rdx, rdi                #3
        sub     rax, rdi                #4
        xor     eax, edx
        mov     rdx, rdi
        sal     rdx, 6                  #5
        sub     rdx, rdi                #6
        xor     eax, edx
        mov     rdx, rdi
        sal     rdx, 7                  #7
        sub     rdx, rdi                #8
        xor     eax, edx
        mov     rdx, rdi
        sal     rdx, 8                  #9
        sub     rdx, rdi                #10
        xor     eax, edx
        add     eax, eax                #11 -- nice trick, all multiplications are even so 2 was factored
        ret

Not counting the XOR instructions, as well as all MOV instructions (effectively free thanks to move elimination), we have 11 operations.

-O9 -march=znver1

multiply(long):
        imul    rdx, rdi, 62            #1
        imul    rax, rdi, 30            #2
        imul    rcx, rdi, 254           #3
        xor     eax, edx                
        imul    rdx, rdi, 126           #4
        imul    rdi, rdi, 510           #5
        xor     edx, ecx
        xor     eax, edx
        xor     eax, edi
        ret

Not counting the XORs again, we have 5 operations.

3 out of 5 multiplication results are consumed only after their latency has elapsed; only the 2nd and 4th multiplications stall the 1st/2nd XOR instruction by 1 cycle. But out-of-order execution will run the following multiplication in the meantime. The final XOR executes once the last multiplication result is ready.

Which version will run faster? This is extremely hard to tell: the shift and subtract instructions compete for execution units, and there are many register dependencies. Overall I think the 2nd is faster, as the ALUs idle more and can execute other instructions that follow.

edit-notes:

  • I made the mistake of letting multiply return int instead of int64_t, but that has no effect on the instructions beyond extending registers to 64-bit
  • on Zen3 the latency is down to 2, all stalls vanish

I cannot decide whether it's worth implementing micro-optimizations for very old architectures (over 10 years old). But in my opinion, if utmost performance is crucial, you do not run code on such old hardware; you replace it instead.

@TIHan
Contributor

TIHan commented Oct 12, 2022

Thank you for this analysis @hopperpl . Do you have a link to the latest documentation from AMD?

@hopperpl

The only publicly available guides for Zen 1/2/3 are 55723, 56305, and 56665. The latter two are for EPYC variants, but that is irrelevant here: each Zen generation uses the same core across all Ryzen and Epyc variants; the differences are only in I/O and cache configuration.

For the non-public version you have to file a request with AMD.

https://www.amd.com/system/files/TechDocs/55723_3_01_0.zip
https://www.amd.com/system/files/TechDocs/56305_SOG_3.00_PUB.pdf
https://www.amd.com/system/files/TechDocs/56665.zip

@tannergooding
Member

I cannot make a decision if it's worth to implement micro optimizations for very old architectures (over 10 years). But in my opinion, if utmost performance is crucial, then you do not run code on such old hardware and instead replace it.

Unfortunately, 10 years being "very old" isn't so clear cut. In the enterprise world that is generally the case, but in the consumer, school, and small-business markets you can often expect a decent percentage of users to be on this kind of hardware; the Steam Hardware Survey puts it at about 10-15% of users.

We need to find a balance between the two in our targets. Zen 1 is only 5 years old at this point, and it's likely "too new" a target to solely base these heuristics on. There are also newer considerations, such as the Alder Lake E-cores, where imul is once again 5 cycles rather than 3.

I imagine a 2-instruction replacement for all cases, and a 3-instruction replacement for 64-bit cases, is a reasonable goal. There may be some other cases worth playing around with and benchmarking, but we need to be careful and considerate about those.

For the non-public version you have to file a request with AMD.

We typically try to avoid private documents, as they significantly hinder our ability to use and refer to them in an OSS project. Ideally all references are publicly available under https://www.amd.com/en/support/tech-docs or https://developer.amd.com/resources/developer-guides-manuals/. For Intel this is generally https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

@svick
Contributor

svick commented Oct 12, 2022

@tannergooding At what point does it become worthwhile for the JIT to generate different code based on the architecture of the CPU it's running on? (Of course, there still has to be some default for AOT.)

@tannergooding
Member

tannergooding commented Oct 12, 2022

We already do that to some extent for SIMD (especially when selecting instruction encoding) and various intrinsic library functions providing "core" math functionality (like popcnt, lzcnt, tzcnt, etc).

However, these all add complexity to the testing matrix, and we ultimately need to be somewhat considerate of that. There are likely some micro-architecture-specific optimizations we could take advantage of, but we'd need real, concrete numbers showing the benefit, and showing that it would be a net negative to simply apply such an optimization "everywhere".

@TIHan
Contributor

TIHan commented Oct 16, 2022

I tested the replacement sequence mov,shl,sub against imul on a Ryzen 9 7950X, one of the most modern CPUs right now; it was released last month.

Here are the benchmarks:

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.674)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2


|              Method |      Mean |     Error |    StdDev |
|-------------------- |----------:|----------:|----------:|
|                imul |  5.291 ms | 0.0133 ms | 0.0125 ms |
|         mov_shl_sub |  3.523 ms | 0.0068 ms | 0.0061 ms |
|        imul_complex | 31.795 ms | 0.1466 ms | 0.1145 ms |
| mov_shl_sub_complex | 26.489 ms | 0.0437 ms | 0.0409 ms |

Benchmark source:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace Perf
{
    public class Multiply
    {
        [Benchmark]
        public void imul()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
                value = value * 7;
        }

        [Benchmark]
        public void mov_shl_sub()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
                value = value * 8 - value;
        }

        [Benchmark]
        public void imul_complex()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
            {
                value = value * 7;
                value = value * 7345;
                value = value * 7;
                value = value * 7345;
                value = value * 7;
                value = value * 7345;
            }
        }

        [Benchmark]
        public void mov_shl_sub_complex()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
            {
                value = value * 8 - value;
                value = value * 7345;
                value = value * 8 - value;
                value = value * 7345;
                value = value * 8 - value;
                value = value * 7345;
            }
        }
    }
    static class Program
    {
        static int Main()
        {
            BenchmarkRunner.Run<Multiply>();
            return 0;
        }
    }
}

Despite the replacement sequence using more instructions and increasing code size, it still performs significantly better than imul.

@TIHan
Contributor

TIHan commented Oct 16, 2022

Using a four-instruction replacement sequence does not seem to be worth it, though:

|      Method |     Mean |     Error |    StdDev |
|------------ |---------:|----------:|----------:|
|        imul | 5.358 ms | 0.0232 ms | 0.0217 ms |
| mov_shl_sub | 5.378 ms | 0.0401 ms | 0.0375 ms |
        [Benchmark]
        public void imul()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
                value = value * 14;
        }

        [Benchmark]
        public void mov_shl_sub()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
                value = value * 16 - value - value;
        }

@hopperpl

I ran the identical benchmark on an older Zen 1 and got very similar results.

AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-rc.1.22431.12
  [Host]     : .NET 7.0.0 (7.0.22.42610), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42610), X64 RyuJIT AVX2
|              Method |      Mean |     Error |    StdDev |
|-------------------- |----------:|----------:|----------:|
|                imul |  7.465 ms | 0.0010 ms | 0.0008 ms |
|         mov_shl_sub |  4.974 ms | 0.0078 ms | 0.0061 ms |
|        imul_complex | 44.922 ms | 0.1370 ms | 0.1214 ms |
| mov_shl_sub_complex | 37.337 ms | 0.0485 ms | 0.0454 ms |

But then I wanted to remove all the dependency chains, because all these benchmarks stall on the latency-3 multiplication. So I did some interleaving and reran the benchmark. Notice that the number of "multiplications" triples: there are now 30 million instead of the original 10 million.

  [Benchmark]
  public void imul_interleave()
  {
    ulong valueA = 2;
    ulong valueB = 2;
    ulong valueC = 2;
    for (var i = 10000000 ; i != 0 ; i--)
    {
      valueA = valueA * 7;
      valueB = valueB * 7;
      valueC = valueC * 7;
    }
  }

  [Benchmark]
  public void mov_shl_sub_interleave()
  {
    ulong valueA = 2;
    ulong valueB = 2;
    ulong valueC = 2;
    for (var i = 10000000 ; i != 0 ; i--)
    {
      valueA = valueA * 8 - valueA;
      valueB = valueB * 8 - valueB;
      valueC = valueC * 8 - valueC;
    }
  }

Here's the disassembly to verify everything is as expected:

C.imul_interleave()
    L0000: mov eax, 2
    L0005: mov edx, 2
    L000a: mov ecx, 2
    L000f: mov r8d, 0x989680
    L0015: nop [rax+rax]
    L0020: imul rax, 7
    L0024: imul rdx, 7
    L0028: imul rcx, 7
    L002c: dec r8d
    L002f: jne short L0020
    L0031: ret

C.mov_shl_sub_interleave()
    L0000: mov eax, 2
    L0005: mov edx, 2
    L000a: mov ecx, 2
    L000f: mov r8d, 0x989680
    L0015: mov r9, rax
    L0018: shl r9, 3
    L001c: sub r9, rax
    L001f: mov rax, r9
    L0022: mov r9, rdx
    L0025: shl r9, 3
    L0029: sub r9, rdx
    L002c: mov rdx, r9
    L002f: mov r9, rcx
    L0032: shl r9, 3
    L0036: sub r9, rcx
    L0039: mov rcx, r9
    L003c: dec r8d
    L003f: jne short L0015
    L0041: ret

And here are the results:

|                 Method |     Mean |     Error |    StdDev |
|----------------------- |---------:|----------:|----------:|
|        imul_interleave | 7.459 ms | 0.0100 ms | 0.0094 ms |
| mov_shl_sub_interleave | 7.470 ms | 0.0078 ms | 0.0073 ms |

This is truly a beautiful result. I love it.

While the number of multiplications tripled from imul to imul_interleave, the execution time did not change. This beautifully shows imul's latency of 3: the interleaved variant can execute 3 multiplications in parallel because each result is not needed right away.

On top of that, the number of shift/subtract instructions also tripled from mov_shl_sub to mov_shl_sub_interleave. But the multiple ALUs were easily able to fill the empty slots; that's why execution time went up by only 50%.

Furthermore, the CPU clocks at 4.0 GHz on average: 30 million imul / 4 billion cycles per second = 0.0075 s, or 7.5 ms.

Now, what does this all mean? Artificial benchmarks are not representative of real-world scenarios. I would still go with the shl/sub replacement code, as this benchmark at least shows that such operations can fill many of the empty slots the ALUs have available.

A 3-operation replacement sequence should benefit as well, as long as only 1 shift operation is involved (not counting lea).

@TIHan
Contributor

TIHan commented Oct 17, 2022

The "multiple multiplications without changing execution time" is definitely a great test.

I ran your benchmark with interleaving and this was my result on the 7950X:

|                 Method |     Mean |     Error |    StdDev |
|----------------------- |---------:|----------:|----------:|
|        imul_interleave | 5.375 ms | 0.0167 ms | 0.0140 ms |
| mov_shl_sub_interleave | 4.124 ms | 0.0165 ms | 0.0154 ms |

mov_shl_sub wins this one - but it's not entirely surprising considering the generation gap between the 2700X and 7950X.

I think it's pretty safe to say that doing the optimization is worth it. I'm glad we all went to this depth to vet it and provided data to back it up.

@hopperpl

The 7950X runs at around 5.7 GHz, so the math predicts 5.263 ms; that means multiplication is still latency-3 in Zen 4. The fact that shl/sub is still faster tells me that Zen 4 has two shift/shuffle units, whereas Zen 1 has only one. Actually, it was probably Zen 3 that added this: that core got a major integer-unit overhaul and a twice-as-fast integer division unit.

Overall, I agree, it will be beneficial to replace.

@TIHan
Contributor

TIHan commented Nov 5, 2022

Closing, as we have exhausted these optimizations for now. As an FYI, we did not implement every single optimization mentioned here, since some are no longer beneficial on modern hardware.

@TIHan TIHan closed this as completed Nov 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 5, 2022
@EgorBo EgorBo modified the milestones: Future, 8.0.0 Jan 2, 2023