Possible optimization opportunity of integer multiplication by constant #75119
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Description

Unlike LLVM, RyuJIT is reluctant to optimize integer multiplication by a constant.

Example: multiplying by 7

C#:

```csharp
using System;

public static class C
{
    public static ulong M7(ulong value) => value * 7;
}
```

```asm
C.M7(UInt64)
    L0000: imul rax, rcx, 7
    L0004: ret
```

C++ with Clang 16.0.0:

```cpp
#include <cstdint>
uint64_t M7(uint64_t value) { return value * 7; }
```

```asm
M7(unsigned long):                       # @M7(unsigned long)
    lea rax, [8*rdi]
    sub rax, rdi
    ret
```

Configuration

For unsigned integer constants less than 127, the multiplication is optimized by LLVM with no imul instruction.

Regression?

No

Data
| Number | RyuJIT | LLVM |
| --- | --- | --- |
| 3 | | |
| 5 | | |
| 7 | (imul) | |
| 9 | | |
| 11 | (imul) | |
| 13 | (imul) | |
| 14 | (imul) | |
| 15 | (imul) | |
| 17 | (imul) | |
| 19 | (imul) | |
| 21 | (imul) | |
| 22 | (imul) | |
| 23 | (imul) | |
| 25 | (imul) | |
| 26 | (imul) | |
| 27 | (imul) | |
| 28 | (imul) | |
| 29 | (imul) | |
| 30 | (imul) | |
| 31 | (imul) | |
| 33 | (imul) | |
| 34 | (imul) | |
| 37 | (imul) | |
| 41 | (imul) | |
| 45 | (imul) | |
| 48 | (imul) | |
| 62 | (imul) | |
| 63 | (imul) | |
| 65 | (imul) | |
| 66 | (imul) | |
| 68 | (imul) | |
| 73 | (imul) | |
| 80 | (imul) | |
| 81 | (imul) | |
| 96 | (imul) | |
| 126 | (imul) | |
| 127 | (imul) | |
Any multiplication by a power of two can be performed by a single `shl` instruction. Multiplication by 3, 5, or 9 can be performed by a single `lea` instruction.
Author: MineCake147E
Assignees: -
Labels: -
Milestone: -
Thanks, it'd be nice to know the exact performance difference.
This could be very tricky for performance analysis. Modern CPUs usually have multiple ALU "types", one focusing on shifting and another on multiplication, while both can do addition. On top of that sits the memory-address-calculation "ALU" used by `lea`. The best performance is achieved by distributing instructions across as many ALU sub-units as possible.

So, replacing an imul with more than 2 instructions should be slower (in theory). And the JIT must then also account for other lea or shift instructions "around" that imul. For example, if the result of that imul is not needed by the next instruction, and there are other independent shift instructions, then replacing imul with shl adds more pressure on the shifter ALU sub-unit while the multiply ALU sub-unit idles. Similar reasoning applies to memory read/write operations. In my experience (which is 100% subjective), replacing imul with lea or shl only gives benefits when there is a single replacement instruction and no other similar instructions nearby. The CPU's opcode dispatcher is really good at this. I would only do such optimizations if the CPU cannot do it itself, like division by inverse multiplication, or if multiple instructions can be combined into a single one (e.g. C = A << 2 | B --> lea C, [A * 4 + B] w/ B ∈ [0..3]), and otherwise let the CPU reorder and rearrange instructions to fit best. The same, by the way, applies to ARM and x64. I did cycle performance analysis on ARM 2 years ago and was able to show that ARM has 2 ALUs, one with a shifter and one with a multiplier, that can work in parallel.
I think hopperpl's analysis is pretty spot on, and while this might provide improvements in some cases, it might also make things "worse" in others. The TL;DR is I do expect there are a couple more scenarios where we could filter this down to "two instructions", much like we do for

You then have to consider the impact of these in relation to pipelining, code size, etc, where LLVM has a lot of metadata and heuristics around "instruction selection" and will often generate the "idealized" sequence for any given scenario, even if that scenario will be hit in less than 1% of code in practice. What this often means is that LLVM codegen "looks good" even for leaf methods (methods which don't call anything else). This really makes small inlinable methods doing simple operations with constant operands or small types look great. However, in practice these functions will be inlined into the callsite, and the codegen will often look completely different, because you start having to take into account the surrounding code, the fact that small types incur partial-mutation penalties, the fact that various instructions may have to compete with other instructions for ports, etc. The JIT, in comparison, doesn't have the time to do all this analysis in depth, and most of the time it won't provide a substantial measured benefit compared to other higher-level optimizations. This makes it harder to know when making the switch is "better".
Here is the LLVM implementation of these optimizations:
Here is the rationale from the AMD docs regarding
This describes the AMD Family 15h from 2011 (among them Bulldozer, Piledriver, and Excavator), which were all very slow compared to other architectures. The architecture was eventually fully overhauled and replaced by the Zen architecture, now commonly known through the Ryzen and Epyc processors. Based on the latest AMD recommendation:

Fully pipelined here means the instruction scheduler can issue an integer multiplication every cycle, but each multiplication result is available only after the 3rd cycle. The CPU meanwhile will execute other independent instructions out of order, or, with SMT, will execute "co-thread" instructions. That is, if the multiplication result is required immediately by the next instruction. Furthermore, because of out-of-order execution, the multiplication would have been executed ahead of time if the 2 prior instructions were not also multiplications. AMD has removed all recommendations for 64-bit constant-multiplication replacement sequences from the latest optimization guides and only notes the possibility of using lea/shift instead. Intel has done the same in their optimization guides. Throughput allows 1 multiplication per cycle; only latency can be an issue. But latency analysis is very, very hard to do, as it requires a full analysis of all available execution units in the currently running processor, an analysis of all surrounding instructions to account for out-of-order execution, and on top of that an analysis of what the "co-thread" is doing when the CPU core has SMT enabled. (SMT means every CPU core runs 2 execution threads, cycle-alternating; when one thread stalls due to an instruction dependency or a memory wait, the other thread continues executing instructions instead.) One final quote from the old optimization guides for the 15h and older architectures:

... which, based on the multiplication improvements in the Zen architectures (latency down from 4-5 to 3), is now less than or equal to 2 cycles of latency. Try it yourself:
```c
#include <stdint.h>
int64_t multiply(int64_t num) {
    return num * 62;
}
```

the result is:

```asm
multiply(long):
    imul rax, rdi, 62
    ret
```

now remove the -march=znver1 and you get:

```asm
multiply(long):
    mov rax, rdi
    sal rax, 5
    sub rax, rdi
    add rax, rax
    ret
```

Let's make this more realistic with many multiplications:

```c
#include <stdint.h>
int multiply(int64_t num) {
    int64_t a = num * 30;
    int64_t b = num * 62;
    int64_t c = num * 126;
    int64_t d = num * 254;
    int64_t e = num * 510;
    return a ^ b ^ c ^ d ^ e; // prevents inter-line optimization
}
```

-O9:

```asm
multiply(long):
    mov rdx, rdi
    mov rax, rdi
    sal rdx, 4     #1
    sal rax, 5     #2
    sub rdx, rdi   #3
    sub rax, rdi   #4
    xor eax, edx
    mov rdx, rdi
    sal rdx, 6     #5
    sub rdx, rdi   #6
    xor eax, edx
    mov rdx, rdi
    sal rdx, 7     #7
    sub rdx, rdi   #8
    xor eax, edx
    mov rdx, rdi
    sal rdx, 8     #9
    sub rdx, rdi   #10
    xor eax, edx
    add eax, eax   #11 -- nice trick: all multiplications are even, so a factor of 2 was pulled out
    ret
```

Not counting the xor instructions, as well as all mov instructions (those are effectively NOPs), we have 11 operations.

-O9 -march=znver1:

```asm
multiply(long):
    imul rdx, rdi, 62    #1
    imul rax, rdi, 30    #2
    imul rcx, rdi, 254   #3
    xor eax, edx
    imul rdx, rdi, 126   #4
    imul rdi, rdi, 510   #5
    xor edx, ecx
    xor eax, edx
    xor eax, edi
    ret
```

Not counting xor again, we have 5 operations. 3 of the 5 multiplication results are consumed only after their latency has passed; only the 2nd and 4th multiplications stall the 1st/2nd xor instruction by 1 cycle, but out-of-order execution will run the following multiplication instead. The final xor executes once the last multiplication result is ready.

Which version will run faster? This is extremely hard to tell. We have competing shift and subtract instructions and a lot of register dependencies. Overall I think the 2nd is faster, as the ALU unit idles more and can execute other instructions afterwards.

edit-notes:
I cannot decide whether it's worth implementing micro-optimizations for very old architectures (over 10 years old). But in my opinion, if utmost performance is crucial, then you do not run code on such old hardware; you replace it instead.
Thank you for this analysis, @hopperpl. Do you have a link to the latest documentation from AMD?
The only publicly available guides for Zen 1/2/3 are 55723, 56305, and 56665. The latter two are for EPYC variants, but that is irrelevant, as each Zen architecture uses the same core across all Ryzen and Epyc variants; the difference is merely in I/O and L-cache. For the non-public version you have to file a request with AMD. https://www.amd.com/system/files/TechDocs/55723_3_01_0.zip
Unfortunately, 10 years being "very old" isn't so clear cut. In the enterprise world this is generally the case, but in the consumer, school, and small-business markets you can often expect a decent percentage of users to be on this kind of hardware; the Steam Hardware Survey reports it at about 10-15% of users. We need to find a balance between the two in our targets. Zen 1 is only 5 years old at this point, and it's likely "too new" a target to solely base these heuristics on. There are also newer considerations, such as with the

I imagine 2-instruction replacement for all cases, and 3-instruction replacement for 64-bit cases, is a reasonable goal. There may be some other cases worth playing around with and benchmarking, but we need to be careful and considerate of those.
We typically try to avoid private documents, as they significantly hinder our ability to use and refer to them in an OSS project. Ideally all references are publicly available under https://www.amd.com/en/support/tech-docs or https://developer.amd.com/resources/developer-guides-manuals/. For Intel this is generally https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
@tannergooding At what point does it become worthwhile for the JIT to generate different code based on the architecture of the CPU it's running on? (Of course, there still has to be some default for AOT.)
We already do that to some extent for SIMD (especially when selecting instruction encodings) and for various intrinsic library functions providing "core" math functionality (like popcnt, lzcnt, tzcnt, etc). However, these all add complexity to the testing matrix, and we ultimately need to be somewhat considerate of that. There are likely some micro-architecture-specific optimizations we could take advantage of, but we'd need real, concrete numbers showing the benefit, and showing that it would be a net negative to not just do such an optimization "everywhere".
I tested using the replacement sequence. Here are the benchmarks:

```
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.674)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.47203), X64 RyuJIT AVX2
```

| Method | Mean | Error | StdDev |
|-------------------- |----------:|----------:|----------:|
| imul | 5.291 ms | 0.0133 ms | 0.0125 ms |
| mov_shl_sub | 3.523 ms | 0.0068 ms | 0.0061 ms |
| imul_complex | 31.795 ms | 0.1466 ms | 0.1145 ms |
| mov_shl_sub_complex | 26.489 ms | 0.0437 ms | 0.0409 ms |

Benchmark source:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace Perf
{
    public class Multiply
    {
        [Benchmark]
        public void imul()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
                value = value * 7;
        }

        [Benchmark]
        public void mov_shl_sub()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
                value = value * 8 - value;
        }

        [Benchmark]
        public void imul_complex()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
            {
                value = value * 7;
                value = value * 7345;
                value = value * 7;
                value = value * 7345;
                value = value * 7;
                value = value * 7345;
            }
        }

        [Benchmark]
        public void mov_shl_sub_complex()
        {
            ulong value = 2;
            for (var i = 0; i < 10000000; i++)
            {
                value = value * 8 - value;
                value = value * 7345;
                value = value * 8 - value;
                value = value * 7345;
                value = value * 8 - value;
                value = value * 7345;
            }
        }
    }

    static class Program
    {
        static int Main()
        {
            BenchmarkRunner.Run<Multiply>();
            return 0;
        }
    }
}
```

Despite the replacement sequence being more instructions and a size increase, it still performs significantly better.
Using a four-instruction replacement sequence seems to not be worth it, though:

| Method | Mean | Error | StdDev |
|------------ |---------:|----------:|----------:|
| imul | 5.358 ms | 0.0232 ms | 0.0217 ms |
| mov_shl_sub | 5.378 ms | 0.0401 ms | 0.0375 ms |

```csharp
[Benchmark]
public void imul()
{
    ulong value = 2;
    for (var i = 0; i < 10000000; i++)
        value = value * 14;
}

[Benchmark]
public void mov_shl_sub()
{
    ulong value = 2;
    for (var i = 0; i < 10000000; i++)
        value = value * 16 - value - value;
}
```
I ran the identical benchmark on an older Zen 1 and got very similar results.

But then I wanted to remove all the dependency chains, because all these benchmarks stall on the latency-3 multiplication. So I did some interleaving and reran the benchmark. Notice that the number of "multiplications" triples: there are now 30 million, instead of the original 10 million.

```csharp
[Benchmark]
public void imul_interleave()
{
    ulong valueA = 2;
    ulong valueB = 2;
    ulong valueC = 2;
    for (var i = 10000000; i != 0; i--)
    {
        valueA = valueA * 7;
        valueB = valueB * 7;
        valueC = valueC * 7;
    }
}

[Benchmark]
public void mov_shl_sub_interleave()
{
    ulong valueA = 2;
    ulong valueB = 2;
    ulong valueC = 2;
    for (var i = 10000000; i != 0; i--)
    {
        valueA = valueA * 8 - valueA;
        valueB = valueB * 8 - valueB;
        valueC = valueC * 8 - valueC;
    }
}
```

Here's the disassembly to verify all is as expected:

```asm
C.imul_interleave()
    L0000: mov eax, 2
    L0005: mov edx, 2
    L000a: mov ecx, 2
    L000f: mov r8d, 0x989680
    L0015: nop [rax+rax]
    L0020: imul rax, 7
    L0024: imul rdx, 7
    L0028: imul rcx, 7
    L002c: dec r8d
    L002f: jne short L0020
    L0031: ret

C.mov_shl_sub_interleave()
    L0000: mov eax, 2
    L0005: mov edx, 2
    L000a: mov ecx, 2
    L000f: mov r8d, 0x989680
    L0015: mov r9, rax
    L0018: shl r9, 3
    L001c: sub r9, rax
    L001f: mov rax, r9
    L0022: mov r9, rdx
    L0025: shl r9, 3
    L0029: sub r9, rdx
    L002c: mov rdx, r9
    L002f: mov r9, rcx
    L0032: shl r9, 3
    L0036: sub r9, rcx
    L0039: mov rcx, r9
    L003c: dec r8d
    L003f: jne short L0015
    L0041: ret
```

And here are the results:

This is truly a beautiful result. I love it. While the number of multiplications tripled from imul to imul_interleave, the execution time did not change. This beautifully shows the latency of 3 for imul: the interleaved variant can execute 3 multiplications in parallel, as each result is not needed right away. On top of that, the number of shift/subtract instructions also tripled from mov_shl_sub to mov_shl_sub_interleave, but the multiple ALU units were easily able to fill the empty slots; that's why execution time merely went up by 50%. Furthermore, the CPU clocks on average at 4.0 GHz: 30 million imul / 4 billion ticks = 0.0075 s, or 7.5 ms.

Now, what does this all mean? Well, artificial benchmarks are not helpful for real-world scenarios. I would still go with the sub/shl replacement code, as this benchmark at least shows such operations can fill many of the empty slots the ALUs have available. A 3-operation replacement should benefit as well, if there is only 1 shift operation involved (not counting lea).
The "multiple multiplications without changing execution time" test is definitely a great one. I ran your benchmark with interleaving, and this was my result on the 7950X:

| Method | Mean | Error | StdDev |
|----------------------- |---------:|----------:|----------:|
| imul_interleave | 5.375 ms | 0.0167 ms | 0.0140 ms |
| mov_shl_sub_interleave | 4.124 ms | 0.0165 ms | 0.0154 ms |

I think it's pretty safe to say that doing the optimization is worth it. I'm glad we all went to this depth to vet it and provided data to back it up.
The 7950X runs at around 5.7 GHz, so the math predicts 5.263 ms. That means multiplication is still latency-3 in Zen 4. The fact that shl/sub is still faster tells me that Zen 4 has two shift/shuffle units; Zen 1 has only one. Actually, it should be Zen 3: that core got a major integer-unit overhaul and a twice-as-fast integer division unit. Overall, I agree it will be beneficial to replace.
Closing, as we have exhausted these optimizations now. As an FYI, we did not implement every single one of the optimizations mentioned here, as some are no longer beneficial on modern hardware.
Analysis
The imul instruction with an 8-bit/32-bit immediate value

This is what we are trying to avoid. uops.info says the imul instruction takes at least 3 clock cycles to calculate its result, and takes around 1 clock cycle away from any other instruction that can only run in the same execution port, on typical x86 CPUs produced by either Intel or AMD.

Ways of multiplying an integer by a constant without imul

Advanced lea sequences

lea stands for "Load Effective Address". The x86 lea instruction is intended for calculating memory addresses, but not only that: it enables us to calculate some simple constant integer multiplications, and even an integer fused multiply-add in some cases.

The formula for an effective address has the form Base + (Index * Scale) + Displacement.
The "Scale" can be 1, 2, 4, or 8, and the "Displacement" can be a constant integer immediate value, including 0.
The "Base" and the "Index" can be any of the following registers: RAX, RBX, RCX, RDX, RSP, RBP, RSI, RDI.
Using a single lea instruction, we can multiply an integer by some small constants like 3, 5, and 9. Current RyuJIT already handles this situation correctly.

But we can do even more. For instance, let's see how the two compilers optimize multiplication by 11, 15, and 19:

We can observe LLVM doing something new. The Base + (Index * Scale) + Displacement pattern is used twice, with Displacement set to 0, and Base and Index are sometimes different registers: we can multiply an integer by 1, 2, 4, or 8 and then add another integer, all in one instruction.
shl followed by lea

The lea instruction can also be used for multiplying an integer by a power of 2 plus either 1, 2, 4, or 8, though LLVM mysteriously neglects to do that with scale 8. Let's see how the two compilers optimize multiplication by 66, 68, and 70:

shl followed by sub

The sub instruction can be used for multiplying an integer by a power of 2 minus 1. By executing sub twice, it can be used for multiplying an integer by a power of 2 minus 2. Let's see how the two compilers optimize multiplication by 62 and 63:
Multiplying by a power of 2

When a composite number contains one or more factors of 2 in its prime factorization, both compilers can optimize such a multiplication. For instance, let's see how the two compilers optimize multiplication by 6, 20, and 72:

Interestingly, they have different ordering preferences: RyuJIT prefers the multiplications to come earlier, while LLVM prefers the shifts to come earlier. Both compilers managed to optimize multiplication by 2 to add rdi, rdi (or whatever adds the same register to itself), though.

Combining these techniques
By combining the techniques described above, we can optimize constant integer multiplication for even more constant values. Here is a list of formulas for integer multiplication by constant values less than 128; only values that LLVM optimizes are contained in this list.