[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213
Comments
Related to #4155 |
Div and mod have nothing to do with addressing, that's always done with add and mul. You're probably talking about something else. |
Div and mod are pretty useful to recover row/column information from indices:
Mostly when you are treating a singular array as a 2D array (eg. through usage of |
@mikedn Actually this is a duplicate of #4155. I would have never thought about calling it @CarolEidt Is the limitation of using multiple registers per single instruction still relevant? |
Yes, the multiple register issue is still relevant, but likely something we will soon be addressing. |
What @JanielS shows is not so common and certainly not related to the notion of "addressing modes" as used in compilers. More generally, accessing matrix elements has nothing to do with division and modulo. The offset of a matrix element is simply |
@mikedn They are just two sides of the same coin.
and
Putting it all in the context of assembly/compiler usage is why you are seeing it as special or unusual, when it is not. This pattern is usually hidden in plain sight: it is so common that it is very easy to overlook. You will only see both together in certain places, because that means you are trying to retrieve both values at the same time instead of accessing them with a for loop. |
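To make the "two sides of the same coin" point concrete, here is a minimal sketch, assuming a row-major layout over a flat array. The names `GridIndex`, `ToIndex`, and `ToRowColumn` are hypothetical and do not come from the thread.

```csharp
using System;

public static class GridIndex
{
    // Forward direction: (row, column) -> flat index.
    // This needs only a multiply and an add.
    public static int ToIndex(int row, int column, int width) => row * width + column;

    // Reverse direction: flat index -> (row, column).
    // This is where a division and a modulus over the same operands
    // appear together, which is the pattern this issue is about.
    public static (int Row, int Column) ToRowColumn(int index, int width)
    {
        int row = index / width;
        int column = index % width;
        return (row, column);
    }
}
```

Code that walks the array with nested loops only ever uses the forward direction, which may be why the reverse direction looks uncommon from the compiler's point of view.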
Any update on whether we are going to see this anytime soon? And the code that generates it. This is actual code I cannot get away with; I will probably try to tweak it to reduce dependencies, but that's it. This is killing me. If I can get even 5%, it would be a huge win. :) EDIT: All the program in context (marked in red: address translations, div/mod heavy). cc @AndyAyersMS |
Does |
@mikedn That's a great idea, I can probably pack 8 to 10 multiplications before paying for an |
Before eliminating division you should eliminate the remainder. Now, getting rid of the division completely may or may not be feasible. For a given denominator you have to store 2 numbers somewhere - a "magic" number and a shift count. Computing these 2 numbers is expensive, so if you can't store them somewhere this won't work. The exact details can be found here, in the "Exact division by constants" section. To make your life easier you could simply pick up existing code from a compiler. RyuJIT has an implementation and I extracted it here. It's not as fast as I hoped but it's still ~1.5x faster than division. The code generated by the compiler when the denominator is actually a constant is faster (3x) but that's expected; for example, that code doesn't have to deal with positive/negative magic numbers at runtime. Note that this implementation only works for positive numbers, supporting negative numbers makes |
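The "magic number plus shift count" idea can be sketched as follows. This is not the RyuJIT implementation referenced above. It is a simplified variant that assumes a positive divisor and non-negative dividends, and the type `InvariantDivisor` and its members are hypothetical names. The two values are computed once per divisor and stored, so each later division costs a multiply and a shift.

```csharp
using System;
using System.Numerics;

// Hypothetical helper: precomputes a magic multiplier and a shift count
// for one divisor, then divides non-negative ints without a div instruction.
public readonly struct InvariantDivisor
{
    private readonly ulong _magic;
    private readonly int _shift;
    private readonly int _divisor;

    public InvariantDivisor(int divisor)
    {
        if (divisor <= 0) throw new ArgumentOutOfRangeException(nameof(divisor));
        // l = ceil(log2(divisor)); this is 0 when divisor == 1.
        int l = 32 - BitOperations.LeadingZeroCount((uint)(divisor - 1));
        _shift = 31 + l; // at most 62, so 1UL << _shift cannot overflow
        // magic = ceil(2^shift / divisor); this is the expensive, one-time step.
        _magic = ((1UL << _shift) + (ulong)divisor - 1) / (ulong)divisor;
        _divisor = divisor;
    }

    // Quotient via multiply and shift; valid for 0 <= n <= int.MaxValue.
    public int Divide(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        return (int)((_magic * (ulong)n) >> _shift);
    }

    // Remainder recovered with a multiply and a subtract instead of a second division.
    public int DivRem(int n, out int remainder)
    {
        int q = Divide(n);
        remainder = n - q * _divisor;
        return q;
    }
}
```

An `InvariantDivisor` would be created once (for example at tree creation) and reused for every lookup. Whether it actually beats a hardware division depends on the CPU and on how the JIT compiles it, so it needs to be measured.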
@mikedn I can do that on tree creation and just store it and be done with it :). |
Until there's a solution for this in the JIT, should DivRem change from:
public static long DivRem(long a, long b, out long result)
{
    result = a % b;
    return a / b;
}
to:
public static long DivRem(long a, long b, out long result)
{
    long div = a / b;
    result = a - (div * b);
    return div;
}
? (And the same for the Int32 overload.) |
@stephentoub Yes, it should. I can confirm that the performance gain from using the multiplication in the second version has been huge in my case. |
Changing |
Ok, I'll submit a PR for that, including a TODO linking to this issue. |
Bump. Another case of a missed opportunity (see comments): https://ayende.com/blog/176993/when-the-code-says-you-are-stupid-but-you-are-too-stupid-to-know-that?key=e132f8d5476047c69748ef298fd8897a |
I would guess that |
Not that one. This one:
const int ConstDivisor = 4 * 1024;
const int DivideAdjustment = ConstDivisor - 1;
static int DivTest1(int num)
{
return num / ConstDivisor + (num % ConstDivisor == 0 ? 0 : 1);
} |
How's that different? It still divides by 4096. |
I read it wrong; it is using ConstDivisor and not DivideAdjustment. Anyway, just change ConstDivisor to a non-const divisor and you get the same issue. The case where the divisor is both a constant and a power of 2 is indeed another issue, probably worth reporting by itself. EDIT: Lately we decided to eliminate the flexibility, and now all of our page sizes are constants just to avoid this issue altogether, with a huge improvement as a result. |
Another case for power of 2 constants.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int GetNumberOfOverflowPages(long overflowSize)
{
overflowSize += Constants.Tree.PageHeaderSize;
return (int)(overflowSize / Constants.Storage.PageSize) + (overflowSize % Constants.Storage.PageSize == 0 ? 0 : 1);
}
This code generates:
00007FFA13645D12 48 63 CB movsxd rcx,ebx
00007FFA13645D15 48 83 C1 40 add rcx,40h
00007FFA13645D19 48 8B D1 mov rdx,rcx
00007FFA13645D1C 4C 8B C2 mov r8,rdx
00007FFA13645D1F 49 C1 F8 3F sar r8,3Fh
00007FFA13645D23 49 81 E0 FF 1F 00 00 and r8,1FFFh
00007FFA13645D2A 4C 03 C2 add r8,rdx
00007FFA13645D2D 49 C1 F8 0D sar r8,0Dh
00007FFA13645D31 41 8B D0 mov edx,r8d
00007FFA13645D34 4C 8B C1 mov r8,rcx
00007FFA13645D37 49 C1 F8 3F sar r8,3Fh
00007FFA13645D3B 49 81 E0 FF 1F 00 00 and r8,1FFFh
00007FFA13645D42 4C 03 C1 add r8,rcx
00007FFA13645D45 49 81 E0 00 E0 FF FF and r8,0FFFFFFFFFFFFE000h
00007FFA13645D4C 49 2B C8 sub rcx,r8
However, you can actually rewrite that (knowing that the constant is a power of 2) into:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int GetNumberOfOverflowPages(long overflowSize)
{
overflowSize += Constants.Tree.PageHeaderSize;
return (int)(overflowSize >> Constants.Storage.PageSizeShift) + ((overflowSize & Constants.Storage.PageSizeMask) == 0 ? 0 : 1);
}
Which will output:
00007FFA13635FF2 48 63 CB movsxd rcx,ebx
00007FFA13635FF5 48 83 C1 40 add rcx,40h
00007FFA13635FF9 48 8B D1 mov rdx,rcx
00007FFA13635FFC 48 C1 FA 0D sar rdx,0Dh
00007FFA13636000 F7 C1 FF 1F 00 00 test ecx,1FFFh
00007FFA13636006 74 07 je 00007FFA1363600F
00007FFA13636008 B9 01 00 00 00 mov ecx,1
00007FFA1363600D EB 02 jmp 00007FFA13636011
00007FFA1363600F 33 C9 xor ecx,ecx
00007FFA13636011 44 8D 34 0A lea r14d,[rdx+rcx]
That is half the bytes. |
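One caveat about the shift/mask rewrite is worth spelling out. Signed division in C# truncates toward zero, while an arithmetic right shift rounds toward negative infinity. The extra `sar`/`and`/`add` instructions in the first listing are the fix-up for that difference. The two forms therefore agree only when the value is known to be non-negative. A minimal sketch follows, assuming a page size of 8192 (inferred from the `0Dh` shift and `1FFFh` mask in the disassembly); the class and member names are hypothetical.

```csharp
using System;

public static class PowerOfTwoDivision
{
    // Assumed values: shift 13 and mask 0x1FFF, matching the disassembly above.
    public const int PageSizeShift = 13;
    public const long PageSize = 1L << PageSizeShift;
    public const long PageSizeMask = PageSize - 1;

    // Division and modulus form, as in the first version.
    public static int PagesWithDivMod(long size) =>
        (int)(size / PageSize) + (size % PageSize == 0 ? 0 : 1);

    // Shift and mask form, as in the rewritten version.
    public static int PagesWithShiftMask(long size) =>
        (int)(size >> PageSizeShift) + ((size & PageSizeMask) == 0 ? 0 : 1);
}
```

For any `size >= 0` both methods return the same page count. For `size = -1` the first returns 1 and the second returns 0, so the JIT cannot make this substitution on its own unless it can prove the operand is non-negative (or the operand is unsigned).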
That's unrelated to the original issue, there's no |
Yes, I should have created a new one instead. While it is the same combination (using division and modulus over the same parameters), this one uses constants and the other does not. Created https://github.com/dotnet/coreclr/issues/13380 |
Any update? |
This should be solved by #66551. |
Just calling |
@pentp Would |
Probably yes, but for long/ulong needs #67285 (or part of it). |
Related: #4155 |
Instead of division then remainder operation, used Math.DivRem which handles it in a more optimized way. It uses only a single division, multiplication, and subtraction instead of two divisions. Moreover, in the future, it can be even more optimized using intrinsics. See: dotnet/runtime#5213 (comment)
Found this trying to optimize a very low level memory addressing algorithm. Suddenly I noticed that we were doing the idiv twice.
Repro on rc2-16551:
will output (in x64 but will also happen in x86):
this could be written as:
This is pretty common in addressing modes or when accessing matrix locations, etc. Everywhere you have a row and a column, you will have such a structure.
category:cq
theme:div-mod-rem
skill-level:expert
cost:large
impact:medium