[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213

redknightlois · 2016-03-01T03:43:03Z

Found this trying to optimize a very low level memory addressing algorithm. Suddenly I noticed that we were doing the idiv twice.

Repro on rc2-16551:

  public class Program
    {
        public long Dividend;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public long Run(long parameter)
        {
            long div = parameter / Dividend;
            long mod = parameter % Dividend;

            return div + mod;
        }

        public static void Main(string[] args)
        {
            Random r = new Random();

            var p = new Program();
            p.Dividend = 10;

            long value = p.Run(203);
            Console.WriteLine($"{value}");
            Console.ReadLine();
        }
    }

will output (in x64 but will also happen in x86):

00007FF8E8EC3C83  mov         rcx,qword ptr [rcx+8]  
00007FF8E8EC3C87  mov         rax,r8          // 2 instructions to prepare for idiv. 
00007FF8E8EC3C8A  cqo                    
00007FF8E8EC3C8C  idiv        rax,rcx         // parameter / Dividend
00007FF8E8EC3C8F  mov         r9,rax          // r9:= divResult 
00007FF8E8EC3C92  mov         rax,r8          // 2 instructions to prepare for idiv. 
00007FF8E8EC3C95  cqo  
00007FF8E8EC3C97  idiv        rax,rcx         //  parameter / Dividend
00007FF8E8EC3C9A  lea         rax,[r9+rdx]    // rax := divResult + modulus

this could be written as:

00007FF8E8EC3C83  mov         rcx,qword ptr [rcx+8]  
00007FF8E8EC3C87  mov         rax,r8          // 2 instructions to prepare for idiv. 
00007FF8E8EC3C8A  cqo                    
00007FF8E8EC3C8C  idiv        rax,rcx         //  parameter / Dividend
00007FF8E8EC3C92  add         rax,rdx         //  rax := divResult + modulus

This is pretty common in addressing modes or accessing matrix locations, etc. Everywhere you have a row and a column, you will have such an structure.

category:cq
theme:div-mod-rem
skill-level:expert
cost:large
impact:medium

mikedn · 2016-03-01T06:29:45Z

Related to #4155

mikedn · 2016-03-01T06:32:28Z

This is pretty common in addressing modes or accessing matrix locations, etc. Everywhere you have a row and a column, you will have such an structure.

Div and mod have nothing to do with addressing, that's always done with add and mul. You're probably talking about something else.

jakobbotsch · 2016-03-01T07:32:26Z

Div and mod have nothing to do with addressing, that's always done with add and mul. You're probably talking about something else.

Div and mod are pretty useful to recover row/column information from indices:

int index = row * width + col;
...
int row = index / width;
int col = index % width;

Mostly when you are treating a singular array as a 2D array (eg. through usage of Bitmap.LockBits).
That said, it's not very common, but not extremely rare either.

redknightlois · 2016-03-01T13:51:23Z

@mikedn It is if you are working with stride based binary structures as @JanielS mentioned. I would mention too, that when you are dealing with building data structures using memory maps (therefore unsafe code), it is a pretty common thing to happen.

redknightlois · 2016-03-01T14:00:13Z

@mikedn Actually this is a duplicate of #4155. I would have never though about calling it DivRem that's why I havent found it.

@CarolEidt Is the limitation of using multiple registers per single instruction still relevant?

CarolEidt · 2016-03-01T14:07:46Z

Yes, the multiple register issue is still relevant, but likely something we will soon be addressing.

mikedn · 2016-03-01T14:22:52Z

It is if you are working with stride based binary structures as @JanielS mentioned.

What @JanielS shows is not so common and certainly not related to the notion of "addressing modes" as used in compilers. More generally, accessing matrix elements has nothing to do with division and modulo. The offset of a matrix element is simply row * stride + column. Sure, there may be special/unusual/whatever cases that requires division and modulo but it's a long way from such cases to "Everywhere you have a row and a column, you will have such an structure.".

redknightlois · 2016-03-01T16:49:28Z

@mikedn They are just two sides of the same coin.

AbsolutePosition := row * stride + column

and

row := AbsolutePosition / TotalColumns
column = AbsolutePosition % TotalColumns

Putting all in the context of the assembly/compiler use is why you are seeing it as special/unusual, where it is not. This pattern is usually hidden in plain sight, it is so common and used, that is very easy to overlook. Only in certain places you will see both together because that means you are trying to retrieve both values at the same time instead of accessing it using a for loop.

redknightlois · 2016-11-11T21:15:58Z

Any update on if we are going to see this anytime soon?

And the code that generates it.

This is actual code I cannot get away with, I will probably try to tweak it to reduce dependencies, but that's it. This is killing me, if I can get even 5% would be a huge win. :)

EDIT: All the program in context (marked in red address translations - div/mod heavy

)

cc @AndyAyersMS

mikedn · 2016-11-11T22:08:28Z

Does BlocksPerPage change often? I would guess that it doesn't, it's fixed for a given BTree. If that's the case then you could use reciprocal division, it would likely be better than the JIT removing a redundant idiv.

redknightlois · 2016-11-11T22:43:18Z

@mikedn That's a great idea, I can probably pack 8 to 10 multiplications before paying for an idiv even more with 2 of them, I can do almost whatever I want and still probably win. Do you have any good resource on how to efficiently deal with a "sorta" fixed (but not constant) divisor?

mikedn · 2016-11-12T22:28:31Z

Before eliminating division you should eliminate the reminder. n % d is simply n - (n / d) * d. You already have n / d so to compute % you only need a - and a * and these are very cheap compared to %. Come to think of it, the JIT could probably do this automatically if it turns out that removing the redundant idiv is problematic due to the multiple register issue.

Now, getting rid of the division completely may or may not be feasible. For a given denominator you have to store 2 numbers somewhere - a "magic" number and a shift count. Computing these 2 numbers is expensive so if you can't store them somewhere this won't work.

The exacts details can be found here, in the "Exact division by constants" section. To make your life easier you could simply pick up existing code from a compiler. RyuJIT has an implementation and I extracted it here.

It's not as fast as I hoped but it's still ~1.5x faster than division. The code generated by the compiler when the denominator is actually a constant is faster (3x) but that's expected, for example that code doesn't have to deal with positive/negative magic numbers at runtime.

Note that this implementation only works for positive numbers, supporting negative numbers makes MagicDiv more complicated and probably slower.

redknightlois · 2016-11-12T23:57:49Z

@mikedn I can do that on tree creation and just store it and be done with it :).

stephentoub · 2016-11-14T20:11:13Z

Before eliminating division you should eliminate the reminder. n % d is simply n - (n / d) * d. You already have n / d so to compute % you only need a - and a * and these are very cheap compared to %. Come to think of it, the JIT could probably do this automatically if it turns out that removing the redundant idiv is problematic due to the multiple register issue.

Until there's a solution for this in the JIT, should:
https://github.com/dotnet/coreclr/blob/f8b8b6ab80f1c5b30cc04676ca2e084ce200161e/src/mscorlib/src/System/Math.cs#L733
be changed from:

public static long DivRem(long a, long b, out long result)
{
    result = a % b;
    return a / b;
}

to:

public static long DivRem(long a, long b, out long result)
{
    long div = a / b;
    result = a - (div * b);
    return div;
}

? (And the same for the Int32 overload.)

redknightlois · 2016-11-14T20:19:13Z

@stephentoub Yes, it should. I can confirm that the performance of using the multiplication in the second case (in my case) has been huge.

mikedn · 2016-11-14T20:21:16Z

Changing DivRem like that sounds like a good plan, but we need to remember to revert the change once (if) the JIT starts optimizing this. Unless we also make JIT recognize a - (div * b), I'm toying with the idea.

stephentoub · 2016-11-15T12:44:29Z

Changing DivRem like that sounds like a good pla

Ok, I'll submit a PR for that, including a TODO linking to this issue.

redknightlois · 2017-02-17T14:32:14Z

Bump. Another case of a missed opportunity (see comments): https://ayende.com/blog/176993/when-the-code-says-you-are-stupid-but-you-are-too-stupid-to-know-that?key=e132f8d5476047c69748ef298fd8897a

mikedn · 2017-02-17T15:02:25Z

I would guess that Constants.Size.Kilobyte is 1024 so you're dividing by 4096. So your example has nothing to do with this issue...

redknightlois · 2017-02-17T15:06:56Z

Not that one.

This one:

const int ConstDivisor = 4 * 1024;
const int DivideAdjustment = ConstDivisor - 1;
static int DivTest1(int num)
{
    return num / ConstDivisor + (num % ConstDivisor == 0 ? 0 : 1);
}

mikedn · 2017-02-17T15:17:10Z

How's that different? It still divides by 4096.

redknightlois · 2017-02-17T15:22:43Z

Read it wrong, It is using ConstDivisor and not DivideAdjustment. Anyway, just change the ConstDivisor to non const division and you get the same issue. For power of 2 and also const is indeed another issue probably worth to report by itself.

EDIT: Lately we decided to eliminate the flexibility and now all of our page sizes are constants just to avoid this issue altogether with a huge improvement as a result.

redknightlois · 2017-08-15T15:28:39Z

Another case for power of 2 constants.

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int GetNumberOfOverflowPages(long overflowSize)
        {
            overflowSize += Constants.Tree.PageHeaderSize;
            return (int)(overflowSize / Constants.Storage.PageSize) + (overflowSize % Constants.Storage.PageSize == 0 ? 0 : 1);
        }

This code generates:

00007FFA13645D12 48 63 CB             movsxd      rcx,ebx  
00007FFA13645D15 48 83 C1 40          add         rcx,40h  
00007FFA13645D19 48 8B D1             mov         rdx,rcx  
00007FFA13645D1C 4C 8B C2             mov         r8,rdx  
00007FFA13645D1F 49 C1 F8 3F          sar         r8,3Fh  
00007FFA13645D23 49 81 E0 FF 1F 00 00 and         r8,1FFFh  
00007FFA13645D2A 4C 03 C2             add         r8,rdx  
00007FFA13645D2D 49 C1 F8 0D          sar         r8,0Dh  
00007FFA13645D31 41 8B D0             mov         edx,r8d  
00007FFA13645D34 4C 8B C1             mov         r8,rcx  
00007FFA13645D37 49 C1 F8 3F          sar         r8,3Fh  
00007FFA13645D3B 49 81 E0 FF 1F 00 00 and         r8,1FFFh  
00007FFA13645D42 4C 03 C1             add         r8,rcx  
00007FFA13645D45 49 81 E0 00 E0 FF FF and         r8,0FFFFFFFFFFFFE000h  
00007FFA13645D4C 49 2B C8             sub         rcx,r8

However you can actually rewrite that (knowing that the constant is power of 2) into:

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int GetNumberOfOverflowPages(long overflowSize)
        {            
            overflowSize += Constants.Tree.PageHeaderSize;
            return (int)(overflowSize >> Constants.Storage.PageSizeShift) + ((overflowSize & Constants.Storage.PageSizeMask) == 0 ? 0 : 1);
        }

Which will output:

00007FFA13635FF2 48 63 CB             movsxd      rcx,ebx  
00007FFA13635FF5 48 83 C1 40          add         rcx,40h  
00007FFA13635FF9 48 8B D1             mov         rdx,rcx  
00007FFA13635FFC 48 C1 FA 0D          sar         rdx,0Dh  
00007FFA13636000 F7 C1 FF 1F 00 00    test        ecx,1FFFh  
00007FFA13636006 74 07                je          00007FFA1363600F  
00007FFA13636008 B9 01 00 00 00       mov         ecx,1  
00007FFA1363600D EB 02                jmp         00007FFA13636011  
00007FFA1363600F 33 C9                xor         ecx,ecx  
00007FFA13636011 44 8D 34 0A          lea         r14d,[rdx+rcx]

That is half the bytes.

mikedn · 2017-08-15T16:00:16Z

Another case for power of 2 constants.

That's unrelated to the original issue, there's no idiv in this case.

redknightlois · 2017-08-15T18:45:03Z

Yes I should have created a new one instead. While it is the same combination (using division and modulus over the same parameters), this one is using constants the other is not. Created https://github.com/dotnet/coreclr/issues/13380

caihongxu · 2020-09-21T22:39:37Z

Any update?

huoyaoyuan · 2022-09-24T08:04:04Z

This should be solved by #66551. Math.DivRem and X86Base.DivRem would be able to utilize the two registers then.
Though DivRem is effectively a workaround, it should cover the desired use cases.

pentp · 2022-09-26T12:04:13Z

Just calling X86Base.DivRem from Math.DivRem isn't the entire solution - it would be bad for constant divisors because it would bypass JIT optimizations and always execute the div/idiv instruction. So even if there isn't going to be a general JIT solution for this, then Math.DivRem would at least have to be made a JIT intrinsic.

svick · 2022-09-26T14:45:32Z

@pentp Would RuntimeHelpers.IsKnownConstant help with that?

pentp · 2022-09-26T14:50:23Z

Probably yes, but for long/ulong needs #67285 (or part of it).

TIHan · 2023-01-28T00:37:08Z

Related: #4155

Instead of division then remainder operation, used Math.DivRem which handles it in a more optimized way. It uses only a single division, multiplication, and subtraction instead of two divisions. Moreover, in the future, it can be even more optimized using intrinsics. See: dotnet/runtime#5213 (comment)

Instead of division then remainder operation, used Math.DivRem which handles it in a more optimized way. It uses only a single division, multiplication, and subtraction instead of two divisions. Moreover, in the future, it can be even more optimized using intrinsics. See: dotnet/runtime#5213 (comment) Co-authored-by: lechu445 <xxx@example.com>

msftgits transferred this issue from dotnet/coreclr Jan 30, 2020

msftgits added this to the Future milestone Jan 30, 2020

EgorBo mentioned this issue Feb 20, 2020

RyuJIT: should always decompose GT_MOD into a - (a / b) * b for CSE #32615

Open

Maykeye mentioned this issue Sep 12, 2020

Add System.Math.DivRem for ulong and other integers #42156

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

tannergooding mentioned this issue Sep 14, 2022

Suboptimal x64 codegen for signed Math.BigMul #75594

Open

TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022

TIHan self-assigned this Oct 31, 2022

TIHan modified the milestones: Future, 8.0.0 Jan 17, 2023

TIHan removed their assignment Jan 28, 2023

lechu445 mentioned this issue Feb 12, 2023

More optimized remainder calculation microsoft/Microsoft.IO.RecyclableMemoryStream#273

Merged

JulieLeeMSFT assigned TIHan Jun 1, 2023

TIHan modified the milestones: 8.0.0, Future Jun 9, 2023

lilinus added a commit to lilinus/runtime that referenced this issue Apr 12, 2024

Add some TODO dotnet#5213 comments for

211230d

lilinus mentioned this issue Apr 12, 2024

Add some TODO #5213 comments #100954

Closed

BruceForstall unassigned TIHan Oct 15, 2024

huoyaoyuan mentioned this issue Feb 21, 2025

JIT: Missed optimization for integer x / y; x % y #112774

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213

[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213

redknightlois commented Mar 1, 2016 •

edited by BruceForstall

Loading

mikedn commented Mar 1, 2016

mikedn commented Mar 1, 2016

jakobbotsch commented Mar 1, 2016

redknightlois commented Mar 1, 2016

redknightlois commented Mar 1, 2016

CarolEidt commented Mar 1, 2016

mikedn commented Mar 1, 2016

redknightlois commented Mar 1, 2016

redknightlois commented Nov 11, 2016 •

edited

Loading

mikedn commented Nov 11, 2016

redknightlois commented Nov 11, 2016 •

edited

Loading

mikedn commented Nov 12, 2016

redknightlois commented Nov 12, 2016

stephentoub commented Nov 14, 2016 •

edited

Loading

redknightlois commented Nov 14, 2016

mikedn commented Nov 14, 2016

stephentoub commented Nov 15, 2016

redknightlois commented Feb 17, 2017 •

edited

Loading

mikedn commented Feb 17, 2017

redknightlois commented Feb 17, 2017

mikedn commented Feb 17, 2017

redknightlois commented Feb 17, 2017 •

edited

Loading

redknightlois commented Aug 15, 2017 •

edited

Loading

mikedn commented Aug 15, 2017

redknightlois commented Aug 15, 2017

caihongxu commented Sep 21, 2020

huoyaoyuan commented Sep 24, 2022 •

edited

Loading

pentp commented Sep 26, 2022

svick commented Sep 26, 2022

pentp commented Sep 26, 2022

TIHan commented Jan 28, 2023

[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213

[RyuJIT] Div then Mod (or viceversa) executes idiv twice instead of using the value in the register #5213

Comments

redknightlois commented Mar 1, 2016 • edited by BruceForstall Loading

mikedn commented Mar 1, 2016

mikedn commented Mar 1, 2016

jakobbotsch commented Mar 1, 2016

redknightlois commented Mar 1, 2016

redknightlois commented Mar 1, 2016

CarolEidt commented Mar 1, 2016

mikedn commented Mar 1, 2016

redknightlois commented Mar 1, 2016

redknightlois commented Nov 11, 2016 • edited Loading

mikedn commented Nov 11, 2016

redknightlois commented Nov 11, 2016 • edited Loading

mikedn commented Nov 12, 2016

redknightlois commented Nov 12, 2016

stephentoub commented Nov 14, 2016 • edited Loading

redknightlois commented Nov 14, 2016

mikedn commented Nov 14, 2016

stephentoub commented Nov 15, 2016

redknightlois commented Feb 17, 2017 • edited Loading

mikedn commented Feb 17, 2017

redknightlois commented Feb 17, 2017

mikedn commented Feb 17, 2017

redknightlois commented Feb 17, 2017 • edited Loading

redknightlois commented Aug 15, 2017 • edited Loading

mikedn commented Aug 15, 2017

redknightlois commented Aug 15, 2017

caihongxu commented Sep 21, 2020

huoyaoyuan commented Sep 24, 2022 • edited Loading

pentp commented Sep 26, 2022

svick commented Sep 26, 2022

pentp commented Sep 26, 2022

TIHan commented Jan 28, 2023

redknightlois commented Mar 1, 2016 •

edited by BruceForstall

Loading

redknightlois commented Nov 11, 2016 •

edited

Loading

redknightlois commented Nov 11, 2016 •

edited

Loading

stephentoub commented Nov 14, 2016 •

edited

Loading

redknightlois commented Feb 17, 2017 •

edited

Loading

redknightlois commented Feb 17, 2017 •

edited

Loading

redknightlois commented Aug 15, 2017 •

edited

Loading

huoyaoyuan commented Sep 24, 2022 •

edited

Loading