
JIT: instruction size mis-estimates can lead to CHK/REL diffs in codegen #8748

Closed
AndyAyersMS opened this issue Aug 16, 2017 · 20 comments
Labels
area-CodeGen-coreclr · bug · JitUntriaged

@AndyAyersMS (Member) commented Aug 16, 2017

The jit's emitter works by combining instructions into instruction groups (hereafter IGs). The group boundaries can be formed naturally by labels or induced by the emitter when an internal buffer fills up.

The buffer capacity varies between CHK and REL builds, and the size of the buffer entries varies depending on the instructions being emitted. So the location of the induced IG breaks is not consistent between CHK and REL. This difference in behavior can lead to differences in codegen between CHK and REL.

In the one case I've seen, the differences come about as follows. During IG formation, the jit estimates instruction sizes and hence the size and offset of each IG. When actually emitting the code the jit may discover the instructions in an IG are smaller than predicted. This leads to cascading updates of the offsets of subsequent IGs. If in such an IG there is first a mis-estimated instruction M and then an instruction L that refers to an instruction in a subsequent IG, the offset computation for L will not take into account the correction from the mis-estimated M. If instead M and L are in different groups, L's offset will have the correction.

Because CHK and REL form IGs differently, it is possible for M and L to be together in one IG in REL but not in CHK; hence the CHK offset and REL offset differ.

I don't have a repro in CoreCLR for this yet, but here's an example bit of disassembly from a desktop test that shows the difference:

;;; CHK
000001A13F620B52 48 8D 05 17 00 00 00 lea         rax,[000001A13F620B70h]  ;; offset B70
000001A13F620B59 48 89 45 40          mov         qword ptr [rbp+40h],rax
000001A13F620B5D 48 8B 45 68          mov         rax,qword ptr [rbp+68h]
000001A13F620B61 C6 40 0C 00          mov         byte ptr [rax+0Ch],0
000001A13F620B65 48 8B 85 10 01 00 00 mov         rax,qword ptr [rbp+0000000000000110h]
000001A13F620B6C FF D0                call        rax
000001A13F620B6E 48 8B 55 68          mov         rdx,qword ptr [rbp+68h]

;;; REL
000001A13F590B52 48 8D 05 16 00 00 00 lea         rax,[000001A13F590B6Fh]   ; offset B6F
000001A13F590B59 48 89 45 40          mov         qword ptr [rbp+40h],rax
000001A13F590B5D 48 8B 45 68          mov         rax,qword ptr [rbp+68h]
000001A13F590B61 C6 40 0C 00          mov         byte ptr [rax+0Ch],0
000001A13F590B65 48 8B 85 10 01 00 00 mov         rax,qword ptr [rbp+0000000000000110h]
000001A13F590B6C FF D0                call        rax
000001A13F590B6E 48 8B 55 68          mov         rdx,qword ptr [rbp+68h]

This code sequence comes from an inline pinvoke; the LEA is supposed to store the return address for the call. CHK and REL are inconsistent here.

It turns out both are actually wrong, and the offset should be B6E, but that is a separate issue (#13398, #8747).

Since it is unlikely we can ever avoid mis-estimating sizes, the underlying algorithm to cope with mis-estimates likely needs to be rethought.

cc @dotnet/jit-contrib

category:correctness
theme:jit-coding-style
skill-level:intermediate
cost:medium

@AndyAyersMS (Member, Author) commented:

This is follow-on work inspired by #8735, looking for more examples of CHK/REL divergence.

@sdmaclea (Contributor) commented:

> The buffer capacity varies between CHK and REL builds

Any reason the buffer capacity must be different? Wouldn't it be better for them to be the same, since we use CHK as a test proxy for REL?

@AndyAyersMS (Member, Author) commented Aug 16, 2017

I think that was the intent. The buffer holds instrDesc structs and the size is (roughly) a multiple of that. However these structs come in different sizes, depending on what instructions are being emitted.

The instrDesc sizes also change between CHK and REL. So I don't see how one can pick the right buffer size.

It would be better to pick a buffer large enough to hold N of the largest instrDescs and then induce a group break when the instruction count reaches N.

(update to emphasize we're talking about instrDesc sizes, not the actual instruction sizes)

@BruceForstall (Member) commented:

> It would be better to pick a buffer large enough to hold N of the largest instrDescs and then induce a group break when the instruction count reaches N.

That might improve CHK/REL consistency, but would be memory inefficient.

> In the one case I've seen, the differences come about as follows. During IG formation, the jit estimates instruction sizes and hence the size and offset of each IG. When actually emitting the code the jit may discover the instructions in an IG are smaller than predicted. This leads to cascading updates of the offsets of subsequent IGs. If in such an IG there is first a mis-estimated instruction M and then an instruction L that refers to an instruction in a subsequent IG, the offset computation for L will not take into account the correction from the mis-estimated M. If instead M and L are in different groups, L's offset will have the correction.

Isn't the solution here to make sure all offsets take into account the currently computed adjustment (due to mis-estimation)? I assume the issue here is we don't have the correct "base" address. For forward references, we already walk a list of labels to update the offsets after emission, right? (Maybe I should go read code before commenting more...)

@AndyAyersMS (Member, Author) commented:

The known underlying issues were all fixed, but we haven't gotten around to automating testing or validation. Also, I have only really looked at x64 in detail. Without some automation this could very well regress.

Still interested in doing all this, but it's likely not happening in time for 2.1. So I will push it back to Future and reconsider once we're scouting what should go into 2.2, or whatever it is that comes next.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@kunalspathak (Member) commented:

I found some more cases where this is happening. For example, for vmovaps qword ptr [rsp+30H], xmm6, we take SIBAdjustment into account but do not emit that byte, so the estimated size is 7 bytes while the actual size is 6 bytes. I also saw that sizes are over-estimated for intrinsic instructions like vpbroadcastw. We should really convert this into an assert.

I was planning to use the estimated size to predict whether loop alignment is needed in #44370. If I can predict it early, I can avoid allocating extra bytes for INS_align when we allocate memory for a method's generated code. However, because of over-estimation, I have to do this calculation after memory is allocated, while emitting the machine code. This raises the allocation cost to 350131 bytes, or 0.68%. In other words, if I could get accurate instruction group sizes, the loop alignment change on the .NET libraries would have taken only 38243 extra bytes, or 0.07%. But since sizes are over-estimated, I have to include a 15-byte INS_align instruction wherever alignment might be needed and allocate memory for it, only to find while emitting the machine code that the INS_align is not needed because the loop doesn't meet the alignment criteria. With precise instruction sizes, we could make that prediction before allocating memory and bring the allocation cost down from 350131 to 38243 bytes (0.68% -> 0.07%).

@sandreenko (Contributor) commented:

There is a correctness issue that is something of a duplicate of this one: #12840.
I was trying to solve both in dotnet/coreclr#25956, but did not succeed and fixed the correctness issue with a workaround.

There are some useful comments and a raw prototype in that PR; the main issue was a throughput regression of around 0.66%, if I remember correctly. Also, the code-size win applied only to Crossgen, not CoreRun; is it different now?

@kunalspathak (Member) commented Nov 11, 2020

> Also the win in terms of code size was applicable only to Crossgen, not CoreRun, is it different now?

It is applicable to both Crossgen and CoreRun: for Crossgen it reduces the file size, and for CoreRun/tiered JIT it reduces the memory allocated. From what I gathered, for the .NET libraries we allocate 51824516 bytes of memory but generate code totaling 51567864 bytes, leaving 256652 bytes unused, or wasted. It is not much percentage-wise, but still enough to justify trimming.

@kunalspathak (Member) commented Nov 24, 2020

I added some logging to figure out which instructions are over-estimated and need fix-up: I printed each over-estimated instruction and ran PMI on the framework libraries. All the instructions are hardware-intrinsics related and are over-estimated by 1 byte.

vaddsd
vcvttsd2si
vmovaps
vmovd
vmovdqa
vmovdqu
vmovss
vmovupd
vmulsd
vpackuswb
vpaddw
vpand
vpandn
vpbroadcastb
vpbroadcastd
vpbroadcastw
vpcmpeqb
vpcmpeqw
vpmovmskb
vpor
vpsubw
vpunpcklbw
vucomisd
vxorps

I also tried running crossgen and, as expected, I didn't see any instruction getting over-estimated, because we do not generate hardware intrinsic instructions during crossgen.

cc: @AndyAyersMS , @BruceForstall

@BruceForstall (Member) commented:

There are two issues here: instruction size mis-estimates (which leads to inefficient too-large memory request from the VM), and CHK/REL codegen differences due to IG size and buffers. Maybe they should be split (if there aren't already separate issues covering one and/or the other).

cc @tannergooding @CarolEidt for the instruction size mis-estimations.

@tannergooding (Member) commented:

I wonder if this is something that was missed in VEX vs non-VEX. It would be interesting to see if there are any misreports for COMPlus_EnableAVX=0

@tannergooding (Member) commented:

If it's just in VEX, I expect it has to do with the 2-byte vs 3-byte encoding. The former is only sometimes possible.

@kunalspathak (Member) commented:

> There are two issues here: instruction size mis-estimates (which leads to inefficient too-large memory request from the VM), and CHK/REL codegen differences due to IG size and buffers. Maybe they should be split (if there aren't already separate issues covering one and/or the other).

Probably #12840 is the one for mis-estimates. I will add a link in that issue to the conversation happening here.

@kunalspathak (Member) commented:

> I wonder if this is something that was missed in VEX vs non-VEX. It would be interesting to see if there are any misreports for COMPlus_EnableAVX=0

The only instruction that is mis-predicted by 1 byte is pminuw.

@kunalspathak (Member) commented:

@BruceForstall or @tannergooding - Does one of you know how much work it would be to fix the misprediction for these instructions?

@kunalspathak (Member) commented:

With the help of @tannergooding I was able to address the misprediction for vpbroadcast* and will send out a PR for that. However, as Tanner pointed out, all the other instruction mispredictions date from dotnet/coreclr#21453. In dotnet/coreclr#21453, we started optimizing instructions that need a VEX prefix to use 2 bytes (in some cases) instead of 3. However, that optimization happens while emitting the machine instructions. I tried to see if a similar optimization could be replicated in the code that estimates instruction sizes, but the logic would have to be duplicated in many places, and in some of them the information needed (register, size, etc.) might not be available. The code changes would be ugly, and there is a real chance of introducing regressions.

The way I confirmed that the VEX prefix is the source of the misprediction was by printing every mispredicted instruction along with its hex encoding and running PMI on the libraries. If an instruction has the 2-byte VEX prefix, it starts with C5, as seen in this code; otherwise it starts with C4. All the mispredicted instructions started with C5, as seen in the attached summary.txt.

summary.txt

@kunalspathak (Member) commented:

I missed reporting some more instructions that are mispredicted. There are, however, only around 374 of them in the libraries. Here are the unique names:

crc32
imul
jae
jbe
je
jg
jge
jl
jmp
jne
jp
lea
mov
vmovnti
vprefetchnta
vprefetcht0
vprefetcht1
vprefetcht2

I guess jumps are expected to get mispredicted despite the earlier branch-tightening pass:

if (id->idCodeSize() != JMP_SIZE_SMALL)
{
    emitOffsAdj += id->idCodeSize() - JMP_SIZE_SMALL;
#ifdef DEBUG
    if (emitComp->verbose)
    {
        printf("; NOTE: size of jump [%08X] mis-predicted\n", emitComp->dspPtr(id));
    }
#endif
}

@BruceForstall (Member) commented:

Branch shortening isn't iterative, so in some cases a forward branch that could have been shrunk is left long, because the branches that follow it were only shrunk later in the same branch-shortening pass.

@BruceForstall (Member) commented:

The CHK/REL diffs problem is fixed by #49947

@BruceForstall (Member) commented:

Opened #50054 to track the mis-estimated instruction sizes. Closing this.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 21, 2021