
JIT: instruction size mis-estimates can lead to CHK/REL diffs in codegen #8748

Closed
AndyAyersMS opened this issue Aug 16, 2017 · 20 comments
Labels
area-CodeGen-coreclr · bug · JitUntriaged

@AndyAyersMS (Member) commented Aug 16, 2017

The jit's emitter works by combining instructions into instruction groups (hereafter IGs). The group boundaries can be formed naturally by labels or induced by the emitter when an internal buffer fills up.

The buffer capacity varies between CHK and REL builds, and the size of the buffer entries varies depending on the instructions being emitted. So the location of the induced IG breaks is not consistent between CHK and REL. This difference in behavior can lead to differences in codegen between CHK and REL.

In the one case I've seen, the differences come about as follows. During IG formation, the jit estimates instruction sizes and hence the size and offset of each IG. When actually emitting the code the jit may discover the instructions in an IG are smaller than predicted. This leads to cascading updates of the offsets of subsequent IGs. If in such an IG there is first a mis-estimated instruction M and then an instruction L that refers to an instruction in a subsequent IG, the offset computation for L will not take into account the correction from the mis-estimated M. If instead M and L are in different groups, L's offset will have the correction.

Because CHK and REL form IGs differently, it is possible for M and L to be together in one IG in REL but not in CHK; hence the CHK offset and REL offset differ.

I don't have a repro in CoreCLR for this yet, but here's an example bit of disassembly from a desktop test that shows the difference:

;;; CHK
000001A13F620B52 48 8D 05 17 00 00 00 lea         rax,[000001A13F620B70h]  ;; offset B70
000001A13F620B59 48 89 45 40          mov         qword ptr [rbp+40h],rax
000001A13F620B5D 48 8B 45 68          mov         rax,qword ptr [rbp+68h]
000001A13F620B61 C6 40 0C 00          mov         byte ptr [rax+0Ch],0
000001A13F620B65 48 8B 85 10 01 00 00 mov         rax,qword ptr [rbp+0000000000000110h]
000001A13F620B6C FF D0                call        rax
000001A13F620B6E 48 8B 55 68          mov         rdx,qword ptr [rbp+68h]

;;; REL
000001A13F590B52 48 8D 05 16 00 00 00 lea         rax,[000001A13F590B6Fh]   ; offset B6F
000001A13F590B59 48 89 45 40          mov         qword ptr [rbp+40h],rax
000001A13F590B5D 48 8B 45 68          mov         rax,qword ptr [rbp+68h]
000001A13F590B61 C6 40 0C 00          mov         byte ptr [rax+0Ch],0
000001A13F590B65 48 8B 85 10 01 00 00 mov         rax,qword ptr [rbp+0000000000000110h]
000001A13F590B6C FF D0                call        rax
000001A13F590B6E 48 8B 55 68          mov         rdx,qword ptr [rbp+68h]

This code sequence comes from an inline pinvoke; the LEA is supposed to store the return address for the call. CHK and REL are inconsistent here.

It turns out both are actually wrong, and the offset should be B6E, but that is a separate issue (#13398, #8747).

Since it is unlikely we can ever avoid mis-estimating sizes, the underlying algorithm to cope with mis-estimates likely needs to be rethought.

cc @dotnet/jit-contrib

category:correctness
theme:jit-coding-style
skill-level:intermediate
cost:medium

@AndyAyersMS (Member, Author) commented:

This is follow-on work inspired by #8735, looking for more examples of CHK/REL divergence.

@sdmaclea (Contributor) commented:

> The buffer capacity varies between CHK and REL builds

Any reason the buffer capacity must be different? Wouldn't it be better for them to be the same, since we use CHK as a test proxy for REL?

@AndyAyersMS (Member, Author) commented Aug 16, 2017

I think that was the intent. The buffer holds instrDesc structs and the size is (roughly) a multiple of that. However these structs come in different sizes, depending on what instructions are being emitted.

The instrDesc sizes also change between CHK and REL. So I don't see how one can pick the right buffer size.

It would be better to pick a buffer large enough to hold N of the largest instrDescs and then induce a group break when the instruction count reaches N.

(update to emphasize we're talking about instrDesc sizes, not the actual instruction sizes)

@BruceForstall (Member) commented:

> It would be better to pick a buffer large enough to hold N of the largest instrDescs and then induce a group break when the instruction count reaches N.

That might improve CHK/REL consistency, but would be memory inefficient.

> In the one case I've seen, the differences come about as follows. During IG formation, the jit estimates instruction sizes and hence the size and offset of each IG. When actually emitting the code the jit may discover the instructions in an IG are smaller than predicted. This leads to cascading updates of the offsets of subsequent IGs. If in such an IG there is first a mis-estimated instruction M and then an instruction L that refers to an instruction in a subsequent IG, the offset computation for L will not take into account the correction from the mis-estimated M. If instead M and L are in different groups, L's offset will have the correction.

Isn't the solution here to make sure all offsets take into account the currently computed adjustment (due to mis-estimation)? I assume the issue here is we don't have the correct "base" address. For forward references, we already walk a list of labels to update the offsets after emission, right? (Maybe I should go read code before commenting more...)

@AndyAyersMS (Member, Author) commented:

The known underlying issues were all fixed, but we haven't gotten around to automating testing or validation. Also, I have only really looked at x64 in detail. Without some automation this could very well regress.

Still interested in doing all this, but it's likely not happening in time for 2.1. So I will push it back to Future and reconsider once we're scouting what should go into 2.2, or whatever it is that comes next.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@kunalspathak (Member) commented:

I found some more cases where this is happening. For example, for vmovaps qword ptr [rsp+30H], xmm6, we take SIBAdjustment into account but do not emit that byte, so the estimated size is 7 bytes while the actual size is 6 bytes. I also saw that sizes are over-estimated for intrinsic instructions like vpbroadcastw. We should really convert this into an assert.

I was planning to use the estimated size to predict whether loop alignment is needed in #44370. If I can predict it early, I can avoid allocating extra bytes for INS_align when we allocate memory for a method's generated code. However, because of over-estimation, I have to do this calculation after memory is allocated, while emitting the machine code. This raises the allocation cost to 350131 bytes, or 0.68%. In other words, if I could get accurate instruction group sizes, the loop alignment change on the .NET libraries would have taken only 38243 extra bytes, or 0.07%. But since sizes are over-estimated, I have to include a 15-byte INS_align instruction wherever alignment might be needed and allocate memory for it, only to find while emitting the machine code that the INS_align is not needed because the loop doesn't meet the alignment criteria. With precise instruction sizes, we could make that prediction before allocating memory and bring the allocation cost down from 350131 to 38243 bytes (0.68% -> 0.07%).

@sandreenko (Contributor) commented:

There is a correctness issue that is something of a duplicate of this one: #12840.
I was trying to solve both in dotnet/coreclr#25956, but did not succeed and fixed the correctness issue with a workaround.

There are some useful comments and a raw prototype in that PR; the main issue was a throughput regression of around 0.66%, if I remember correctly. Also, the code-size win applied only to Crossgen, not CoreRun; is it different now?

@kunalspathak (Member) commented Nov 11, 2020

> Also the win in terms of code size was applicable only to Crossgen, not CoreRun, is it different now?

It is applicable to both Crossgen and CoreRun: for Crossgen it reduces the file size, and for CoreRun/tiered JIT it reduces the memory allocated. From what I gathered, for the .NET libraries we allocate 51824516 bytes of memory but generate code totaling 51567864 bytes, leaving 256652 bytes unused, or wasted. It is not much percentage-wise, but still enough to justify trimming.

@kunalspathak (Member) commented Nov 24, 2020

I added some logging to figure out which instructions are over-estimated and need fix-up: I printed each over-estimated instruction and ran PMI on the framework libraries. All the instructions are hardware-intrinsics related and are over-estimated by 1 byte.

vaddsd
vcvttsd2si
vmovaps
vmovd
vmovdqa
vmovdqu
vmovss
vmovupd
vmulsd
vpackuswb
vpaddw
vpand
vpandn
vpbroadcastb
vpbroadcastd
vpbroadcastw
vpcmpeqb
vpcmpeqw
vpmovmskb
vpor
vpsubw
vpunpcklbw
vucomisd
vxorps

I also tried running crossgen and, as expected, I didn't see any instruction getting over-estimated, because we do not generate hardware intrinsic instructions during crossgen.

cc: @AndyAyersMS , @BruceForstall

@BruceForstall (Member) commented:

There are two issues here: instruction size mis-estimates (which leads to inefficient too-large memory request from the VM), and CHK/REL codegen differences due to IG size and buffers. Maybe they should be split (if there aren't already separate issues covering one and/or the other).

cc @tannergooding @CarolEidt for the instruction size mis-estimations.

@tannergooding (Member) commented:

I wonder if this is something that was missed in VEX vs non-VEX. It would be interesting to see if there are any misreports for COMPlus_EnableAVX=0

@tannergooding (Member) commented:

If it's just in VEX, I expect it has to do with the 2-byte vs 3-byte encoding. The former is only sometimes possible.

@kunalspathak (Member) commented:

> There are two issues here: instruction size mis-estimates (which leads to inefficient too-large memory request from the VM), and CHK/REL codegen differences due to IG size and buffers. Maybe they should be split (if there aren't already separate issues covering one and/or the other).

Probably #12840 is the one for mis-estimates. I will add a link in that issue to the conversation happening here.

@kunalspathak (Member) commented:

> I wonder if this is something that was missed in VEX vs non-VEX. It would be interesting to see if there are any misreports for COMPlus_EnableAVX=0

The only instruction that is mis-predicted by 1 byte is pminuw.

@kunalspathak (Member) commented:

@BruceForstall or @tannergooding - Does one of you know how much work it would be to fix the misprediction for these instructions?

@kunalspathak (Member) commented:

With the help of @tannergooding I was able to address the misprediction for vpbroadcast* and will send out a PR for that. However, as Tanner pointed out, all the other instruction mispredictions date from dotnet/coreclr#21453. In dotnet/coreclr#21453, we started optimizing instructions that need a VEX prefix to use 2 bytes (in some cases) instead of 3. However, that optimization happens while emitting the machine instructions. I tried to see if a similar optimization could be replicated in the code that estimates instruction sizes, but the logic would have to be duplicated in many places, and in some of them the information needed (register, size, etc.) might not be available. The code changes would be ugly, and there is a real chance of introducing regressions.

The way I confirmed that the VEX prefix is the source of the misprediction was by printing every mispredicted instruction along with its hex encoding and running PMI on the libraries. If an instruction has the 2-byte VEX prefix, it starts with C5, as seen in this code; otherwise it starts with C4. All the mispredicted instructions started with C5, as seen in the attached summary.txt.

summary.txt

@kunalspathak (Member) commented:

I missed reporting some more instructions that are mispredicted. There are, however, only around 374 of them in the libraries. Here are the unique names:

crc32
imul
jae
jbe
je
jg
jge
jl
jmp
jne
jp
lea
mov
vmovnti
vprefetchnta
vprefetcht0
vprefetcht1
vprefetcht2

I guess jumps are expected to get mispredicted despite the earlier branch-tightening pass:

if (id->idCodeSize() != JMP_SIZE_SMALL)
{
    emitOffsAdj += id->idCodeSize() - JMP_SIZE_SMALL;
#ifdef DEBUG
    if (emitComp->verbose)
    {
        printf("; NOTE: size of jump [%08X] mis-predicted\n", emitComp->dspPtr(id));
    }
#endif
}

@BruceForstall (Member) commented:

Branch shortening isn't iterative, so in some cases a forward branch that could have been shrunk is left long, because the branches that follow it were only shrunk later in the same branch-shortening pass.

@BruceForstall (Member) commented:

The CHK/REL diffs problem is fixed by #49947

@BruceForstall (Member) commented:

Opened #50054 to track the mis-estimated instruction sizes. Closing this.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 21, 2021