JIT: instruction size mis-estimates can lead to CHK/REL diffs in codegen #8748
Comments
This is follow-on work inspired by #8735, looking for more examples of CHK/REL divergence.
Any reason the buffer capacity must be different? Wouldn't it be better for them to be the same, since we use CHK as a test proxy for REL?
I think that was the intent. The buffer holds instrDescs, which vary in size, so how many fit depends on what is being emitted. It would be better to pick a buffer large enough to hold N of the largest instrDescs and then induce a group break when the instruction count reaches N. (Update to emphasize we're talking about instrDesc sizes, not the actual instruction sizes.)
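For illustration, here is a minimal sketch of count-based group breaking along those lines. It is hypothetical, not the actual emitter code; the buffer size, count limit, and `EmitterSketch` type are invented for the example.

```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t MAX_INSTR_DESC_SIZE = 64;   // assumed upper bound on a descriptor's size
constexpr size_t MAX_INSTRS_PER_IG   = 256;  // N: instructions allowed per group

struct EmitterSketch
{
    unsigned char buffer[MAX_INSTRS_PER_IG * MAX_INSTR_DESC_SIZE];
    size_t        used       = 0;
    size_t        instrCount = 0;

    // With count-based breaking, the decision to close the group is the same
    // regardless of how large the earlier descriptors happened to be.
    bool needsGroupBreak() const
    {
        return instrCount >= MAX_INSTRS_PER_IG;
    }

    void append(size_t descSize)
    {
        if (needsGroupBreak())
        {
            printf("induced IG break after %zu instructions (%zu bytes)\n", instrCount, used);
            used = instrCount = 0; // start a new group
        }
        used += descSize;
        instrCount++;
    }
};

int main()
{
    EmitterSketch e;
    for (int i = 0; i < 1000; i++)
        e.append((i % 3 == 0) ? 48 : 16); // descriptors of varying size
    return 0;
}
```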
That might improve CHK/REL consistency, but would be memory inefficient.
Isn't the solution here to make sure all offsets take into account the currently computed adjustment (due to mis-estimation)? I assume the issue here is we don't have the correct "base" address. For forward references, we already walk a list of labels to update the offsets after emission, right? (Maybe I should go read code before commenting more...)
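As a rough illustration of that fixup idea, here is a standalone sketch with invented structures (it is not the jit's actual label-walking code): forward references are only recorded during emission and patched once final offsets, including any shrink adjustments, are known.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct LabelRef
{
    size_t patchSite;   // where the offset field lives in the output
    int    targetLabel; // which label it refers to
};

int main()
{
    std::vector<int32_t>  output(16, 0);   // stand-in for the emitted code
    std::vector<size_t>   labelOffset(4);  // final offset of each label
    std::vector<LabelRef> fixups;

    // During emission we only record the reference...
    fixups.push_back({2, 1});
    fixups.push_back({5, 3});

    // ...and final label offsets are assigned after all shrinking is applied.
    labelOffset[1] = 0x40;
    labelOffset[3] = 0xB6;

    // After emission: walk the fixup list and write the now-correct offsets.
    for (const LabelRef& f : fixups)
    {
        output[f.patchSite] = (int32_t)labelOffset[f.targetLabel];
        printf("patched site %zu -> 0x%X\n", f.patchSite, output[f.patchSite]);
    }
    return 0;
}
```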
The known underlying issues were all fixed but we haven't gotten around to automating testing or validation. Also I have only really looked at x64 in detail. And without some automation it very well could regress. Still interested in doing all this but likely not happening in time for 2.1. So will push it back to Future and reconsider once we're scouting what should go into 2.2 or whatever it is that comes next.
I found some more cases where this is happening. I was planning to use the estimated size to predict whether loop alignment is needed in #44370. If I can predict early, I can save allocating extra bytes for alignment.
There is a correctness issue that is kind of a duplicate of this one: #12840. There are some useful comments in the PR and a rough prototype; the main issue was a TP regression, around 0.66% if I remember correctly. Also, the win in terms of code size was applicable only to Crossgen, not CoreRun. Is that different now?
It is applicable to both Crossgen and CoreRun: for Crossgen it will reduce the file size, and for CoreRun/tiering JIT it will reduce the memory allocated. From what I gathered, for the .NET libraries we allocate 51,824,516 bytes but generate 51,567,864 bytes of code, leaving 256,652 bytes unused or wasted. It is not much percentage-wise, but still enough to make trimming worth considering.
I added some logging to figure out which instructions are over-estimated and need fix-up. The way I did this was printing each over-estimated instruction and running PMI on the framework libraries. All the instructions are related to hardware intrinsics and are over-estimated by 1 byte.
I also tried running crossgen and, as expected, I didn't see any instruction getting over-estimated, because we do not generate hardware intrinsic instructions during crossgen. cc: @AndyAyersMS, @BruceForstall
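For context, here is a sketch of the kind of estimated-vs-actual comparison such logging would do; the `reportOverEstimate` helper and the sample data are hypothetical, not the actual change.

```cpp
#include <cstdio>

struct EmittedInstr
{
    const char* name;
    unsigned    estimatedSize; // size predicted during IG formation
    unsigned    actualSize;    // size of the bytes actually emitted
};

// Print instructions whose predicted size exceeds the emitted size.
static void reportOverEstimate(const EmittedInstr& ins)
{
    if (ins.estimatedSize > ins.actualSize)
    {
        printf("over-estimated: %-12s estimated=%u actual=%u (off by %u)\n",
               ins.name, ins.estimatedSize, ins.actualSize,
               ins.estimatedSize - ins.actualSize);
    }
}

int main()
{
    EmittedInstr sample[] = {
        {"vaddps", 5, 4},  // hypothetical hardware-intrinsic case, off by 1
        {"mov",    3, 3},  // estimate matched, nothing reported
    };
    for (const EmittedInstr& ins : sample)
        reportOverEstimate(ins);
    return 0;
}
```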
There are two issues here: instruction size mis-estimates (which lead to inefficient, too-large memory requests from the VM), and CHK/REL codegen differences due to IG size and buffers. Maybe they should be split (if there aren't already separate issues covering one and/or the other). cc @tannergooding @CarolEidt for the instruction size mis-estimations.
I wonder if this is something that was missed in VEX vs non-VEX. It would be interesting to see if there are any misreports for the non-VEX encodings as well.
If it's just in VEX, I expect it has to do with the 2-byte vs 3-byte encoding; the former is only sometimes possible.
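A small sketch of that encoding rule (general x86 VEX behavior, not CoreCLR code): the 2-byte prefix is only usable when the instruction uses the 0F opcode map, VEX.W is 0, and the X/B extension bits are not needed. Assuming the size estimate always charges for the 3-byte prefix, the 2-byte cases come out 1 byte smaller than predicted.

```cpp
#include <cstdio>

struct VexInfo
{
    bool usesMap0F;  // opcode map is 0F (not 0F38/0F3A)
    bool needsW;     // VEX.W must be 1
    bool needsXorB;  // encoding needs REX.X or REX.B (e.g. r8-r15 as index/base)
};

// The 2-byte VEX form (0xC5) applies only when none of the 3-byte-only
// fields are required; otherwise the 3-byte form (0xC4) must be used.
static unsigned vexPrefixSize(const VexInfo& v)
{
    bool twoByteOk = v.usesMap0F && !v.needsW && !v.needsXorB;
    return twoByteOk ? 2 : 3;
}

int main()
{
    VexInfo shortForm = {true, false, false};
    VexInfo longForm  = {true, false, true};
    printf("short form prefix: %u bytes\n", vexPrefixSize(shortForm)); // 2 (starts with 0xC5)
    printf("long form prefix:  %u bytes\n", vexPrefixSize(longForm));  // 3 (starts with 0xC4)
    return 0;
}
```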
Probably #12840 is the one for mis-estimates. I will add a link in that issue to the conversation happening here.
The only instruction that is mis-predicted by 1 byte is
@BruceForstall or @tannergooding - Does one of you know how much work it would be to fix the misprediction for these instructions? |
With the help of @tannergooding I was able to address the misprediction for those instructions. The way I made sure that the VEX prefix is the source of the mis-prediction was by printing every mispredicted instruction along with its hex encoding and running PMI on the libraries. If the instruction has a 2-byte VEX prefix, it starts with 0xC5.
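A tiny illustration of that check on raw encodings (a hypothetical helper, not the actual logging code): the first encoded byte distinguishes the two VEX forms.

```cpp
#include <cstdint>
#include <cstdio>

// Classify a VEX-encoded instruction by its first prefix byte:
// 0xC5 introduces the 2-byte form, 0xC4 the 3-byte form.
static const char* vexForm(const uint8_t* code)
{
    if (code[0] == 0xC5) return "2-byte VEX";
    if (code[0] == 0xC4) return "3-byte VEX";
    return "not VEX";
}

int main()
{
    uint8_t shortEnc[] = {0xC5, 0xF8, 0x10, 0xC1};       // e.g. vmovups xmm0, xmm1
    uint8_t longEnc[]  = {0xC4, 0xE2, 0x79, 0x18, 0xC1}; // e.g. vbroadcastss xmm0, xmm1
    printf("%s\n", vexForm(shortEnc));
    printf("%s\n", vexForm(longEnc));
    return 0;
}
```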
I missed reporting some more instructions that are mispredicted; there are, however, only around 374 of them in the libraries. Here are the unique names:
I guess JUMPs are expected to get mispredicted despite doing branch shortening earlier (runtime/src/coreclr/jit/emitxarch.cpp, lines 12389 to 12399 at bc1ff08).
Branch shortening isn't iterative, so in some cases forward branches could still be shrunk afterwards if other branches that follow them were shrunk during the branch shortening pass.
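To illustrate why a single pass can leave shrinkable branches long, here is a toy model with invented sizes and offsets (it is not the jit's algorithm): the first jump only comes into short range after a later jump between it and its target shrinks, so a single forward pass misses it while iterating to a fixed point catches it.

```cpp
#include <cstdio>
#include <vector>

struct Jump
{
    int from;   // jump instruction offset
    int target; // target offset
    int size;   // 5 = long form, 2 = short form (assumed sizes)
};

// One shortening pass: shrink any long jump whose distance now fits in 8 bits,
// then shift all offsets past the shrunk jump by the saved bytes.
static bool shortenOnce(std::vector<Jump>& jumps)
{
    bool changed = false;
    for (Jump& j : jumps)
    {
        int dist = j.target - j.from;
        if (j.size == 5 && dist >= -128 && dist <= 127)
        {
            const int saved = 3;
            j.size = 2;
            for (Jump& k : jumps)
            {
                if (k.from > j.from)   k.from   -= saved;
                if (k.target > j.from) k.target -= saved;
            }
            changed = true;
        }
    }
    return changed;
}

int main()
{
    // Jump 0 is just out of short range until jump 1 (which sits between it
    // and its target) is shrunk; a single pass visits jump 0 first.
    std::vector<Jump> jumps = {{0, 130, 5}, {10, 40, 5}};

    std::vector<Jump> onePass = jumps;
    shortenOnce(onePass);

    std::vector<Jump> fixedPoint = jumps;
    while (shortenOnce(fixedPoint)) { }

    printf("single pass: jump0 size = %d\n", onePass[0].size);    // stays 5
    printf("fixed point: jump0 size = %d\n", fixedPoint[0].size); // becomes 2
    return 0;
}
```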
The CHK/REL diffs problem is fixed by #49947.
Opened #50054 to track the mis-estimated instruction sizes. Closing this. |
The jit's emitter works by combining instructions into instruction groups (hereafter IGs). The group boundaries can be formed naturally by labels or induced by the emitter when an internal buffer fills up.
The buffer capacity varies between CHK and REL builds, and the size of the buffer entries varies depending on the instructions being emitted. So the location of the induced IG breaks is not consistent between CHK and REL. This difference in behavior can lead to differences in codegen between CHK and REL.
In the one case I've seen, the differences come about as follows. During IG formation, the jit estimates instruction sizes and hence the size and offset of each IG. When actually emitting the code the jit may discover the instructions in an IG are smaller than predicted. This leads to cascading updates of the offsets of subsequent IGs. If in such an IG there is first a mis-estimated instruction M and then an instruction L that refers to an instruction in a subsequent IG, the offset computation for L will not take into account the correction from the mis-estimated M. If instead M and L are in different groups, L's offset will have the correction.
Now because CHK and REL form IGs differently, it is possible for M and L to be together in one IG in REL but not in CHK, hence the CHK offset and REL offset differ.
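A toy model of this divergence follows (made-up offsets and a deliberately simplified correction rule, not the emitter's real bookkeeping): when M and L share a group, M's shrink has not yet been folded into the downstream IG offset that L uses, whereas an induced break between them means the correction is applied first.

```cpp
#include <cstdio>

int main()
{
    // Estimated layout: one group holds instruction M (estimated 5 bytes,
    // actually 4) followed by instruction L, which refers to a label at the
    // start of a subsequent group (IG2).
    int ig2EstimatedOffset = 20; // computed from estimated sizes
    int shrinkFromM        = 1;  // M emitted 1 byte smaller than estimated

    // Case A (M and L in the same IG): the correction from M has not yet
    // been folded into IG2's offset when L is emitted.
    int offsetUsedByL_sameIG = ig2EstimatedOffset;

    // Case B (an induced break puts M and L in different IGs): IG offsets
    // are corrected at the group boundary, before L is emitted.
    int offsetUsedByL_splitIG = ig2EstimatedOffset - shrinkFromM;

    printf("L's target offset, same IG as M:  %d\n", offsetUsedByL_sameIG);  // 20
    printf("L's target offset, separate IGs:  %d\n", offsetUsedByL_splitIG); // 19
    return 0;
}
```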
I don't have a repro in CoreCLR for this yet, but here's an example bit of disassembly from a desktop test that shows the difference:
This code sequence comes from an inline pinvoke; the LEA is supposed to be storing the return address from the call. CHK and REL are inconsistent here.
It turns out both are actually wrong, and the offset should be B6E, but that is a separate issue (#13398 / #8747).
Since it is unlikely we can ever avoid mis-estimating sizes, the underlying algorithm to cope with mis-estimates likely needs to be rethought.
cc @dotnet/jit-contrib
category:correctness
theme:jit-coding-style
skill-level:intermediate
cost:medium