
JIT instruction sizes are mis-estimated #50054

Open
BruceForstall opened this issue Mar 22, 2021 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone
Future
Comments

@BruceForstall
Member

BruceForstall commented Mar 22, 2021

A lot of discussion about mis-estimated instruction sizes happened in #8748.

This issue covers the problem of instruction sizes being mis-estimated, which leads to wasted allocated code memory.

Also related: #12480

According to #8748, the instructions with mis-estimated sizes include:

vaddsd
vcvttsd2si
vmovaps
vmovd
vmovdqa
vmovdqu
vmovss
vmovupd
vmulsd
vpackuswb
vpaddw
vpand
vpandn
vpbroadcastb
vpbroadcastd
vpbroadcastw
vpcmpeqb
vpcmpeqw
vpmovmskb
vpor
vpsubw
vpunpcklbw
vucomisd
vxorps

crc32
imul
jae
jbe
je
jg
jge
jl
jmp
jne
jp
lea
mov
vmovnti
vprefetchnta
vprefetcht0
vprefetcht1
vprefetcht2

category:correctness
theme:intrinsics
skill-level:expert
cost:large
impact:medium

BruceForstall added the area-CodeGen-coreclr label Mar 22, 2021
BruceForstall added this to the Future milestone Mar 22, 2021
dotnet-issue-labeler bot added the untriaged label Mar 22, 2021
BruceForstall removed the untriaged label Mar 22, 2021
@BruceForstall
Member Author

cc @kunalspathak

@kunalspathak
Member

Existing design

Currently there are four steps in code generation; here is a rough outline of what happens (a small model of this flow follows the list):

  1. genGenerateMachineCode(): This phase iterates over all the GenTree* nodes (see genCodeForTreeNode()) and translates them to the machine-dependent data structure instrDesc. The mapping is not necessarily 1-to-1: some GenTree* nodes produce more than one instrDesc*, while others produce none. Just as GenTree* maps to instrDesc* during this phase, so does BasicBlock* map to instrGroup* (instruction group), which is a collection of the instrDesc* we create during this phase. instrDesc* are first created in a temporary buffer. Depending on the instruction, various fields are set in the instrDesc*, including registers, the instruction, the instruction format, etc. We also estimate the size of each instruction and save it in the instrDesc*. Once we finish processing all the instructions for a single instrGroup*, we save the raw data of all the instrDesc* into a memory region allocated in the ig->igData field (see emitSavIG()). Basically, we memcpy all the instrDesc* data (contiguous raw memory contents) created so far into ig->igData.

  2. At this point, since we have the estimated instruction size of every instrDesc*, we allocate the equivalent amount of memory from the runtime. This is the final code buffer where the instruction encodings will be saved for execution.

  3. genEmitMachineCode(): This phase iterates over each instrGroup* and then over each instrDesc* (which is present in ig->igData). It extracts all the metadata present in the instrDesc* (see emitIssue1Instr()) and, depending on the insFmt (instruction format), calls the appropriate emitOutput*() method to generate the raw encoding for that instruction. The raw encoding is saved in the code buffer that we requested in step 2 above. In this step, we also update the GC refs to reflect the true code addresses where GC refs go live/dead.

  4. Finally, we adjust any forward jump addresses to reflect the final code locations.
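
Here is a tiny standalone model of the estimate-then-emit flow above (not the JIT's actual data structures; InstrDescModel and the byte counts are hypothetical): step 2 allocates from the step-1 estimates, step 3 emits the real encodings, and any over-estimate becomes wasted space in the allocated code buffer.

    // Standalone model of steps 1-3; InstrDescModel and the sizes are made up.
    #include <cstdio>
    #include <vector>

    struct InstrDescModel {
        const char* name;
        int estimatedSize; // size estimated in step 1 and stored in the instrDesc
        int actualSize;    // size actually produced by emitOutput*() in step 3
    };

    int main() {
        // Illustrative over-estimates only; real numbers come from the emitter.
        std::vector<InstrDescModel> instrs = {
            {"vmovdqu", 6, 4},
            {"lea",     8, 4},
            {"jmp",     5, 2},
            {"vpand",   6, 5},
        };

        int allocated = 0; // step 2: code buffer size = sum of estimates
        int emitted   = 0; // step 3: bytes actually written
        for (const auto& id : instrs) {
            allocated += id.estimatedSize;
            emitted   += id.actualSize;
        }

        std::printf("allocated %d bytes, emitted %d, wasted %d\n",
                    allocated, emitted, allocated - emitted);
        return 0;
    }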

Proposal

Note that step 1 has all the information needed to encode an instruction, and step 3 uses exactly this information to determine the encoding. The proposal is to combine steps 1 and 3 in the following manner (a sketch of A1-A3 follows the steps):

A1. emitOutputInstr() has a big switch-case over the instruction format stored in the instrDesc*. During step 1, since we already have all the information, we could just call the various emitOutput*() methods directly. The only challenge at that point is that we do not have the final code memory where the encoded instructions will be stored. We can get around that by having a growable memory, say ig->finalCode, at the instrGroup* level (see C1 below). The encoding can be accumulated in ig->finalCode during step 1 itself.

A2. After the above step, we would perform step 2 just as it happens today, except that now we would know the accurate memory size (from ig->finalCode) that should be allocated from the runtime; let's call it finalCodeBuffer.

A3. Traverse the ig->finalCode of every instrGroup* and memcpy() it into the final code memory allocated in A2 above.

A4. In step A1, we will have to somehow track GC liveness by building a table (see C2 below) of which code addresses (within ig->finalCode) need updating. Later, once the code is moved to the final code memory, we would have to write a routine that efficiently maps each code address from ig->finalCode to finalCodeBuffer and rewrites all the addresses based on that mapping.

A5. Forward jump addresses would be fixed up the same way they are today.
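
A minimal sketch of the A1-A3 flow, with hypothetical types (InsGroupModel standing in for instrGroup*, a std::vector standing in for ig->finalCode): encodings are accumulated per group as instructions are generated, the final allocation is sized exactly from those buffers, and the copy is a straight memcpy per group.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct InsGroupModel {
        std::vector<uint8_t> finalCode; // plays the role of ig->finalCode
    };

    // A1: append the encoding to the group's buffer as soon as the
    // instruction is formed, instead of estimating now and encoding later.
    void appendEncoding(InsGroupModel& ig, const uint8_t* bytes, size_t len) {
        ig.finalCode.insert(ig.finalCode.end(), bytes, bytes + len);
    }

    // A2 + A3: the exact code size is the sum of the per-group buffers, so
    // the runtime allocation is exact and the copy is one memcpy per group.
    std::vector<uint8_t> allocateAndCopy(const std::vector<InsGroupModel>& groups) {
        size_t total = 0;
        for (const auto& ig : groups)
            total += ig.finalCode.size();

        std::vector<uint8_t> finalCodeBuffer(total); // stands in for the runtime allocation
        size_t offset = 0;
        for (const auto& ig : groups) {
            if (!ig.finalCode.empty()) {
                std::memcpy(finalCodeBuffer.data() + offset,
                            ig.finalCode.data(), ig.finalCode.size());
                offset += ig.finalCode.size();
            }
        }
        return finalCodeBuffer;
    }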

Refactoring

B1. emitOutputInstr() should be split up based on the instruction formats it handles. For example, here is a piece of code from the emitOutputInstr() method in emitxarch.cpp:

        case IF_ARD_RRD:
        case IF_AWR_RRD:
        case IF_ARW_RRD:
            code    = insCodeMR(ins);
            code    = AddSimdPrefixIfNeeded(ins, code, size);
            regcode = (insEncodeReg345(ins, id->idReg1(), size, &code) << 8);
            dst     = emitOutputAM(dst, id, code | regcode);
            sz      = emitSizeOfInsDsc(id);
            break;
            

IMO, we should just move such code into separate methods with appropriate names. In A1, we would simply call the relevant method, since we already know the insFmt and the instruction we are emitting, the two things checked in the code above.
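
A standalone sketch of that shape (hypothetical, simplified types rather than the emitter's real instrDesc/insFormat): the body of a format case becomes a named handler that both the existing switch in step 3 and the proposed step A1 could call directly.

    #include <cstdint>
    #include <cstdio>

    // Simplified stand-ins for the emitter's instruction-format and descriptor types.
    enum insFormatModel { IF_ARD_RRD, IF_AWR_RRD, IF_ARW_RRD };

    struct instrDescModel {
        insFormatModel fmt;
        int            ins; // instruction id
    };

    // Hypothetical extracted handler: the former case body for the
    // IF_ARD_RRD / IF_AWR_RRD / IF_ARW_RRD formats lives here under a name.
    static size_t outputAM_RRD(uint8_t* dst, const instrDescModel& id) {
        dst[0] = 0x0F; // placeholder byte, not a real x86 encoding
        std::printf("encoded AM/RRD form for ins %d\n", id.ins);
        return 1;      // bytes written
    }

    // The big switch shrinks to dispatch only; step A1 can skip it entirely
    // and call outputAM_RRD() directly because it already knows the format.
    static size_t outputInstr(uint8_t* dst, const instrDescModel& id) {
        switch (id.fmt) {
            case IF_ARD_RRD:
            case IF_AWR_RRD:
            case IF_ARW_RRD:
                return outputAM_RRD(dst, id);
        }
        return 0;
    }

    int main() {
        uint8_t buf[16];
        instrDescModel id{IF_ARD_RRD, 1};
        outputInstr(buf, id);
        return 0;
    }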

B2. GC-update-related code should be in a separate file IMO; today it is mixed in with the emitter code.

Data structures

C1. There are two ways in which finalCode could be stored. [1] A growable contiguous buffer of X bytes: once X bytes are filled, we allocate 2X bytes, copy the contents of the current buffer to the new one, and continue writing into the new buffer, growing it to 4X, 8X, etc. I think this option has the downside of over-allocating memory past a certain point, plus the additional memcpy cost every time the buffer is grown. [2] A simple linked list of fixed-size chunks: when a chunk is filled, we just allocate the next chunk and continue writing instruction encodings into it. In step A3, we would simply traverse the linked list and copy the contents to finalCodeBuffer. A sketch of option [2] is below.
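
A minimal sketch of option [2], using hypothetical names (CodeChunk, ChunkedCodeBuffer) and an arbitrary chunk size: encodings are appended into fixed-size chunks linked together, and step A3 becomes a walk over the chunks plus one memcpy each.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <memory>

    struct CodeChunk {
        static constexpr size_t kSize = 256;   // fixed chunk size (arbitrary here)
        uint8_t                    bytes[kSize];
        size_t                     used = 0;
        std::unique_ptr<CodeChunk> next;
    };

    class ChunkedCodeBuffer {
    public:
        ChunkedCodeBuffer() : head(new CodeChunk), tail(head.get()) {}

        // Append an encoding; when the current chunk fills up, start a new one.
        void append(const uint8_t* data, size_t len) {
            totalBytes += len;
            while (len > 0) {
                if (tail->used == CodeChunk::kSize) {
                    tail->next.reset(new CodeChunk);
                    tail = tail->next.get();
                }
                size_t room = CodeChunk::kSize - tail->used;
                size_t n    = len < room ? len : room;
                std::memcpy(tail->bytes + tail->used, data, n);
                tail->used += n;
                data += n;
                len  -= n;
            }
        }

        size_t size() const { return totalBytes; } // exact allocation size for A2

        // Step A3: walk the chunks and copy into the final code buffer.
        void copyTo(uint8_t* finalCodeBuffer) const {
            size_t offset = 0;
            for (const CodeChunk* c = head.get(); c != nullptr; c = c->next.get()) {
                std::memcpy(finalCodeBuffer + offset, c->bytes, c->used);
                offset += c->used;
            }
        }

    private:
        std::unique_ptr<CodeChunk> head;
        CodeChunk*                 tail;
        size_t                     totalBytes = 0;
    };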

C2. To track GC refs, the address is currently used just to find the offset from the start. Once we start using ig->finalCode, we should still be able to track the offset from the start, and we might not need any readjustment, as far as I understand the code. Maybe Bruce can correct me here.
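
A tiny sketch of that point, with hypothetical names (GcRefEvent and friends): if GC births/deaths are recorded as offsets from the start of the method's code rather than as raw addresses, the same records remain valid after the copy into finalCodeBuffer and only the base address changes.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct GcRefEvent {
        size_t codeOffset;  // offset from the start of the generated code
        int    regOrSlot;   // which register/stack slot the event applies to
        bool   becomesLive; // true = ref becomes live, false = it dies
    };

    // During emission (step A1) the events are recorded against offsets...
    inline void recordGcEvent(std::vector<GcRefEvent>& table, size_t offset,
                              int regOrSlot, bool becomesLive) {
        table.push_back({offset, regOrSlot, becomesLive});
    }

    // ...and after the copy (step A3) the real code address is just
    // finalCodeBuffer + offset; no per-event readjustment is needed.
    inline const uint8_t* gcEventAddress(const uint8_t* finalCodeBuffer,
                                         const GcRefEvent& e) {
        return finalCodeBuffer + e.codeOffset;
    }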

@tannergooding
Member

We may want to regenerate the list of mis-estimated instructions. There have been a few changes/improvements in this area.
