[JIT] [APX] Enable additional General Purpose Registers. #108799

Open
wants to merge 6 commits into
base: main

DeepakRajendrakumaran
Contributor

@DeepakRajendrakumaran DeepakRajendrakumaran commented Oct 11, 2024

What this PR does

  1. Add eGPRs to the available registers on x64 in the JIT, plus related changes to turn them on/off based on APX availability.

Currently we are adding just 8 new registers so that the total number of registers does not exceed 64. This is based on the conversation on this PR and the following conclusion: link

  2. A LSRA_LIMIT_EXT_GPR_SET register stress mode to force eGPR usage when possible.

  3. Some minor changes to turn on Rex2 encoding with eGPRs.

  4. Temporary changes to mask away eGPRs for currently unsupported instructions, primarily ones requiring eEVEX, plus imul and movszx. (These are essential while we do not have eEVEX support and will be removed once we do.)

  5. Minor flags to get altjit to work.
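
A rough sketch of how a stress mode such as LSRA_LIMIT_EXT_GPR_SET could force eGPR usage "when possible": restrict the candidate set to the extended registers only if that still leaves at least one candidate. The mask values and the function name `limitToExtGprs` are illustrative assumptions, not the JIT's actual code.

```cpp
#include <cstdint>

// Assumed bit layout: bits 0-15 are the legacy GPRs, bits 16-23 the 8 enabled eGPRs.
constexpr uint64_t kLowGprMask = 0x0000FFFFull;
constexpr uint64_t kExtGprMask = 0x00FF0000ull;

// Under stress, keep only the eGPR candidates when there are any; otherwise
// leave the candidate set untouched so that allocation can still succeed.
constexpr uint64_t limitToExtGprs(uint64_t candidates, bool stressEnabled)
{
    return (stressEnabled && (candidates & kExtGprMask) != 0) ? (candidates & kExtGprMask)
                                                              : candidates;
}
```

With stress disabled, or when no eGPR is among the candidates, the function returns its input unchanged.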

Testing

  • Ran superpmi with/without APX enabled

With APX disabled

for TP/asmdiff : link

With APX enabled

ASMDIFF
image

Code size increases due to Rex2, but PerfScore improves. Note: this is with only a subset of x64 instructions having access to eGPRs (those requiring eEVEX will get access as part of upcoming changes), and with just 8 eGPRs enabled.
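
One way to see where the code-size increase comes from: a legacy REX prefix is a single byte, while REX2 is a two-byte prefix (0xD5 followed by a payload byte). The sketch below classifies a GPR number by the prefix it needs. It deliberately ignores operand size (REX.W) and other reasons a prefix might be emitted, and the names `GprPrefix`, `prefixForReg` and `prefixBytes` are hypothetical.

```cpp
#include <cstdint>

// Register numbering assumed here: 0-7 need no prefix, 8-15 need REX,
// 16-31 (the APX extended GPRs) need REX2.
enum class GprPrefix : uint8_t
{
    None,
    Rex,
    Rex2
};

constexpr GprPrefix prefixForReg(unsigned regNum)
{
    return (regNum >= 16) ? GprPrefix::Rex2 : (regNum >= 8) ? GprPrefix::Rex : GprPrefix::None;
}

// REX is one byte (0x40-0x4F); REX2 is 0xD5 plus one payload byte.
constexpr unsigned prefixBytes(GprPrefix p)
{
    return (p == GprPrefix::Rex2) ? 2u : (p == GprPrefix::Rex) ? 1u : 0u;
}
```

Under these assumptions, an instruction that moves an operand from r12 to r17 pays one extra prefix byte.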

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 11, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 11, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@BruceForstall BruceForstall added the apx Related to the Intel Advanced Performance Extensions (APX) label Oct 15, 2024
@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as ready for review October 21, 2024 22:08
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Nov 4, 2024
@JulieLeeMSFT
Member

CC @jakobbotsch and @tannergooding for code review.

@jakobbotsch
Member

@DeepakRajendrakumaran What is the status of this PR? It's marked as ready but the description says it's built on top of #108796 that is not marked as ready.

@DeepakRajendrakumaran
Contributor Author

> @DeepakRajendrakumaran What is the status of this PR? It's marked as ready but the description says it's built on top of #108796 that is not marked as ready.

Thanks for pointing that out. It has some dependencies on other PRs - specifically the Rex2 encoding PR. Considering that, do you have a suggestion on how to mark this for now?

@DeepakRajendrakumaran
Contributor Author

DeepakRajendrakumaran commented Nov 20, 2024

@kunalspathak

Now that CPUID changes have merged, ran superpmi TP and I have a problem

image

Ran the scripts shared by Kunal a while back to debug why this is happening

The following is for libraries

Base: 798636572986, Diff: 837269651550, +4.8374%

?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z                                                                                            : 7483341082 : +105.48%  : 15.71% : +0.9370%
?allocateRegistersMinimal@LinearScan@@QEAAXXZ                                                                                                              : 5166096591 : +51.73%   : 10.84% : +0.6469%
?allocateRegisters@LinearScan@@QEAAXXZ                                                                                                                     : 3501980510 : +32.45%   : 7.35%  : +0.4385%
?processKills@LinearScan@@AEAAXPEAVRefPosition@@@Z                                                                                                         : 2761837171 : +53.97%   : 5.80%  : +0.3458%
?genConsumeReg@CodeGen@@IEAA?AW4_regNumber_enum@@PEAUGenTree@@@Z                                                                                           : 2114364155 : +56.59%   : 4.44%  : +0.2647%
?TakesRex2Prefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                                                                                         : 1652787168 : NA        : 3.47%  : +0.2070%
?freeRegisters@LinearScan@@AEAAXUregMaskTP@@@Z                                                                                                             : 1645251557 : +62.83%   : 3.45%  : +0.2060%
?mergeRegisterPreferences@Interval@@QEAAX_K@Z                                                                                                              : 1424229795 : +2637.42% : 2.99%  : +0.1783%
?AddX86PrefixIfNeeded@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                                      : 1332532027 : NA        : 2.80%  : +0.1669%
?AddX86PrefixIfNeededAndNotPresent@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                         : 1247317388 : NA        : 2.62%  : +0.1562%
?gcMarkRegPtrVal@GCInfo@@QEAAXW4_regNumber_enum@@W4var_types@@@Z                                                                                           : 1236233831 : +174.95%  : 2.59%  : +0.1548%
??$select@$0A@@RegisterSelection@LinearScan@@QEAA_KPEAVInterval@@PEAVRefPosition@@@Z                                                                       : 1044477092 : +10.11%   : 2.19%  : +0.1308%
?assignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                                                            : 749700826  : +42.11%   : 1.57%  : +0.0939%
?genCodeForBBlist@CodeGen@@IEAAXXZ                                                                                                                         : 707125092  : +11.03%   : 1.48%  : +0.0885%
?allocateRegMinimal@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z                                                                 : 704654429  : +15.88%   : 1.48%  : +0.0882%
?buildKillPositionsForNode@LinearScan@@AEAA_NPEAUGenTree@@IUregMaskTP@@@Z                                                                                  : 658845785  : +64.48%   : 1.38%  : +0.0825%
?emitOutputInstr@emitter@@IEAA_KPEAUinsGroup@@PEAUinstrDesc@1@PEAPEAE@Z                                                                                    : 658192653  : +9.65%    : 1.38%  : +0.0824%
?emitGCregDeadUpd@emitter@@QEAAXW4_regNumber_enum@@PEAE@Z                                                                                                  : 629879757  : +107.83%  : 1.32%  : +0.0789%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                                                   : 546122060  : +24.24%   : 1.15%  : +0.0684%
?emitStackPopLargeStk@emitter@@QEAAXPEAE_NEI@Z                                                                                                             : 525848563  : +104.66%  : 1.10%  : +0.0658%
?emitGetAdjustedSize@emitter@@QEBAIPEAUinstrDesc@1@_K@Z                                                                                                    : 487696755  : +31.37%   : 1.02%  : +0.0611%
?emitGCregLiveUpd@emitter@@QEAAXW4GCtype@@W4_regNumber_enum@@PEAE@Z                                                                                        : 451135285  : +59.41%   : 0.95%  : +0.0565%
?buildPhysRegRecords@LinearScan@@AEAAXXZ                                                                                                                   : 417375644  : +52.32%   : 0.88%  : +0.0523%
?AddRexWPrefix@emitter@@QEAA_KPEBUinstrDesc@1@_K@Z                                                                                                         : 337122934  : +62.86%   : 0.71%  : +0.0422%
?TakesEvexPrefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                                                                                         : 326871135  : +13.69%   : 0.69%  : +0.0409%
?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z                                                              : 289859613  : +3.27%    : 0.61%  : +0.0363%
??0LinearScan@@QEAA@PEAVCompiler@@@Z                                                                                                                       : 287558884  : +56.87%   : 0.60%  : +0.0360%
?emitOutputRexOrSimdPrefixIfNeeded@emitter@@QEAAIW4instruction@@PEAEAEA_K@Z                                                                                : 276256843  : +10.64%   : 0.58%  : +0.0346%
?emitIns_Call@emitter@@QEAAXW4EmitCallType@1@PEAUCORINFO_METHOD_STRUCT_@@PEAX_JW4emitAttr@@AEBQEA_KUregMaskTP@@6AEBVDebugInfo@@W4_regNumber_enum@@8I3_N9@Z : 251568991  : +17.79%   : 0.53%  : +0.0315%
?resetAllRegistersState@LinearScan@@AEAAXXZ                                                                                                                : 250671960  : +48.42%   : 0.53%  : +0.0314%
?emitUpdateLiveGCregs@emitter@@QEAAXW4GCtype@@UregMaskTP@@PEAE@Z                                                                                           : 236180536  : +61.03%   : 0.50%  : +0.0296%
?BuildNode@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                                                : 211945171  : +3.63%    : 0.44%  : +0.0265%
?genUpdateRegLife@CodeGenInterface@@QEAAXPEBVLclVarDsc@@_N1@Z                                                                                              : 208334297  : +146.29%  : 0.44%  : +0.0261%
?unassignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVRefPosition@@@Z                                                                                       : 204285611  : +8.70%    : 0.43%  : +0.0256%
?BuildCall@LinearScan@@AEAAHPEAUGenTreeCall@@@Z                                                                                                            : 201972715  : +19.34%   : 0.42%  : +0.0253%
?genProduceReg@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                                               : 156386903  : +5.43%    : 0.33%  : +0.0196%
?emitGetGCRegsSavedOrModified@emitter@@QEAA?AUregMaskTP@@PEAUCORINFO_METHOD_STRUCT_@@@Z                                                                    : 155613852  : NA        : 0.33%  : +0.0195%
??$resolveRegisters@$00@LinearScan@@QEAAXXZ                                                                                                                : 154302371  : +4.84%    : 0.32%  : +0.0193%
??$compChangeLife@$00@Compiler@@QEAAXAEBQEA_K@Z                                                                                                            : 150051997  : +21.15%   : 0.31%  : +0.0188%
?genPushCalleeSavedRegisters@CodeGen@@IEAAXXZ                                                                                                              : 136488370  : +268.86%  : 0.29%  : +0.0171%
?BuildRMWUses@LinearScan@@AEAAHPEAUGenTree@@00_K1@Z                                                                                                        : 119460162  : NA        : 0.25%  : +0.0150%
?emitInsSize@emitter@@QEAAIPEAUinstrDesc@1@_K_N@Z                                                                                                          : 113904280  : +11.97%   : 0.24%  : +0.0143%
??$resolveRegisters@$0A@@LinearScan@@QEAAXXZ                                                                                                               : 99510485   : +3.12%    : 0.21%  : +0.0125%
?BuildIndir@LinearScan@@AEAAHPEAUGenTreeIndir@@@Z                                                                                                          : 96091030   : +48.43%   : 0.20%  : +0.0120%
?compInitOptions@Compiler@@IEAAXPEAVJitFlags@@@Z                                                                                                           : 89279923   : +9.31%    : 0.19%  : +0.0112%
?instGen_Set_Reg_To_Imm@CodeGen@@QEAAXW4emitAttr@@W4_regNumber_enum@@_JW4insFlags@@@Z                                                                      : 78050532   : +26.93%   : 0.16%  : +0.0098%
?resolveLocalRef@LinearScan@@AEAAXPEAUBasicBlock@@PEAUGenTreeLclVar@@PEAVRefPosition@@@Z                                                                   : 74859133   : +3.74%    : 0.16%  : +0.0094%
??$allocateReg@$0A@@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z                                                                 : 74540254   : +6.75%    : 0.16%  : +0.0093%
memset                                                                                                                                                     : 73679442   : +1.13%    : 0.15%  : +0.0092%
?emitOutputRI@emitter@@QEAAPEAEPEAEPEAUinstrDesc@1@@Z                                                                                                      : 67994420   : +6.26%    : 0.14%  : +0.0085%
?insEncodeReg012@emitter@@QEAAIPEBUinstrDesc@1@W4_regNumber_enum@@W4emitAttr@@PEA_K@Z                                                                      : 65961952   : +6.54%    : 0.14%  : +0.0083%
?genSetRegToConst@CodeGen@@IEAAXW4_regNumber_enum@@W4var_types@@PEAUGenTree@@@Z                                                                            : 63572129   : +16.91%   : 0.13%  : +0.0080%
?emitInsSizeSV@emitter@@QEAAIPEAUinstrDesc@1@_KHH@Z                                                                                                        : 58355992   : +5.91%    : 0.12%  : +0.0073%
?BuildDefWithKills@LinearScan@@AEAAXPEAUGenTree@@H_KUregMaskTP@@@Z                                                                                         : 56553707   : +40.78%   : 0.12%  : +0.0071%
?BuildCast@LinearScan@@AEAAHPEAUGenTreeCast@@@Z                                                                                                            : 56461244   : NA        : 0.12%  : +0.0071%
?BuildStoreLocDef@LinearScan@@AEAAXPEAUGenTreeLclVarCommon@@PEAVLclVarDsc@@PEAVRefPosition@@H@Z                                                            : 53688042   : +14.79%   : 0.11%  : +0.0067%
?emitOutputRR@emitter@@QEAAPEAEPEAEPEAUinstrDesc@1@@Z                                                                                                      : 53279956   : +3.55%    : 0.11%  : +0.0067%
?genCallInstruction@CodeGen@@IEAAXPEAUGenTreeCall@@@Z                                                                                                      : 50084108   : +5.82%    : 0.11%  : +0.0063%
?emitHandleMemOp@emitter@@AEAAXPEAUGenTreeIndir@@PEAUinstrDesc@1@W4insFormat@1@W4instruction@@@Z                                                           : -58626864  : -10.34%   : 0.12%  : -0.0073%
?getMatchingConstants@LinearScan@@AEAA_K_KPEAVInterval@@PEAVRefPosition@@@Z                                                                                : -79107557  : -100.00%  : 0.17%  : -0.0099%
?emitSizeOfInsDsc_CNS@emitter@@AEBA_KPEAUinstrDesc@1@@Z                                                                                                    : -90499395  : -98.48%   : 0.19%  : -0.0113%
?BuildRMWUses@LinearScan@@AEAAHPEAUGenTree@@00_K@Z                                                                                                         : -120949346 : -100.00%  : 0.25%  : -0.0151%
?BuildGCWriteBarrier@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                                      : -146449406 : -100.00%  : 0.31%  : -0.0183%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                                                                                          : -188074386 : -3.81%    : 0.39%  : -0.0235%
?addKillForRegs@LinearScan@@AEAAXUregMaskTP@@I@Z                                                                                                           : -213435792 : -100.00%  : 0.45%  : -0.0267%
?BuildSimple@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                                              : -345016623 : -99.92%   : 0.72%  : -0.0432%
?genCodeForTreeNode@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                                          : -414160388 : -6.66%    : 0.87%  : -0.0519%
?updateRegisterPreferences@Interval@@QEAAX_K@Z                                                                                                             : -580317174 : -100.00%  : 1.22%  : -0.0727%
?AddSimdPrefixIfNeededAndNotPresent@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                        : -885312893 : -100.00%  : 1.86%  : -0.1109%
?AddSimdPrefixIfNeeded@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                                     : -984986225 : -100.00%  : 2.07%  : -0.1233%
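
The summary line above the table can be reproduced from the Base and Diff instruction counts. The last column of each row appears to be that function's instruction delta divided by the base total; that reading is an inference from the numbers, not something the tool output states. A small sketch, with hypothetical helper names:

```cpp
// Relative change of `diff` over `base`, in percent.
constexpr double percentChange(double base, double diff)
{
    return (diff - base) / base * 100.0;
}

// One function's instruction delta as a share of the base total, in percent.
constexpr double contribution(double base, double delta)
{
    return delta / base * 100.0;
}

// Figures copied from the libraries run above.
constexpr double kBase = 798636572986.0;
constexpr double kDiff = 837269651550.0;
constexpr double kProcessBlockStartLocationsDelta = 7483341082.0;
```

Here `percentChange(kBase, kDiff)` matches the reported +4.8374%, and `contribution(kBase, kProcessBlockStartLocationsDelta)` matches the +0.9370% shown for processBlockStartLocations.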

@DeepakRajendrakumaran
Contributor Author

DeepakRajendrakumaran commented Nov 26, 2024

Trying to further make sure the Rex2 changes are not causing TP regression. We can safely conclude the TP regression is from eGPR enablement

The following is with/without Rex2 changes(without reg alloc changes)

Overall (+0.08% to +0.23%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.16% |
| coreclr_tests.run.windows.x64.checked.mch | +0.23% |
| libraries.crossgen2.windows.x64.checked.mch | +0.14% |
| libraries.pmi.windows.x64.checked.mch | +0.11% |
| libraries_tests.run.windows.x64.Release.mch | +0.18% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +0.12% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +0.08% |

MinOpts (+0.28% to +0.48%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.48% |
| coreclr_tests.run.windows.x64.checked.mch | +0.36% |
| libraries.crossgen2.windows.x64.checked.mch | +0.38% |
| libraries.pmi.windows.x64.checked.mch | +0.37% |
| libraries_tests.run.windows.x64.Release.mch | +0.47% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +0.40% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +0.28% |

FullOpts (+0.08% to +0.14%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.10% |
| coreclr_tests.run.windows.x64.checked.mch | +0.14% |
| libraries.crossgen2.windows.x64.checked.mch | +0.14% |
| libraries.pmi.windows.x64.checked.mch | +0.11% |
| libraries_tests.run.windows.x64.Release.mch | +0.10% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +0.12% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +0.08% |

With Rex2 as base and eGPR changes as diff

Overall (+3.60% to +4.65%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +4.33% |
| coreclr_tests.run.windows.x64.checked.mch | +4.65% |
| libraries.crossgen2.windows.x64.checked.mch | +4.29% |
| libraries.pmi.windows.x64.checked.mch | +3.76% |
| libraries_tests.run.windows.x64.Release.mch | +4.65% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +3.60% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +3.66% |

MinOpts (+6.09% to +8.79%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +8.27% |
| coreclr_tests.run.windows.x64.checked.mch | +6.09% |
| libraries.crossgen2.windows.x64.checked.mch | +7.27% |
| libraries.pmi.windows.x64.checked.mch | +6.86% |
| libraries_tests.run.windows.x64.Release.mch | +8.42% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +7.16% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +8.79% |

FullOpts (+3.47% to +4.29%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +3.59% |
| coreclr_tests.run.windows.x64.checked.mch | +3.60% |
| libraries.crossgen2.windows.x64.checked.mch | +4.29% |
| libraries.pmi.windows.x64.checked.mch | +3.76% |
| libraries_tests.run.windows.x64.Release.mch | +3.47% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +3.52% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +3.66% |

Member

@kunalspathak kunalspathak left a comment

Went through the first pass, need to evaluate where TP regression is coming from. However, I still see some asmdiffs...can you please fix it?

@@ -12534,6 +12555,9 @@ void LinearScan::verifyResolutionMove(GenTree* resolutionMove, LsraLocation curr
LinearScan::RegisterSelection::RegisterSelection(LinearScan* linearScan)
{
this->linearScan = linearScan;
#if defined(TARGET_AMD64)
rbmAllInt = linearScan->compiler->get_RBM_ALLINT();
Member

any reason why we need it here instead of LinearScan ctor (which you are already doing)?

Contributor Author

Removed

@@ -742,6 +743,7 @@ class emitter
// The instrDescCGCA struct's member keeping the GC-ness of the first return register is _idcSecondRetRegGCType.
GCtype _idGCref : 2; // GCref operand? (value is a "GCtype")

#if !defined(TARGET_AMD64)
Member

any reason for this change?

Contributor Author

Alignment - having _idReg1/_idReg2 here with increased size caused padding and increased size even more
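
The padding effect mentioned here can be illustrated with a generic example. The two structs below are hypothetical stand-ins, not the real instrDesc layout: they hold the same fields, but interleaving a wider field between narrow ones forces the compiler to insert padding.

```cpp
#include <cstdint>

// Hypothetical fields for illustration only.
struct Interleaved
{
    uint8_t  flagsA;
    uint32_t regs;   // typically preceded by 3 bytes of padding
    uint8_t  flagsB; // typically followed by 3 bytes of tail padding
};

struct Grouped
{
    uint32_t regs;
    uint8_t  flagsA;
    uint8_t  flagsB; // typically followed by 2 bytes of tail padding
};
```

On common x64 ABIs `Interleaved` is typically 12 bytes and `Grouped` 8 bytes, which is why moving a widened field next to fields of compatible alignment can avoid growing the struct.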

@@ -62,7 +62,12 @@ bool regMaskTP::IsRegNumInMask(regNumber reg, var_types type) const
//
void regMaskTP::AddGprRegs(SingleTypeRegSet gprRegs)
{
// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support.
#if defined(TARGET_AMD64)
assert((gprRegs == RBM_NONE) || ((gprRegs & RBM_ALLINT_STATIC_ALL) != RBM_NONE));
Member

for non-APX machines, gpr will still be 0-15 and with this assert, we will allow float register to get set, right?

Contributor Author

> for non-APX machines, gpr will still be 0-15 and with this assert, we will allow float register to get set, right?

Not really. On both APX and non-APX machines, bits 0-23 are reserved for GPRs (16-23 being the eGPRs) and bits 24-55 for SIMD. We just make sure that 16-23 are not used for non APX machines
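
A minimal sketch of the layout described in this reply, assuming bits 0-15 hold the legacy GPRs, bits 16-23 the eGPRs and bits 24-55 the SIMD registers. The constant and function names are hypothetical.

```cpp
#include <cstdint>

// Mask with bits lo..hi (inclusive) set.
constexpr uint64_t bitRange(unsigned lo, unsigned hi)
{
    return ((hi - lo + 1 >= 64) ? ~0ull : ((1ull << (hi - lo + 1)) - 1)) << lo;
}

constexpr uint64_t kLegacyGpr = bitRange(0, 15);
constexpr uint64_t kExtGpr    = bitRange(16, 23);
constexpr uint64_t kSimd      = bitRange(24, 55);

// The layout is the same on every machine; only the set of registers
// handed to the allocator depends on APX availability.
constexpr uint64_t availableIntRegs(bool apxAvailable)
{
    return apxAvailable ? (kLegacyGpr | kExtGpr) : kLegacyGpr;
}
```

Because the SIMD bits start at 24 regardless of APX support, a check against the full 0-23 range never admits a float register.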

Member

> We just make sure that 16-23 are not used for non APX machines

how are we making sure? worth adding some asserts.
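
One possible shape for such an assert, sketched under the assumption that the eGPRs occupy bits 16-23 of the mask. `isValidGprMask` and `addGprRegs` are hypothetical names, not the functions in the PR.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kExtGprBits = 0x00FF0000ull; // bits 16-23, assumed eGPR range

// A GPR mask is legal when eGPR bits are set only on machines with APX.
constexpr bool isValidGprMask(uint64_t gprRegs, bool apxAvailable)
{
    return apxAvailable || ((gprRegs & kExtGprBits) == 0);
}

// Hypothetical call site: a debug-only check before merging registers into a mask.
inline uint64_t addGprRegs(uint64_t mask, uint64_t gprRegs, bool apxAvailable)
{
    assert(isValidGprMask(gprRegs, apxAvailable));
    return mask | gprRegs;
}
```

The check costs nothing in release builds, where `assert` compiles away.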


// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. Deprecated????
#if defined(TARGET_AMD64)
sprintf_s(regmask, cchRegMask, REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow());
Member

why we need RBM_ALLINT_STATIC_ALL here? it should just use RBM_ALLINT and it should return the right mask depending on if high int registers are available or not.

Contributor Author

> RBM_ALLINT and it should return the right mask depending on if high int registers are available or not

I'm not sure we can do that here. RBM_ALLINT calls get_RBM_ALLINT(). One way to make it work would be to make this method part of the compiler class?

Member

thinking about it...RBM_ALLINT_STATIC_ALL should be the one we should be using and we can have it for both x86 and x64 for consistency.
Alternatively, if you decide to add rbmAllInt on x86, we can just use RBM_ALLINT here.

Contributor Author

// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. These are used by GC
// exclusively
#if defined(TARGET_AMD64)
printf(REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow());
Member

likewise here...can just use RBM_ALLINT?

@@ -3136,4 +3347,51 @@ inline SingleTypeRegSet LinearScan::BuildEvexIncompatibleMask(GenTree* tree)
#endif
}

inline bool LinearScan::DoesThisUseGPR(GenTree* op)
Member

can you add method docs for this and below method?

Member

Method docs please.

Contributor Author

Added

return false;
}

inline SingleTypeRegSet LinearScan::BuildApxIncompatibleGPRMask(GenTree* tree, bool forceLowGpr)
Member

what is the goal of this method?

SingleTypeRegSet op1Candidates = candidates;
SingleTypeRegSet op2Candidates = candidates;
int srcCount = 0;
// SingleTypeRegSet op1Candidates = candidates;
Member

there are lot of such comments in this file. can you please delete them?


// We are dealing exclusively with HWIntrinsics here
return (op->AsHWIntrinsic()->OperIsBroadcastScalar() ||
(op->AsHWIntrinsic()->OperIsMemoryLoad() && DoesThisUseGPR(op->AsHWIntrinsic()->Op(1))));
Member

do we only care if Op(1) uses GPR, not any other operand?

Contributor Author

Yes. For xarch, Op(1) is the memory address for nodes satisfying GenTreeHWIntrinsic::OperIsMemoryLoad, with the exception of 4 intrinsics (those 4 will not use this). And a GPR is likely to be used only for memory addressing in these cases.

else
{
// ToDo-APX : imul currently doesn't have rex2 support. So, cannot use R16-R31.
dstCandidates = BuildApxIncompatibleGPRMask(tree, true);
Member

Calls to BuildApxIncompatibleGPRMask for many nodes seems expensive. Wondering if we can do something like:

  1. at the top just set SingleTypeRegSet incompatibleGprMask = compiler->canUseApxEncoding() ? lowGPRRegs() : RBM_NONE;
  2. Places where you are passing forceLowGpr= true can instead just use incompatibleGprMask.
  3. Places where you are not forcing lowGPr, can just use DoesThisUseGPR(tree) ? incompatibleGprMask : RBM_NONE

Also, it might be worth caching the value of lowGPRRegs(), because currently it is evaluated every time as (availableIntRegs & RBM_LOWINT.GetIntRegSet()) and I see lowGprRegs() is used in a lot of places.
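
The three steps and the caching suggestion could look roughly like the sketch below. `BuildContext`, `kLowInt` and the method names are hypothetical, and plain 64-bit masks stand in for SingleTypeRegSet.

```cpp
#include <cstdint>

using RegSet = uint64_t;
constexpr RegSet kNone   = 0;
constexpr RegSet kLowInt = 0x0000FFFFull; // hypothetical stand-in for RBM_LOWINT

struct BuildContext
{
    RegSet lowGprRegs;          // cached once: availableIntRegs & kLowInt
    RegSet incompatibleGprMask; // step 1: low GPRs if APX encoding is usable, else none

    constexpr BuildContext(RegSet availableIntRegs, bool canUseApxEncoding)
        : lowGprRegs(availableIntRegs & kLowInt)
        , incompatibleGprMask(canUseApxEncoding ? (availableIntRegs & kLowInt) : kNone)
    {
    }

    // Step 2: nodes that must stay in low GPRs use the cached mask directly.
    constexpr RegSet forcedLow() const
    {
        return incompatibleGprMask;
    }

    // Step 3: other nodes restrict only when they actually use a GPR.
    constexpr RegSet whenUsesGpr(bool usesGpr) const
    {
        return usesGpr ? incompatibleGprMask : kNone;
    }
};
```

Both masks are computed once per context instead of on every node, which is the point of the suggestion.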

@kunalspathak
Member

It seems from your latest change, there are still asmdiffs coming up. I think there are places in emitxarch.cpp that still rely on REG_R31 instead of get_REG_INT_LAST. Also, I am little surprised with the tpdiff numbers. Locally when I run tpdiff for asp.net collection, I get these numbers:

image

vs. what CI is showing

image

what does it show for you locally?

@kunalspathak
Member

> Also, I am little surprised with the tpdiff number

That was happening because my environment was not setup correctly. I can now see the same TP diffs that is shown in CI.

@DeepakRajendrakumaran
Contributor Author

> > Also, I am little surprised with the tpdiff number
>
> That was happening because my environment was not setup correctly. I can now see the same TP diffs that is shown in CI.

This reduced it by somewhere around 0.8%. Without that change for comparison - https://github.com/dotnet/runtime/pull/111004/checks?check_run_id=34998081643

@kunalspathak
Member

Just a note that the TP regression we see here will impact not only non-APX machines but also AMD machines which do not have APX feature. We should add that consideration too while working on this on how we can reduce or have no impact on AMD.

@DeepakRajendrakumaran
Contributor Author

@tannergooding @kunalspathak The current plan is to do the following:

  • Modify this PR to enable only a limited number of eGPRs so that regMaskTP remains a uint64_t for now (I updated the PR to use only r0-r23, but I see a bunch of failures which I'm trying to fix).

As a follow-up, do the following:

  • Rework regMaskTP to utilize 64 bits for GPR/EGPR, SIMD, KMASK, with the special registers hardcoded.
  • Based on that implementation, address any low-hanging optimizations.
    - Look into Tanner's other suggestions to rework regMaskTP to use the full set of registers for APX machines.

@kunalspathak @tannergooding I have made the required changes. Can you guys please review this now?

@@ -5725,7 +5732,11 @@ void CodeGen::genFnProlog()

if (initRegs)
{
#ifdef TARGET_AMD64
for (regNumber reg = REG_INT_FIRST; reg <= REG_INT_LAST_APX_AWARE; reg = REG_NEXT(reg))
Member

can we also have get_REG_LAST_INT on x86 and not have this #ifdef-else? For x86, it will be just set to last REG INT.

Contributor Author

This is a common code path for all targets including arm. I can make it so that I expose get_REG_INT_LAST() for all targets and just return REG_INT_LAST for everything other than AMD64
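
A sketch of the proposal: one accessor exposed on every target that returns the static last integer register everywhere except AMD64 with APX. The register numbers and the names `TargetInfo`, `regIntLast` and `countIntRegs` are assumptions for illustration; real targets such as arm have their own register counts.

```cpp
// Hypothetical register numbers; the JIT's real enum differs per target.
constexpr unsigned kRegIntFirst    = 0;
constexpr unsigned kRegIntLastBase = 15; // last legacy GPR in this sketch
constexpr unsigned kRegIntLastApx  = 23; // last of the 8 enabled eGPRs

struct TargetInfo
{
    bool isAmd64;
    bool apxAvailable;

    // Single accessor for all targets: only AMD64 with APX extends the range.
    constexpr unsigned regIntLast() const
    {
        return (isAmd64 && apxAvailable) ? kRegIntLastApx : kRegIntLastBase;
    }
};

// Number of registers a prolog-style loop would visit, with no #ifdef at the call site.
constexpr unsigned countIntRegs(TargetInfo t)
{
    return t.regIntLast() - kRegIntFirst + 1;
}
```

The call site then iterates from the first register to `regIntLast()` on every platform, and the target-specific decision lives in one place.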

Contributor Author

Would you like me to make the proposed change?

Contributor Author

Have changed this.

#if defined(TARGET_AMD64)
// TODO-Xarch-apx : Revert. Excluding eGPR so that it's not used for non REX2 supported movs. Revisit this one.
// Might not be necessary.
regNumber tmpReg = internalRegisters.GetSingle(tree, RBM_ALLINT_INIT);
Member

why can't we just use RBM_ALLINT here? It will expand to get_RBM_ALLINT(), which should give the right set of registers.

Member

Since x86 doesn't have one, I don't think the TP impact will be that much, and it will make the code a little cleaner.

Contributor Author

I need to test this out. The intent was not to use eGPR here, but I might be able to work around it.

Contributor Author

Removed this. It is not needed


// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. Deprecated????
#if defined(TARGET_AMD64)
sprintf_s(regmask, cchRegMask, REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow());
Member

thinking about it...RBM_ALLINT_STATIC_ALL should be the one we should be using and we can have it for both x86 and x64 for consistency.
Alternatively, if you decide to add rbmAllInt on x86, we can just use RBM_ALLINT here.

//
void regMaskTP::AddGprRegs(SingleTypeRegSet gprRegs)
void regMaskTP::AddGprRegs(SingleTypeRegSet gprRegs, regMaskTP availableIntRegs)
Member

all the callers are passing RBM_ALLINT to this method, so perhaps we do not need it and can just use the existing code.

Contributor Author

The reason is that 'RBM_ALLINT' uses get_RBM_ALLINT() and that's not available at compile time or from regMaskTP

Contributor Author

Putting this here because I for some reason cannot respond to the above comment.

rbmAllInt can be accessed by methods which are part of classes which have rbmAllInt. There is not really a way for global methods to access these. That's why I'm using RBM_ALLINT_STATIC_ALL
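
The distinction can be sketched as follows, assuming a runtime-decided mask stored on a class instance and a compile-time superset constant for code that has no instance. All names ending in `Sketch` and the bit layout are hypothetical; this is not the JIT's actual `regMaskTP` or `Compiler` code.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t regMaskSketch; // stand-in for the JIT's register mask type

// Compile-time constants (hypothetical layout: bit n = integer register n).
const regMaskSketch RBM_LOWINT_SKETCH        = 0x0000FFFFull; // 16 legacy GPRs
const regMaskSketch RBM_ALLINT_STATIC_SKETCH = 0x00FFFFFFull; // legacy GPRs plus 8 eGPRs

class CompilerSketch
{
public:
    // The usable set is decided at runtime from APX availability.
    explicit CompilerSketch(bool apxAvailable)
        : rbmAllInt(apxAvailable ? RBM_ALLINT_STATIC_SKETCH : RBM_LOWINT_SKETCH)
    {
    }

    regMaskSketch get_RBM_ALLINT() const
    {
        return rbmAllInt;
    }

    // A member function can consult the runtime-decided mask.
    regMaskSketch filterInt(regMaskSketch mask) const
    {
        return mask & get_RBM_ALLINT();
    }

private:
    regMaskSketch rbmAllInt;
};

// A free function has no instance to ask, so it can only use the static superset.
inline regMaskSketch filterIntStatic(regMaskSketch mask)
{
    assert((RBM_LOWINT_SKETCH & ~RBM_ALLINT_STATIC_SKETCH) == 0); // superset invariant
    return mask & RBM_ALLINT_STATIC_SKETCH;
}
```

The static superset is safe for display or classification purposes, since it never drops a register that could be live, but it cannot tell whether the eGPRs are actually usable in the current compilation.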

regNumber AbsRegNumber(regNumber reg)
{
assert(reg < REG_STK);
if ((reg >= XMMBASE) && (reg < KBASE))
Member

can eliminate one condition by writing it:

if (reg >= KBASE)
{
  return (regNumber)(reg - KBASE);
}
else if (reg >= XMMBASE)
{
  return (regNumber)(reg - XMMBASE);
}
else
{
  return reg;
}

Contributor Author

Done
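
As a quick sanity check of the suggestion, the sketch below compares a two-sided range check against the reordered form over every valid register number. The base values are hypothetical, and the body of the "original" form is an assumed shape, since only its first condition appears in the diff. The reordering relies on `KBASE` being greater than `XMMBASE`.

```cpp
#include <cassert>

// Hypothetical base values; only their ordering (XMMBASE < KBASE < REG_STK) matters here.
const unsigned XMMBASE_SKETCH = 32;
const unsigned KBASE_SKETCH   = 64;
const unsigned REG_STK_SKETCH = 72;

// Assumed shape of the original: an explicit two-sided range check for the XMM bank.
inline unsigned absRegOriginal(unsigned reg)
{
    assert(reg < REG_STK_SKETCH);
    if ((reg >= XMMBASE_SKETCH) && (reg < KBASE_SKETCH))
    {
        return reg - XMMBASE_SKETCH;
    }
    else if (reg >= KBASE_SKETCH)
    {
        return reg - KBASE_SKETCH;
    }
    return reg;
}

// Suggested form: test the highest bank first, so each branch needs one comparison.
inline unsigned absRegSuggested(unsigned reg)
{
    assert(reg < REG_STK_SKETCH);
    if (reg >= KBASE_SKETCH)
    {
        return reg - KBASE_SKETCH;
    }
    else if (reg >= XMMBASE_SKETCH)
    {
        return reg - XMMBASE_SKETCH;
    }
    return reg;
}

// Returns true when both forms agree for every valid register number.
inline bool formsAgree()
{
    for (unsigned reg = 0; reg < REG_STK_SKETCH; reg++)
    {
        if (absRegOriginal(reg) != absRegSuggested(reg))
        {
            return false;
        }
    }
    return true;
}
```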

@@ -5910,7 +5935,12 @@ void emitter::emitIns_R(instruction ins, emitAttr attr, regNumber reg)
noway_assert(emitVerifyEncodable(ins, size, reg));

UNATIVE_OFFSET sz;
instrDesc* id = emitNewInstrSmall(attr);
instrDesc* id = emitNewInstrSmall(attr);
Member

this change can be reverted I suppose?

Contributor Author

@DeepakRajendrakumaran Jan 24, 2025

I did not add this. For some reason, jit format fails in CI and the patch created has this

@@ -37,6 +37,7 @@
DOTNET_EnableSSE41;
DOTNET_EnableSSE42;
DOTNET_EnableSSSE3;
DOTNET_EnableAPX;
Member

This is supposed to get used below in one of the pipelines. Is it intended not to do so at this point? If yes, then maybe just remove it and add it back when we enable the pipeline to test it?

Contributor Author

This I was not sure if needed. I added it since we provide a flag to turn it off for other similar features. I can remove it

//
// Return Value:
// updated register mask.
inline SingleTypeRegSet LinearScan::BuildApxIncompatibleGPRMask(GenTree* tree,
Member

Did you give a thought on #108799 (comment)?

Contributor Author

I missed that originally. Will update

Contributor Author

I had to make a small change here from the original implementation. It now takes in the current candidates and masks away the eGPRs from the input current candidates when necessary
image

Functionally do you think it makes that much difference now? The method LinearScan::BuildApxIncompatibleGPRMask itself is inlined.

These are the relevant cases I can think of

  1. Non TARGET_AMD64 machines - it'll directly return the candidates. This shouldn't have any effect here
  2. On TARGET_AMD64 machines with APX not supported. Returns candidates after the first check. I might be able to avoid doing this check every time by caching compiler->canUseApxEncoding()
  3. On TARGET_AMD64 machines with APX supported.
    Here we determine what to do depending on whether buildNode already has a determined candidate set at this point (for example, there are cases where ecx is the only candidate, and I don't want to return all low GPRs)

if (forceLowGpr || DoesThisUseGPR(tree))
{
    if (candidates == RBM_NONE)
    {
        return lowGPRRegs();
    }
    else
    {
        return (candidates & lowGPRRegs());
    }
}

So, are you okay with caching just compiler->canUseApxEncoding()?

Hopefully I haven't missed something else very obvious
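
The three cases can be sketched end to end as below, assuming the APX check is cached once in a field. `LinearScanSketch`, the `usesGpr` parameter (standing in for `DoesThisUseGPR(tree)`), and the mask constants are hypothetical simplifications, not the PR's actual signature.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t RegSetSketch; // stand-in for SingleTypeRegSet

const RegSetSketch RBM_NONE_SKETCH = 0;
const RegSetSketch LOW_GPRS_SKETCH = 0xFFFFull;  // hypothetical: bits 0..15 are the legacy GPRs
const RegSetSketch RCX_SKETCH      = 1ull << 1;  // hypothetical bit for ecx/rcx
const RegSetSketch R16_SKETCH      = 1ull << 16; // hypothetical bit for the first eGPR

struct LinearScanSketch
{
    // Cached once; stands in for compiler->canUseApxEncoding().
    bool apxEnabled;

    RegSetSketch buildApxIncompatibleGprMask(bool         usesGpr,
                                             RegSetSketch candidates,
                                             bool         forceLowGpr = false) const
    {
        // Cases 1 and 2: no APX, so the candidates are returned untouched.
        if (!apxEnabled)
        {
            return candidates;
        }

        // Case 3: strip eGPRs, but respect an already-narrowed candidate set.
        if (forceLowGpr || usesGpr)
        {
            if (candidates == RBM_NONE_SKETCH)
            {
                return LOW_GPRS_SKETCH;
            }
            RegSetSketch masked = candidates & LOW_GPRS_SKETCH;
            assert((masked & ~LOW_GPRS_SKETCH) == 0); // no eGPR survives the mask
            return masked;
        }
        return candidates;
    }
};
```

Intersecting with the incoming candidates, rather than always returning the full low-GPR set, is what preserves a fixed-register constraint such as "ecx only".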

Member

Might as well cache lowGPRegs() too. But yeah, overall I don't think it will make much difference, given that the method is inlined.

Contributor Author

Cached compiler->canUseApxEncoding(). lowGPRegs() I have left as is for now

Member

lowGPRegs() I have left as is for now

It shouldn't be too hard to also cache it in the LinearScan class, right? It is used many times while building intervals, so it might save a little bit of TP.

Contributor Author

I tried it, but unfortunately it breaks some functionality. getLowGprRegs() (I updated the function name to match the other get functions) uses availableIntRegs

image

There are some cases where availableIntRegs is modified and this breaks code(Eg :

if ((removeMask != RBM_NONE) && ((availableIntRegs & removeMask) != 0))
{
// We know that we're already in "read mode" for availableIntRegs. However,
// we need to remove these registers, so subsequent users (like callers
// to allRegs()) get the right thing. The RemoveRegistersFromMasks() code
// fixes up everything that already took a dependency on the value that was
// previously read, so this completes the picture.
availableIntRegs.OverrideAssign(availableIntRegs & ~removeMask);
}
). So just caching it once when initially set is not enough. That's my reasoning for not caching it
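
The hazard described here can be sketched with a toy allocator in which the low-GPR set is derived from `availableIntRegs`. `AllocatorSketch` and its members are hypothetical; the sketch only illustrates why a value cached at initialization goes stale once registers are removed afterwards.

```cpp
#include <cassert>
#include <cstdint>

const uint64_t LOW_GPRS_MASK_SKETCH = 0xFFFFull; // hypothetical: bits 0..15 are the legacy GPRs

struct AllocatorSketch
{
    uint64_t availableIntRegs;
    uint64_t cachedLowGprs; // snapshot taken once, at construction

    explicit AllocatorSketch(uint64_t available)
        : availableIntRegs(available)
        , cachedLowGprs(available & LOW_GPRS_MASK_SKETCH)
    {
    }

    // Recomputed on every call, so it always reflects the current availability.
    uint64_t getLowGprRegs() const
    {
        return availableIntRegs & LOW_GPRS_MASK_SKETCH;
    }

    // Mirrors the effect of the OverrideAssign call quoted above.
    void removeRegisters(uint64_t removeMask)
    {
        availableIntRegs &= ~removeMask;
    }
};

// Returns true if the construction-time snapshot disagrees with the live value
// after a register has been removed from the available set.
inline bool cacheIsStaleAfterRemoval()
{
    AllocatorSketch allocator(0x00FFFFFFull);
    assert(allocator.cachedLowGprs == allocator.getLowGprRegs()); // in sync initially
    allocator.removeRegisters(1ull << 5);
    return allocator.cachedLowGprs != allocator.getLowGprRegs();
}
```

Under these assumptions, a cache would only be safe if it were refreshed after the last point at which `availableIntRegs` can change.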

Member

hhm, I see that getLowGprRegs() is mainly used while building intervals and the code above is executed right before we start building.

@@ -2960,7 +3114,17 @@ int LinearScan::BuildIndir(GenTreeIndir* indirTree)
else
#endif
{
srcCount += BuildOperandUses(source);
GenTree* data = indirTree->Data();
if (data->isContained() && (data->OperIs(GT_BSWAP, GT_BSWAP16) /* || data->OperIsHWIntrinsic()*/) &&
Member

remove comment?

Contributor Author

Removed

@@ -3136,4 +3347,51 @@ inline SingleTypeRegSet LinearScan::BuildEvexIncompatibleMask(GenTree* tree)
#endif
}

inline bool LinearScan::DoesThisUseGPR(GenTree* op)
Member

Method docs please.

@@ -425,6 +425,7 @@ RELEASE_CONFIG_INTEGER(EnableSSE3_4, "EnableSSE3_4",
RELEASE_CONFIG_INTEGER(EnableSSE41, "EnableSSE41", 1) // Allows SSE4.1+ hardware intrinsics to be disabled
RELEASE_CONFIG_INTEGER(EnableSSE42, "EnableSSE42", 1) // Allows SSE4.2+ hardware intrinsics to be disabled
RELEASE_CONFIG_INTEGER(EnableSSSE3, "EnableSSSE3", 1) // Allows SSSE3+ hardware intrinsics to be disabled
RELEASE_CONFIG_INTEGER(EnableAPX, "EnableAPX", 1) // Allows APX+ features to be disabled
Member

Does APX automatically light-up on an APX capable machine?

I think we need to configure APX as disabled by default, and require a user to "opt-in" to APX support by enabling a configuration parameter. At least until we are able to test thoroughly on actual hardware. In particular, I don't expect we will "get there" (i.e., have hardware and do enough testing on it) to enable APX by default for .NET 10.

Contributor Author

As I understand it, this flag is mostly for testing purposes for 'turning off' features. Since we do not have any testing pipelines currently, I can go ahead and remove this as per Kunal's comment here - #108799 (comment)

Member

I think it's fine to have these configs. But I wonder if the current default should be 0, not 1?

Contributor Author

That's a good question. This config by itself doesn't turn APX on in the 'normal workflow'. It has 2 purposes

Use 1 : Link. It's a knob to turn APX on when using altjit unless otherwise specified (link). Our current design is to turn all available features ON with altjit

Use 2 : A knob to turn APX OFF even on machines supporting APX (link)

So having the default be 1 doesn't affect the functionality on non APX machines in any way. Having it be 0 would mean manually setting it to 1 to run altjit tests. But considering this is a feature in development and altjit is used for testing, it makes sense to have the default as 0 if we keep it

Member

I think it'd be good to get @jkotas weigh in here.

APX is a lot more involved than AVX10.1 was and is a lot of "net new" handling, so the risk is higher. However, I imagine Intel (@DeepakRajendrakumaran to confirm) will be doing local testing (full test suite, stress modes, some important libraries, etc) on actual hardware as was done for other ISAs in the past, which will help build confidence.

We can notably always change the value here in a patch as well, whether from 0->1 or 1->0 and we'd fix any bugs as we normally do otherwise. So it really just comes down to the default experience we want for devs who buy APX capable CPUs on launch day.

Contributor Author

We do not have actual hardware yet obviously but we will be doing extensive testing once we have it

For now, I have done the following with APX ON

  • Ran all tests under JIT subtree with src\tests\build using sde with APX ON
  • Ran superpmi asmdiffs with APX ON to make sure there are no decode fails or asserts as well as seeing perf scores(This is added here)

Member

Just to make sure I understand, will we have some CI jobs that will test the APX code paths whenever they are touched? something with altjit route?

Member

I think it'd be good to get @jkotas weigh in here.

I agree with what @BruceForstall said above.

For altjit, I do not have a strong opinion. I think it would be more intuitive to have it disabled by default so that altjit behavior matches the typical configuration.

Contributor Author

I'm good with that. Will make the change

@@ -43,7 +43,7 @@ inline static bool isHighGPReg(regNumber reg)
#ifdef TARGET_AMD64
// TODO-apx: the definition here is incorrect, we will need to revisit this after we extend the register definition.
// for now, we can simply use REX2 as REX.
return ((reg >= REG_R8) && (reg <= REG_R15));
return ((reg >= REG_R16) && (reg <= REG_R23));
Member

I haven't looked at all the users, but I'm surprised this wouldn't cause diffs. "high GPR" means something for x64 versus x86. Maybe there should be a separate isApxEgpr()?

Contributor Author

This was a method added by Ruihan specifically for APX/Rex2/eGPR(3410c76#diff-782ed843d790c8cf94cba03c0d408a37c64cdba7832dbb5a560f76979355bdd2R41-R51). He initially just used REG_R8 and REG_R15 since we didn't have REG_R16 and above. This code would have been inactive till now

Contributor Author

This comes into play only if/when register allocator selects eGPR and emitter has to encode REX2 for eGPR
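
A rough sketch of how such a predicate might feed prefix selection, assuming the register ranges shown in the diff (R8 to R15 for REX, R16 to R23 for the eGPRs enabled here). The `PrefixKind` enum and `prefixForRegister` are hypothetical, and the sketch looks only at register-extension needs; it ignores operand size and every other reason an instruction may need a prefix.

```cpp
#include <cassert>

// Hypothetical register numbers matching the ranges discussed above.
enum RegSketch : unsigned
{
    REG_RAX = 0,
    REG_R8  = 8,
    REG_R15 = 15,
    REG_R16 = 16,
    REG_R23 = 23,
};

enum PrefixKind
{
    PREFIX_NONE,
    PREFIX_REX,
    PREFIX_REX2,
};

// eGPRs: only reachable through the REX2 (or extended EVEX) encodings.
inline bool isHighGPReg(unsigned reg)
{
    return (reg >= REG_R16) && (reg <= REG_R23);
}

// Classic x64 extended registers: reachable through REX.
inline bool isRexOnlyReg(unsigned reg)
{
    return (reg >= REG_R8) && (reg <= REG_R15);
}

// Picks the smallest prefix that can name the given register.
inline PrefixKind prefixForRegister(unsigned reg)
{
    assert(reg <= REG_R23);
    if (isHighGPReg(reg))
    {
        return PREFIX_REX2;
    }
    if (isRexOnlyReg(reg))
    {
        return PREFIX_REX;
    }
    return PREFIX_NONE;
}
```

Because the REX2 path is taken only when the allocator actually hands out a register in the R16 to R23 range, code compiled without eGPRs is unaffected by the predicate change.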

@@ -10002,6 +10002,11 @@ class Compiler
//
bool canUseApxEncoding() const
{
if (JitConfig.EnableAPX() == 0)
Member

This is not the correct way to implement a config switch for the non-altjit case.

It should be done here instead - similar how other existing instruction sets are handled:

CPUCompileFlags.Set(InstructionSet_APX);

Contributor Author

Thank you. I have made the change
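
The requested structure can be sketched as follows: the config switch gates whether the APX instruction-set flag is set in the first place, and the JIT-side query only reads that flag. Everything here (`CpuCompileFlagsSketch`, `ConfigSketch`, `computeCompileFlags`, the enum values) is a hypothetical simplification, not the runtime's actual code. The default of 0 reflects the conclusion reached elsewhere in this review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical instruction-set identifiers.
enum InstructionSetSketch : unsigned
{
    InstructionSetSketch_AVX2 = 0,
    InstructionSetSketch_APX  = 1,
};

struct CpuCompileFlagsSketch
{
    uint64_t bits = 0;

    void Set(InstructionSetSketch isa)
    {
        bits |= (1ull << isa);
    }

    bool IsSet(InstructionSetSketch isa) const
    {
        return (bits & (1ull << isa)) != 0;
    }
};

struct ConfigSketch
{
    int enableAPX = 0; // disabled by default, opt-in via configuration
};

// Runtime side: hardware support and the config switch are combined once, here.
inline CpuCompileFlagsSketch computeCompileFlags(bool hardwareHasApx, const ConfigSketch& config)
{
    CpuCompileFlagsSketch flags;
    if (hardwareHasApx && (config.enableAPX != 0))
    {
        flags.Set(InstructionSetSketch_APX);
    }
    assert(hardwareHasApx || !flags.IsSet(InstructionSetSketch_APX));
    return flags;
}

// JIT side: no direct config check, only the instruction-set flag is consulted.
inline bool canUseApxEncoding(const CpuCompileFlagsSketch& flags)
{
    return flags.IsSet(InstructionSetSketch_APX);
}
```

Keeping the check in one place means every consumer of the instruction-set flags sees a consistent answer, instead of each query re-reading the config.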

@@ -732,6 +732,7 @@ RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableSSE41, W("EnableSSE41")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableSSE42, W("EnableSSE42"), 1, "Allows SSE4.2+ hardware intrinsics to be disabled")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableSSSE3, W("EnableSSSE3"), 1, "Allows SSSE3+ hardware intrinsics to be disabled")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableX86Serialize, W("EnableX86Serialize"), 1, "Allows X86Serialize+ hardware intrinsics to be disabled")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAPX, W("EnableAPX"), 1, "Allows APX+ features to be disabled")
Member

Suggested change
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAPX, W("EnableAPX"), 1, "Allows APX+ features to be disabled")
RETAIL_CONFIG_DWORD_INFO(EXTERNAL_EnableAPX, W("EnableAPX"), 0, "Allows APX+ features to be disabled")

The APX+ features should be disabled by default in the shipping runtime.

Contributor Author

Missed this originally. Has been changed to 0

@kunalspathak
Member

Can you double check why there is TP difference for arm64?

image

@DeepakRajendrakumaran
Contributor Author

Can you double check why there is TP difference for arm64?

image

It's this commit - commit

Able to verify with this draft PR - link
After reverting the commit
image

@kunalspathak
Member

There are still some left over diffs in arm64 and I really expect that they should be zero.

image

I am guessing they are happening because of using get_RBM_INT_LAST in codegencommon.cpp. You can point #define RBM_INT_LAST get_RBM_INT_LAST for AMD64, and then just use it as RBM_ALL_INT instead. That's how we use it at other places.

@DeepakRajendrakumaran DeepakRajendrakumaran force-pushed the enableeGPR branch 5 times, most recently from 99438b6 to dc3a1e8 Compare January 30, 2025 19:20
@BruceForstall
Member

Looks like no asm diffs now. Still some small x64 TP regression but I assume that is inevitable and expected.

fyi @DeepakRajendrakumaran there are merge conflicts

Labels
apx Related to the Intel Advanced Performance Extensions (APX) area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

7 participants