Skip to content

Commit

Permalink
Align inner loops (#44370)
Browse files Browse the repository at this point in the history
* Detect inner loop and add 10 bytes of padding at the beginning

* generate nop in previous blocks

* TODO: figure out if anything needs to be done in optCanonicalizeLoop

* Add COMPlus_JitAlignLoopMinBlockWeight and COMPlus_JitAlignLoopMaxCodeSize

- Add 2 variables to control which loops get aligned
- Moved padding after the conditional/unconditional jump of previous block

* Reuse AlignLoops flag for dynamic loop alignment

* Detect back edge and count no. of instructions before doing loop alignment

* fix bugs

* propagate the basic block flag

* Switch from instrCount to codeSize

* JitAlignLoopWith32BPadding

* Add emitLoopAlign32Bytes()

* wip

* Add logic to avoid emitting nop if not needed

* fix a condition

* Several things:

- Replaced JitAlignLoopWith32BPadding with JitAlignLoopBoundary
- Added JitAlignLoopForJcc
- Added logging of boundary and point where instruction splitting happpens
- Add logic to take into consideration JCC.

* Added JitAlignLoopAdaptive algorithm

* wip

* revert emitarm64.cpp changes

* fix errors during merge

* fix build errors

* refactoring and cleanup

* refactoring and build errors fix

* jit format

* one more build error

* Add emitLoopAlignAdjustments()

* Update emitLoopAlignAdjustments to just include loopSize calc

* Remove #ifdef ADAPTIVE_LOOP_ALIGNMENT

* Code cleanup

* minor fixes

* Fix issues:
- Make sure all `align` instructions for non-adaptive fall under same IG
- Convert some variables to `unsigned short`
- Fixed the maxPadding amount for adaptive alignment calculation

* Other fixes

* Remove align_loops flag from coreclr

* Review feedback

- Do not align loop if it has call
- Created `emitSetLoopBackEdge()` to isolate `emitCurIG` inside emitter class
- Created `emitOutputAlign()` to move the align instruction output logic
- Renamed emitVariableeLoopAlign() to emitLongLoopAlign()
- Created `optIdentifyLoopsForAlignment()` to identify loops that need alignment
- Added comments at various places

* jit format

* Add FEATURE_LOOP_ALIGN

* remove special case for align

* Do not propagate BBF_LOOP_ALIGN in certain cases

* Introduce instrDescAlign and emitLastAlignedIgNum

* Several changes:

- Perform accurate padding size before outputting align instruction
- During outputting, just double check if the padding needed matches to what was calculated.
- If at any time, instruction sizes are over-estimated before the last align instruction,
  then compensate them by adding NOP.
- As part of above step, do not perform encoding "VEX prefix shortening" if there is align
  instruction in future.
- Fix edge cases where because of loop cloning or resolution phase of register allocator, the
  loops are marked such that they cover the loops that are already mark for alignment. Fix by
  resetting their IGF_LOOP_ALIGN flag.
- During loop size calculation, if the last IG also has `align` flag, then do not take into account
  the align instruction's size because they are reserved for the next loop.

* jit format

* fix issue related to needLabel

* align memory correctly in superpmi

* Few more fixes:

- emitOffsAdj takes into account for any mis-prediction of jump. If we compensate that mis-prediction, that off that adjustment.
- Record the lastAlignIG only for valid non-zero align instructions

* minor JITDUMP messages

* Review comments

* missing check

* Mark the last align IG the one that has non-zero padding

* More review comments

* Propagate BBF_LOOP_ALIGN for compacting blocks

* Handle ALIGN_LOOP flag for loops that are unrolled

* jit format

* Loop size upto last back-edge instead of first back-edge

* Take loop weight in consideration

* remove align flag if loop is no longer valid

* Adjust loop block weight to 4 instead of 8

* missing space after rebase

* fix the enum values after rebase

* review feedback

* Add missing #ifdef DEBUG
  • Loading branch information
kunalspathak authored Jan 12, 2021
1 parent f9cf601 commit 16c48d6
Show file tree
Hide file tree
Showing 24 changed files with 1,157 additions and 135 deletions.
16 changes: 15 additions & 1 deletion src/coreclr/ToolBox/superpmi/superpmi/icorjitinfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1609,7 +1609,21 @@ void MyICJI::allocMem(ULONG hotCodeSize, /* IN */
jitInstance->mc->cr->AddCall("allocMem");

// TODO-Cleanup: Could hot block size be ever 0?
*hotCodeBlock = jitInstance->mc->cr->allocateMemory(hotCodeSize);
size_t codeAlignment = sizeof(void*);
size_t hotCodeAlignedSize = static_cast<size_t>(hotCodeSize);

if ((flag & CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN) != 0)
{
codeAlignment = 32;
}
else if ((flag & CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN) != 0)
{
codeAlignment = 16;
}
hotCodeAlignedSize = ALIGN_UP_SPMI(hotCodeAlignedSize, codeAlignment);
hotCodeAlignedSize = hotCodeAlignedSize + (codeAlignment - sizeof(void*));
*hotCodeBlock = jitInstance->mc->cr->allocateMemory(hotCodeAlignedSize);
*hotCodeBlock = ALIGN_UP_SPMI(*hotCodeBlock, codeAlignment);

if (coldCodeSize > 0)
*coldCodeBlock = jitInstance->mc->cr->allocateMemory(coldCodeSize);
Expand Down
1 change: 0 additions & 1 deletion src/coreclr/inc/clrconfigvalues.h
Original file line number Diff line number Diff line change
Expand Up @@ -302,7 +302,6 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_UseIBCFile, W("UseIBCFile"), 0, "", CLRConf
///
/// JIT
///
RETAIL_CONFIG_DWORD_INFO_DIRECT_ACCESS(UNSUPPORTED_JitAlignLoops, W("JitAlignLoops"), "Aligns loop targets to 8 byte boundaries")
CONFIG_DWORD_INFO_EX(INTERNAL_JitBreakEmit, W("JitBreakEmit"), (DWORD)-1, "", CLRConfig::EEConfig_default)
CONFIG_DWORD_INFO_DIRECT_ACCESS(INTERNAL_JitDebuggable, W("JitDebuggable"), "")
#if !defined(DEBUG) && !defined(_DEBUG)
Expand Down
50 changes: 25 additions & 25 deletions src/coreclr/inc/corjitflags.h
Original file line number Diff line number Diff line change
Expand Up @@ -79,45 +79,45 @@ class CORJIT_FLAGS
CORJIT_FLAG_BBINSTR = 29, // Collect basic block profile information
CORJIT_FLAG_BBOPT = 30, // Optimize method based on profile information
CORJIT_FLAG_FRAMED = 31, // All methods have an EBP frame
CORJIT_FLAG_ALIGN_LOOPS = 32, // add NOPs before loops to align them at 16 byte boundaries
CORJIT_FLAG_UNUSED12 = 32,
CORJIT_FLAG_PUBLISH_SECRET_PARAM = 33, // JIT must place stub secret param into local 0. (used by IL stubs)
CORJIT_FLAG_UNUSED12 = 34,
CORJIT_FLAG_UNUSED13 = 34,
CORJIT_FLAG_SAMPLING_JIT_BACKGROUND = 35, // JIT is being invoked as a result of stack sampling for hot methods in the background
CORJIT_FLAG_USE_PINVOKE_HELPERS = 36, // The JIT should use the PINVOKE_{BEGIN,END} helpers instead of emitting inline transitions
CORJIT_FLAG_REVERSE_PINVOKE = 37, // The JIT should insert REVERSE_PINVOKE_{ENTER,EXIT} helpers into method prolog/epilog
CORJIT_FLAG_UNUSED13 = 38,
CORJIT_FLAG_UNUSED14 = 38,
CORJIT_FLAG_TIER0 = 39, // This is the initial tier for tiered compilation which should generate code as quickly as possible
CORJIT_FLAG_TIER1 = 40, // This is the final tier (for now) for tiered compilation which should generate high quality code

#if defined(TARGET_ARM)
CORJIT_FLAG_RELATIVE_CODE_RELOCS = 41, // JIT should generate PC-relative address computations instead of EE relocation records
#else // !defined(TARGET_ARM)
CORJIT_FLAG_UNUSED14 = 41,
CORJIT_FLAG_UNUSED15 = 41,
#endif // !defined(TARGET_ARM)

CORJIT_FLAG_NO_INLINING = 42, // JIT should not inline any called method into this method

CORJIT_FLAG_UNUSED15 = 43,
CORJIT_FLAG_UNUSED16 = 44,
CORJIT_FLAG_UNUSED17 = 45,
CORJIT_FLAG_UNUSED18 = 46,
CORJIT_FLAG_UNUSED19 = 47,
CORJIT_FLAG_UNUSED20 = 48,
CORJIT_FLAG_UNUSED21 = 49,
CORJIT_FLAG_UNUSED22 = 50,
CORJIT_FLAG_UNUSED23 = 51,
CORJIT_FLAG_UNUSED24 = 52,
CORJIT_FLAG_UNUSED25 = 53,
CORJIT_FLAG_UNUSED26 = 54,
CORJIT_FLAG_UNUSED27 = 55,
CORJIT_FLAG_UNUSED28 = 56,
CORJIT_FLAG_UNUSED29 = 57,
CORJIT_FLAG_UNUSED30 = 58,
CORJIT_FLAG_UNUSED31 = 59,
CORJIT_FLAG_UNUSED32 = 60,
CORJIT_FLAG_UNUSED33 = 61,
CORJIT_FLAG_UNUSED34 = 62,
CORJIT_FLAG_UNUSED35 = 63
CORJIT_FLAG_UNUSED16 = 43,
CORJIT_FLAG_UNUSED17 = 44,
CORJIT_FLAG_UNUSED18 = 45,
CORJIT_FLAG_UNUSED19 = 46,
CORJIT_FLAG_UNUSED20 = 47,
CORJIT_FLAG_UNUSED21 = 48,
CORJIT_FLAG_UNUSED22 = 49,
CORJIT_FLAG_UNUSED23 = 50,
CORJIT_FLAG_UNUSED24 = 51,
CORJIT_FLAG_UNUSED25 = 52,
CORJIT_FLAG_UNUSED26 = 53,
CORJIT_FLAG_UNUSED27 = 54,
CORJIT_FLAG_UNUSED28 = 55,
CORJIT_FLAG_UNUSED29 = 56,
CORJIT_FLAG_UNUSED30 = 57,
CORJIT_FLAG_UNUSED31 = 58,
CORJIT_FLAG_UNUSED32 = 59,
CORJIT_FLAG_UNUSED33 = 60,
CORJIT_FLAG_UNUSED34 = 61,
CORJIT_FLAG_UNUSED35 = 62,
CORJIT_FLAG_UNUSED36 = 63
};

CORJIT_FLAGS()
Expand Down
10 changes: 5 additions & 5 deletions src/coreclr/inc/jiteeversionguid.h
Original file line number Diff line number Diff line change
Expand Up @@ -31,11 +31,11 @@
//
//////////////////////////////////////////////////////////////////////////////////////////////////////////

constexpr GUID JITEEVersionIdentifier = { /* 8e32c24d-62fe-4d78-ae73-eedddb928ee2 */
0x8e32c24d,
0x62fe,
0x4d78,
{0xae, 0x73, 0xee, 0xdd, 0xdb, 0x92, 0x8e, 0xe2}
constexpr GUID JITEEVersionIdentifier = { /* de81f48e-7701-45f2-a91b-1914f88dfd11 */
0xde81f48e,
0x7701,
0x45f2,
{0xa9, 0x1b, 0x19, 0x14, 0xf8, 0x8d, 0xfd, 0x11}
};

//////////////////////////////////////////////////////////////////////////////////////////////////////////
Expand Down
4 changes: 4 additions & 0 deletions src/coreclr/jit/block.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -505,6 +505,10 @@ void BasicBlock::dspFlags()
{
printf("cfe ");
}
if (bbFlags & BBF_LOOP_ALIGN)
{
printf("align ");
}
}

/*****************************************************************************
Expand Down
5 changes: 5 additions & 0 deletions src/coreclr/jit/block.h
Original file line number Diff line number Diff line change
Expand Up @@ -448,6 +448,7 @@ struct BasicBlock : private LIR::Range

#define BBF_PATCHPOINT MAKE_BBFLAG(36) // Block is a patchpoint
#define BBF_HAS_CLASS_PROFILE MAKE_BBFLAG(37) // BB contains a call needing a class profile
#define BBF_LOOP_ALIGN MAKE_BBFLAG(39) // Block is lexically the first block in a loop we intend to align.

// clang-format on

Expand All @@ -463,6 +464,10 @@ struct BasicBlock : private LIR::Range
{
return ((bbFlags & BBF_LOOP_HEAD) != 0);
}
bool isLoopAlign() const
{
return ((bbFlags & BBF_LOOP_ALIGN) != 0);
}

// Flags to update when two blocks are compacted

Expand Down
14 changes: 11 additions & 3 deletions src/coreclr/jit/codegencommon.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2258,6 +2258,12 @@ void CodeGen::genGenerateMachineCode()

GetEmitter()->emitJumpDistBind();

#if FEATURE_LOOP_ALIGN
/* Perform alignment adjustments */

GetEmitter()->emitLoopAlignAdjustments();
#endif

/* The code is now complete and final; it should not change after this. */
}

Expand Down Expand Up @@ -2345,10 +2351,12 @@ void CodeGen::genEmitMachineCode()
#ifdef DEBUG
if (compiler->opts.disAsm || verbose)
{
printf("\n; Total bytes of code %d, prolog size %d, PerfScore %.2f, instruction count %d (MethodHash=%08x) for "
printf("\n; Total bytes of code %d, prolog size %d, PerfScore %.2f, instruction count %d, allocated bytes for "
"code %d (MethodHash=%08x) for "
"method %s\n",
codeSize, prologSize, compiler->info.compPerfScore, instrCount, compiler->info.compMethodHash(),
compiler->info.compFullName);
codeSize, prologSize, compiler->info.compPerfScore, instrCount,
GetEmitter()->emitTotalHotCodeSize + GetEmitter()->emitTotalColdCodeSize,
compiler->info.compMethodHash(), compiler->info.compFullName);
printf("; ============================================================\n\n");
printf(""); // in our logic this causes a flush
}
Expand Down
61 changes: 49 additions & 12 deletions src/coreclr/jit/codegenlinear.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -311,13 +311,6 @@ void CodeGen::genCodeForBBlist()

genUpdateCurrentFunclet(block);

#ifdef TARGET_XARCH
if (ShouldAlignLoops() && block->bbFlags & BBF_LOOP_HEAD)
{
GetEmitter()->emitLoopAlign();
}
#endif

genLogLabel(block);

// Tell everyone which basic block we're working on
Expand Down Expand Up @@ -356,6 +349,14 @@ void CodeGen::genCodeForBBlist()
needLabel = true;
}

#if FEATURE_LOOP_ALIGN
if (GetEmitter()->emitEndsWithAlignInstr())
{
// we had better be planning on starting a new IG
assert(needLabel);
}
#endif

if (needLabel)
{
// Mark a label and update the current set of live GC refs
Expand Down Expand Up @@ -667,10 +668,6 @@ void CodeGen::genCodeForBBlist()

switch (block->bbJumpKind)
{
case BBJ_ALWAYS:
inst_JMP(EJ_jmp, block->bbJumpDest);
break;

case BBJ_RETURN:
genExitCode(block);
break;
Expand Down Expand Up @@ -741,15 +738,55 @@ void CodeGen::genCodeForBBlist()
#endif // !FEATURE_EH_FUNCLETS

case BBJ_NONE:
case BBJ_COND:
case BBJ_SWITCH:
break;

case BBJ_ALWAYS:
inst_JMP(EJ_jmp, block->bbJumpDest);
FALLTHROUGH;

case BBJ_COND:

#if FEATURE_LOOP_ALIGN
// This is the last place where we operate on blocks and after this, we operate
// on IG. Hence, if we know that the destination of "block" is the first block
// of a loop and needs alignment (it has BBF_LOOP_ALIGN), then "block" represents
// end of the loop. Propagate that information on the IG through "igLoopBackEdge".
//
// During emitter, this information will be used to calculate the loop size.
// Depending on the loop size, decision of whether to align a loop or not will be taken.

if (block->bbJumpDest->isLoopAlign())
{
GetEmitter()->emitSetLoopBackEdge(block->bbJumpDest);
}
#endif
break;

default:
noway_assert(!"Unexpected bbJumpKind");
break;
}

#if FEATURE_LOOP_ALIGN

// If next block is the first block of a loop (identified by BBF_LOOP_ALIGN),
// then need to add align instruction in current "block". Also mark the
// corresponding IG with IGF_LOOP_ALIGN to know that there will be align
// instructions at the end of that IG.
//
// For non-adaptive alignment, add alignment instruction of size depending on the
// compJitAlignLoopBoundary.
// For adaptive alignment, alignment instruction will always be of 15 bytes.

if ((block->bbNext != nullptr) && (block->bbNext->isLoopAlign()))
{
assert(ShouldAlignLoops());

GetEmitter()->emitLoopAlignment();
}
#endif

#if defined(DEBUG) && defined(USING_VARIABLE_LIVE_RANGE)
if (compiler->verbose)
{
Expand Down
39 changes: 30 additions & 9 deletions src/coreclr/jit/compiler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2308,7 +2308,7 @@ void Compiler::compSetProcessor()
opts.compUseCMOV = jitFlags.IsSet(JitFlags::JIT_FLAG_USE_CMOV);
#ifdef DEBUG
if (opts.compUseCMOV)
opts.compUseCMOV = !compStressCompile(STRESS_USE_CMOV, 50);
opts.compUseCMOV = !compStressCompile(STRESS_USE_CMOV, 50);
#endif // DEBUG

#endif // TARGET_X86
Expand Down Expand Up @@ -2615,6 +2615,29 @@ void Compiler::compInitOptions(JitFlags* jitFlags)
opts.compDbgInfo = jitFlags->IsSet(JitFlags::JIT_FLAG_DEBUG_INFO);
opts.compDbgEnC = jitFlags->IsSet(JitFlags::JIT_FLAG_DEBUG_EnC);

#ifdef DEBUG
opts.compJitAlignLoopAdaptive = JitConfig.JitAlignLoopAdaptive() == 1;
opts.compJitAlignLoopBoundary = (unsigned short)JitConfig.JitAlignLoopBoundary();
opts.compJitAlignLoopMinBlockWeight = (unsigned short)JitConfig.JitAlignLoopMinBlockWeight();

opts.compJitAlignLoopForJcc = JitConfig.JitAlignLoopForJcc() == 1;
opts.compJitAlignLoopMaxCodeSize = (unsigned short)JitConfig.JitAlignLoopMaxCodeSize();
#else
opts.compJitAlignLoopAdaptive = true;
opts.compJitAlignLoopBoundary = DEFAULT_ALIGN_LOOP_BOUNDARY;
opts.compJitAlignLoopMinBlockWeight = DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT;
#endif
if (opts.compJitAlignLoopAdaptive)
{
opts.compJitAlignPaddingLimit = (opts.compJitAlignLoopBoundary >> 1) - 1;
}
else
{
opts.compJitAlignPaddingLimit = opts.compJitAlignLoopBoundary - 1;
}

assert(isPow2(opts.compJitAlignLoopBoundary));

#if REGEN_SHORTCUTS || REGEN_CALLPAT
// We never want to have debugging enabled when regenerating GC encoding patterns
opts.compDbgCode = false;
Expand Down Expand Up @@ -3913,19 +3936,17 @@ void Compiler::compSetOptimizationLevel()
codeGen->setFrameRequired(true);
#endif

if (opts.jitFlags->IsSet(JitFlags::JIT_FLAG_RELOC))
if (opts.jitFlags->IsSet(JitFlags::JIT_FLAG_PREJIT))
{
codeGen->SetAlignLoops(false); // loop alignment not supported for prejitted code

// The zapper doesn't set JitFlags::JIT_FLAG_ALIGN_LOOPS, and there is
// no reason for it to set it as the JIT doesn't currently support loop alignment
// for prejitted images. (The JIT doesn't know the final address of the code, hence
// The JIT doesn't currently support loop alignment for prejitted images.
// (The JIT doesn't know the final address of the code, hence
// it can't align code based on unknown addresses.)
assert(!opts.jitFlags->IsSet(JitFlags::JIT_FLAG_ALIGN_LOOPS));

codeGen->SetAlignLoops(false); // loop alignment not supported for prejitted code
}
else
{
codeGen->SetAlignLoops(opts.jitFlags->IsSet(JitFlags::JIT_FLAG_ALIGN_LOOPS));
codeGen->SetAlignLoops(JitConfig.JitAlignLoops() == 1);
}
}

Expand Down
39 changes: 39 additions & 0 deletions src/coreclr/jit/compiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -6367,6 +6367,8 @@ class Compiler

void optFindNaturalLoops();

void optIdentifyLoopsForAlignment();

// Ensures that all the loops in the loop nest rooted at "loopInd" (an index into the loop table) are 'canonical' --
// each loop has a unique "top." Returns "true" iff the flowgraph has been modified.
bool optCanonicalizeLoopNest(unsigned char loopInd);
Expand Down Expand Up @@ -9036,6 +9038,43 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
bool dspGCtbls; // Display the GC tables
#endif

// Default numbers used to perform loop alignment. All the numbers are choosen
// based on experimenting with various benchmarks.

// Default minimum loop block weight required to enable loop alignment.
#define DEFAULT_ALIGN_LOOP_MIN_BLOCK_WEIGHT 4

// By default a loop will be aligned at 32B address boundary to get better
// performance as per architecture manuals.
#define DEFAULT_ALIGN_LOOP_BOUNDARY 0x20

// For non-adaptive loop alignment, by default, only align a loop whose size is
// at most 3 times the alignment block size. If the loop is bigger than that, it is most
// likely complicated enough that loop alignment will not impact performance.
#define DEFAULT_MAX_LOOPSIZE_FOR_ALIGN DEFAULT_ALIGN_LOOP_BOUNDARY * 3

#ifdef DEBUG
// Loop alignment variables

// If set, for non-adaptive alignment, ensure loop jmps are not on or cross alignment boundary.
bool compJitAlignLoopForJcc;
#endif
// For non-adaptive alignment, minimum loop size (in bytes) for which alignment will be done.
unsigned short compJitAlignLoopMaxCodeSize;

// Minimum weight needed for the first block of a loop to make it a candidate for alignment.
unsigned short compJitAlignLoopMinBlockWeight;

// For non-adaptive alignment, address boundary (power of 2) at which loop alignment should
// be done. By default, 32B.
unsigned short compJitAlignLoopBoundary;

// Padding limit to align a loop.
unsigned short compJitAlignPaddingLimit;

// If set, perform adaptive loop alignment that limits number of padding based on loop size.
bool compJitAlignLoopAdaptive;

#ifdef LATE_DISASM
bool doLateDisasm; // Run the late disassembler
#endif // LATE_DISASM
Expand Down
Loading

0 comments on commit 16c48d6

Please sign in to comment.