
Misc LSRA throughput improvements #85842

Merged
merged 15 commits into dotnet:main from the clearAssignedInterval branch on May 10, 2023

Conversation


@kunalspathak kunalspathak commented May 5, 2023

While working on consecutive-registers, I realized a few things that could help throughput:

1. We pass around RegisterType in various methods, but that parameter is only used for TARGET_ARM, so wrap the parameter in `ARM_ARG()` (see the macro sketch after this list). Done separately in #86016.
2. We call updateAssignedInterval() frequently, but more than half the time we pass interval == nullptr, which essentially clears the interval. Introduced clearAssignedInterval() for that purpose.
3. Use BitOperations::PopCount() in a method that is used for the IsSingleRegister() check. Done separately as part of #85944.
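For reference, the usual shape of such a target-conditional parameter macro is below. This is an illustrative sketch only (the method name is made up), not the exact definition in the JIT sources:

#ifdef TARGET_ARM
#define ARM_ARG(x) , x // the parameter exists only in TARGET_ARM builds
#else
#define ARM_ARG(x) // expands to nothing elsewhere
#endif

// Hypothetical signature: 'regType' is a real parameter only on ARM.
// Expands to setRegInUse(regNumber reg, RegisterType regType) on ARM,
// and to setRegInUse(regNumber reg) on every other target.
void setRegInUse(regNumber reg ARM_ARG(RegisterType regType));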

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 5, 2023
@ghost ghost assigned kunalspathak May 5, 2023
@ghost

ghost commented May 5, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

While working on consecutive-registers, I realized a few things that could help throughput:

1. We pass around RegisterType in various methods, but that parameter is only used for TARGET_ARM, so wrap the parameter in `ARM_ARG()`.
2. We call updateAssignedInterval() frequently, but more than half the time we pass interval == nullptr, which essentially clears the interval. Introduced clearAssignedInterval() for that purpose.
3. Use BitOperations::PopCount() in a method that is used for the IsSingleRegister() check.
Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak
Member Author

The TP regression is surprising. I probably need to compare the before-vs-after assembly to see which individual change is causing it.

Base: 538795279, Diff: 549102807, +1.9131%

?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z : 3693050  : +30.40%  : 22.83% : +0.6854%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                             : 3035189  : +29.77%  : 18.76% : +0.5633%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                      : 2998888  : NA       : 18.54% : +0.5566%
?applySelection@RegisterSelection@LinearScan@@AEAA_NH_K@Z                                     : 2059105  : NA       : 12.73% : +0.3822%
??$select@$0A@@RegisterSelection@LinearScan@@QEAA_KPEAVInterval@@PEAVRefPosition@@@Z          : 1195021  : +4.50%   : 7.39%  : +0.2218%
?addRefsForPhysRegMask@LinearScan@@AEAAX_KIW4RefType@@_N@Z                                    : 230276   : +3.90%   : 1.42%  : +0.0427%
?buildInternalRegisterUses@LinearScan@@AEAAXXZ                                                : 28726    : +8.16%   : 0.18%  : +0.0053%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@W4var_types@@@Z         : -2913849 : -100.00% : 18.01% : -0.5408%

I didn't realize that we do not use intrinsics for popcount, which is why we are seeing a lot of regressions. This is yet another example of why cross-compilation comparisons for TP may not always be accurate. cc: @jakobbotsch @BruceForstall

uint32_t BitOperations::PopCount(uint32_t value)
{
#if defined(_MSC_VER)
    // Inspired by the Stanford Bit Twiddling Hacks by Sean Eron Anderson:
    // http://graphics.stanford.edu/~seander/bithacks.html
    const uint32_t c1 = 0x55555555u;
    const uint32_t c2 = 0x33333333u;
    const uint32_t c3 = 0x0F0F0F0Fu;
    const uint32_t c4 = 0x01010101u;

    value -= (value >> 1) & c1;
    value = (value & c2) + ((value >> 2) & c2);
    value = (((value + (value >> 4)) & c3) * c4) >> 24;

    return value;
#else
    int32_t result = __builtin_popcount(value);
    return static_cast<uint32_t>(result);
#endif
}
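For context, the IsSingleRegister-style check this feeds is essentially a popcount-equals-one test. A minimal sketch (the helper name and mask type here are assumptions, not the exact LSRA code):

// Sketch: a register mask contains exactly one register iff its popcount is 1.
bool IsSingleRegister(uint32_t regMask)
{
    return BitOperations::PopCount(regMask) == 1;
}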

newRefPosition:

[screenshot: newRefPosition disassembly]

buildInternalRegisterUses:

[screenshot: buildInternalRegisterUses disassembly]

I will revert the popcount change.

@BruceForstall
Member

I didn't realize that we do not use intrinsics for popcount

It would be worthwhile profiling the current function versus popcount -- probably we should switch to popcount.

@jakobbotsch
Member

I didn't realize that we do not use intrinsics for popcount, which is why we are seeing a lot of regressions. This is yet another example of why cross-compilation comparisons for TP may not always be accurate. cc: @jakobbotsch @BruceForstall

It's a good point, but also a place where we should actively ensure we have parity regardless of the compiler we are using. It seems unfortunate that we are regressing either MSVC produced code or Clang produced code.

@kunalspathak
Member Author


Looking at the diffs for minopts benchmarks_run windows-x64, I see a slight regression:

Base: 538795279, Diff: 538880143, +0.0158%

?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z              : 2998888  : NA       : 50.71% : +0.5566%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@W4var_types@@@Z : -2913849 : -100.00% : 49.27% : -0.5408%

But looking at the code, this should still be profitable, because we are eliminating a condition:

[screenshot: updateAssignedInterval before/after code comparison]
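For reference, the shape of the change is roughly the following. This is an illustrative sketch, not the actual lsra.cpp code, which carries more bookkeeping:

// Before: clearing an interval reused the general update path,
//   updateAssignedInterval(regRecord, nullptr, regType);
// which had to branch on interval == nullptr internally.
// After: a dedicated helper that only does the clearing work.
void LinearScan::clearAssignedInterval(RegRecord* regRecord)
{
    regRecord->assignedInterval = nullptr; // no null-vs-non-null branch needed
}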

@kunalspathak kunalspathak marked this pull request as ready for review May 7, 2023 04:53
@kunalspathak
Member Author

It would be worthwhile profiling the current function versus popcount -- probably we should switch to popcount.

Fixed.

@runfoapp runfoapp bot mentioned this pull request May 8, 2023
@tannergooding
Member

tannergooding commented May 8, 2023

It's a good point, but also a place where we should actively ensure we have parity regardless of the compiler we are using. It seems unfortunate that we are regressing either MSVC produced code or Clang produced code.

GCC/Clang generate the exact code that was codified for MSVC; they just do it implicitly via the builtin (and only switch to emitting an actual popcnt if the ISA switch is passed in).

This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.

@kunalspathak
Member Author

if the ISA switch is passed in

You mean when building clrjit using Clang, right?

This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.

Agreed. By the way, I do see that with VC++, popcnt removes a lot of that code and generates the actual intrinsic, which will be faster too. I will send a separate PR to see the effect of that alone rather than mixing it up with some of the LSRA improvements I am doing here.

@jakobbotsch
Member

This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.

I'm confused, are you saying the MSVC multiplication code was faster than using a single popcnt instruction?

@kunalspathak
Member Author

GCC/Clang generate the exact code that was codified for MSVC

Hmm: https://godbolt.org/z/rEndsvhv6

@tannergooding
Member

I'm confused, are you saying the MSVC multiplication code was faster than using a single popcnt instruction?

@jakobbotsch: No, rather GCC/Clang don't emit popcnt here because our target machine is -msse2. In order for GCC/Clang to emit popcnt, the target machine must be at least -msse4.2. For pre-SSE4.2 targets, they emit the same logic as the multiplication code.

In order for us to emit popcnt here, we'd need to do it "opportunistically" via a cached CPUID check (much as we do for atomic operations on Arm64):

if (supportsPopcnt) // cached result of a one-time CPUID check
{
    return __popcnt(value); // MSVC popcnt intrinsic
}
else
{
    // fall back to the bit-twiddling sequence shown above
}
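A minimal sketch of what that cached check could look like with MSVC (hypothetical helper name; CPUID leaf 1 reports POPCNT support in ECX bit 23):

#include <intrin.h>

// Query CPUID once; leaf 1 puts the feature flags in cpuInfo[2] (ECX).
static bool ComputeSupportsPopcnt()
{
    int cpuInfo[4];
    __cpuid(cpuInfo, 1);
    return (cpuInfo[2] & (1 << 23)) != 0; // ECX bit 23 == POPCNT
}

// Cached at startup so the hot path pays only for a predictable branch.
static const bool supportsPopcnt = ComputeSupportsPopcnt();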

@kunalspathak
Member Author

Latest diffs

[screenshot: latest TP diffs]

@BruceForstall
Member

Odd that it shows x64 as a slight regression.

@kunalspathak kunalspathak merged commit af1de13 into dotnet:main May 10, 2023
@kunalspathak kunalspathak deleted the clearAssignedInterval branch May 10, 2023 05:00
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2023