
Misc LSRA throughput improvements #85842

Merged
merged 15 commits into dotnet:main from the clearAssignedInterval branch on May 10, 2023

Conversation


@kunalspathak kunalspathak commented May 5, 2023

While working on consecutive-registers, I realized a few things that could help throughput:

1. We pass around RegisterType in various methods, but that parameter is only used for TARGET_ARM, so wrap the parameter in `ARM_ARG()` (see the macro sketch after this list). Done separately in #86016.
2. We call updateAssignedInterval() frequently, but more than half the time we pass interval == nullptr, which essentially clears the interval. Introduced clearAssignedInterval() for that purpose.
3. Use BitOperations::PopCount() in a method that is used for the IsSingleRegister() check. Done separately as part of #85944.
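For reference, the usual shape of such a target-conditional parameter macro is below. This is an illustrative sketch only (the method name is made up), not the exact definition in the JIT sources:

#ifdef TARGET_ARM
#define ARM_ARG(x) , x // the parameter exists only in TARGET_ARM builds
#else
#define ARM_ARG(x) // expands to nothing elsewhere
#endif

// Hypothetical signature: 'regType' is a real parameter only on ARM.
// Expands to setRegInUse(regNumber reg, RegisterType regType) on ARM,
// and to setRegInUse(regNumber reg) on every other target.
void setRegInUse(regNumber reg ARM_ARG(RegisterType regType));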

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 5, 2023
@ghost ghost assigned kunalspathak May 5, 2023
@ghost

ghost commented May 5, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

While working on consecutive-registers, I realized a few things that could help throughput:

1. We pass around RegisterType in various methods, but that parameter is only used for TARGET_ARM, so wrap the parameter in `ARM_ARG()`.
2. We call updateAssignedInterval() frequently, but more than half the time we pass interval == nullptr, which essentially clears the interval. Introduced clearAssignedInterval() for that purpose.
3. Use BitOperations::PopCount() in a method that is used for the IsSingleRegister() check.
Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak
Member Author

The TP regression is surprising. I probably need to compare the before-vs-after assembly to see which individual change is causing it.

Base: 538795279, Diff: 549102807, +1.9131%

?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z : 3693050  : +30.40%  : 22.83% : +0.6854%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                             : 3035189  : +29.77%  : 18.76% : +0.5633%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                      : 2998888  : NA       : 18.54% : +0.5566%
?applySelection@RegisterSelection@LinearScan@@AEAA_NH_K@Z                                     : 2059105  : NA       : 12.73% : +0.3822%
??$select@$0A@@RegisterSelection@LinearScan@@QEAA_KPEAVInterval@@PEAVRefPosition@@@Z          : 1195021  : +4.50%   : 7.39%  : +0.2218%
?addRefsForPhysRegMask@LinearScan@@AEAAX_KIW4RefType@@_N@Z                                    : 230276   : +3.90%   : 1.42%  : +0.0427%
?buildInternalRegisterUses@LinearScan@@AEAAXXZ                                                : 28726    : +8.16%   : 0.18%  : +0.0053%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@W4var_types@@@Z         : -2913849 : -100.00% : 18.01% : -0.5408%

I didn't realize that we do not use intrinsics for popcount, which is why we are seeing a lot of regressions. This is yet another example of why cross-compilation comparisons for TP may not always be accurate. cc: @jakobbotsch @BruceForstall

uint32_t BitOperations::PopCount(uint32_t value)
{
#if defined(_MSC_VER)
    // Inspired by the Stanford Bit Twiddling Hacks by Sean Eron Anderson:
    // http://graphics.stanford.edu/~seander/bithacks.html
    const uint32_t c1 = 0x55555555u;
    const uint32_t c2 = 0x33333333u;
    const uint32_t c3 = 0x0F0F0F0Fu;
    const uint32_t c4 = 0x01010101u;

    value -= (value >> 1) & c1;
    value = (value & c2) + ((value >> 2) & c2);
    value = (((value + (value >> 4)) & c3) * c4) >> 24;

    return value;
#else
    int32_t result = __builtin_popcount(value);
    return static_cast<uint32_t>(result);
#endif
}
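For context, the IsSingleRegister-style check this feeds is essentially a popcount-equals-one test. A minimal sketch (the helper name and mask type here are assumptions, not the exact LSRA code):

// Sketch: a register mask contains exactly one register iff its popcount is 1.
bool IsSingleRegister(uint32_t regMask)
{
    return BitOperations::PopCount(regMask) == 1;
}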

newRefPosition:

[screenshot: newRefPosition disassembly]

buildInternalRegisterUses:

[screenshot: buildInternalRegisterUses disassembly]

I will revert the popcount change.

@BruceForstall
Member

I didn't realize that we do not use intrinsics for popcount

It would be worthwhile profiling the current function versus popcount -- probably we should switch to popcount.

@jakobbotsch
Member

I didn't realize that we do not use intrinsics for popcount, which is why we are seeing a lot of regressions. This is yet another example of why cross-compilation comparisons for TP may not always be accurate. cc: @jakobbotsch @BruceForstall

It's a good point, but also a place where we should actively ensure we have parity regardless of the compiler we are using. It seems unfortunate that we are regressing either MSVC produced code or Clang produced code.

@kunalspathak
Member Author


Looking at the diffs for minopts benchmarks_run windows-x64, I see a slight regression:

Base: 538795279, Diff: 538880143, +0.0158%

?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z              : 2998888  : NA       : 50.71% : +0.5566%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@W4var_types@@@Z : -2913849 : -100.00% : 49.27% : -0.5408%

But looking at the code, this should still be profitable, because we are eliminating a condition:

[screenshot: updateAssignedInterval before/after code comparison]
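For reference, the shape of the change is roughly the following. This is an illustrative sketch, not the actual lsra.cpp code, which carries more bookkeeping:

// Before: clearing an interval reused the general update path,
//   updateAssignedInterval(regRecord, nullptr, regType);
// which had to branch on interval == nullptr internally.
// After: a dedicated helper that only does the clearing work.
void LinearScan::clearAssignedInterval(RegRecord* regRecord)
{
    regRecord->assignedInterval = nullptr; // no null-vs-non-null branch needed
}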

@kunalspathak kunalspathak marked this pull request as ready for review May 7, 2023 04:53
@kunalspathak
Member Author

It would be worthwhile profiling the current function versus popcount -- probably we should switch to popcount.

Fixed.

@runfoapp runfoapp bot mentioned this pull request May 8, 2023
@tannergooding
Member

tannergooding commented May 8, 2023

It's a good point, but also a place where we should actively ensure we have parity regardless of the compiler we are using. It seems unfortunate that we are regressing either MSVC produced code or Clang produced code.

GCC/Clang generate the exact code that was codified for MSVC; they just do it implicitly via the builtin (and only switch to emitting an actual popcnt if the ISA switch is passed in).

This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.

@kunalspathak
Member Author

if the ISA switch is passed in

You mean when building clrjit using Clang, right?

This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.

Agreed. By the way, I do see that with VC++, popcnt removes a lot of that code and generates the actual intrinsic, which will be faster too. I will send a separate PR to see the effect of that alone rather than mixing it up with some of the LSRA improvements I am doing here.

@jakobbotsch
Member

This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.

I'm confused, are you saying the MSVC multiplication code was faster than using a single popcnt instruction?

@kunalspathak
Member Author

GCC/Clang generate the exact code that was codified for MSVC

Hmm: https://godbolt.org/z/rEndsvhv6

@tannergooding
Member

I'm confused, are you saying the MSVC multiplication code was faster than using a single popcnt instruction?

@jakobbotsch: No, rather GCC/Clang don't emit popcnt here because our target machine is -msse2. In order for GCC/Clang to emit popcnt, the target machine must be at least -msse4.2. For pre-SSE4.2 targets, they emit the same logic as the multiplication code.

In order for us to emit popcnt here, we'd need to do it "opportunistically" via a cached CPUID check (much as we do for atomic operations on Arm64):

if (supportsPopcnt) // cached result of a one-time CPUID check
{
    return __popcnt(value); // MSVC popcnt intrinsic
}
else
{
    // fall back to the bit-twiddling sequence shown above
}
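A minimal sketch of what that cached check could look like with MSVC (hypothetical helper name; CPUID leaf 1 reports POPCNT support in ECX bit 23):

#include <intrin.h>

// Query CPUID once; leaf 1 puts the feature flags in cpuInfo[2] (ECX).
static bool ComputeSupportsPopcnt()
{
    int cpuInfo[4];
    __cpuid(cpuInfo, 1);
    return (cpuInfo[2] & (1 << 23)) != 0; // ECX bit 23 == POPCNT
}

// Cached at startup so the hot path pays only for a predictable branch.
static const bool supportsPopcnt = ComputeSupportsPopcnt();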

@kunalspathak
Member Author

Latest diffs

[screenshot: latest TP diffs]

@BruceForstall
Member

Odd that it shows x64 as a slight regression.

@kunalspathak kunalspathak merged commit af1de13 into dotnet:main May 10, 2023
@kunalspathak kunalspathak deleted the clearAssignedInterval branch May 10, 2023 05:00
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2023