Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BSF and BSR fallbacks for BitOperations methods #34550

Merged
merged 14 commits into from
Apr 29, 2020
Merged

Add BSF and BSR fallbacks for BitOperations methods #34550

merged 14 commits into from
Apr 29, 2020

Conversation

saucecontrol
Copy link
Member

@saucecontrol saucecontrol commented Apr 4, 2020

Testing the assertion @tannergooding made in #953 (comment) that adding x86 intrinsics is as easy as 1,2,3,4,5,6,7 😄

I wanted to get feedback on the introduction of an X86Base ISA to mirror what we have on ARM, and an X86Base class in S.R.I to house the methods for new instructions. I recall @jkotas mentioning the possibility of exposing some of the x86 base instructions as internal HWIntrinsics that could be used from higher-level methods, although I can't seem to find the discussion now.

I've added HWIntrinsics implementations of BSF and BSR and used them as fallbacks in some of the BitOperations methods that use LZCNT and TZCNT. Since BMI1 requires VEX encoding, it won't be supported on older processors or newer non-VEX processors like the Intel Atom line. The fallbacks will also improve codegen in R2R images, where VEX encoding is disabled.

Benchmark results with BMI1 and LZCNT disabled:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.300-preview-015048
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-XPOYMR : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-ROWHAK : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
LeadingZeroCount_uint \corerun-bsf-bsr\CoreRun.exe 1,176.0 ns 8.28 ns 7.74 ns 1,176.4 ns 1,161.6 ns 1,188.7 ns 0.30 0.01 - - - -
LeadingZeroCount_uint \corerun-master\CoreRun.exe 3,867.1 ns 172.27 ns 198.39 ns 3,855.8 ns 3,509.7 ns 4,214.0 ns 1.00 0.00 - - - -
LeadingZeroCount_ulong \corerun-bsf-bsr\CoreRun.exe 1,156.3 ns 22.38 ns 23.94 ns 1,158.5 ns 1,115.4 ns 1,204.8 ns 0.36 0.01 - - - -
LeadingZeroCount_ulong \corerun-master\CoreRun.exe 3,185.6 ns 73.96 ns 82.21 ns 3,178.5 ns 3,091.1 ns 3,362.5 ns 1.00 0.00 - - - -
Log2_uint \corerun-bsf-bsr\CoreRun.exe 883.6 ns 9.93 ns 9.29 ns 880.6 ns 871.2 ns 903.2 ns 0.31 0.01 - - - -
Log2_uint \corerun-master\CoreRun.exe 2,806.4 ns 44.82 ns 41.93 ns 2,807.6 ns 2,730.5 ns 2,872.6 ns 1.00 0.00 - - - -
Log2_ulong \corerun-bsf-bsr\CoreRun.exe 875.5 ns 12.38 ns 11.58 ns 876.3 ns 856.2 ns 896.4 ns 0.22 0.00 - - - -
Log2_ulong \corerun-master\CoreRun.exe 4,058.1 ns 42.23 ns 39.51 ns 4,059.0 ns 4,001.2 ns 4,138.1 ns 1.00 0.00 - - - -
TrailingZeroCount_uint \corerun-bsf-bsr\CoreRun.exe 885.0 ns 8.74 ns 7.30 ns 885.3 ns 874.4 ns 897.3 ns 0.52 0.08 - - - -
TrailingZeroCount_uint \corerun-master\CoreRun.exe 1,710.8 ns 206.28 ns 237.55 ns 1,607.4 ns 1,509.7 ns 2,174.5 ns 1.00 0.00 - - - -
TrailingZeroCount_ulong \corerun-bsf-bsr\CoreRun.exe 1,030.8 ns 19.25 ns 19.76 ns 1,036.3 ns 984.6 ns 1,061.7 ns 0.69 0.02 - - - -
TrailingZeroCount_ulong \corerun-master\CoreRun.exe 1,484.5 ns 45.58 ns 52.49 ns 1,470.6 ns 1,425.7 ns 1,625.9 ns 1.00 0.00 - - - -

And enabled:

Method Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
LeadingZeroCount_uint \corerun-bsf-bsr\CoreRun.exe 632.8 ns 8.39 ns 7.85 ns 630.3 ns 624.2 ns 649.7 ns 0.99 0.02 - - - -
LeadingZeroCount_uint \corerun-master\CoreRun.exe 636.9 ns 8.66 ns 8.10 ns 636.2 ns 623.9 ns 653.7 ns 1.00 0.00 - - - -
LeadingZeroCount_ulong \corerun-bsf-bsr\CoreRun.exe 634.0 ns 8.36 ns 7.82 ns 632.2 ns 624.7 ns 649.9 ns 1.00 0.01 - - - -
LeadingZeroCount_ulong \corerun-master\CoreRun.exe 631.2 ns 8.40 ns 7.86 ns 629.3 ns 616.1 ns 643.2 ns 1.00 0.00 - - - -
Log2_uint \corerun-bsf-bsr\CoreRun.exe 729.5 ns 14.35 ns 12.72 ns 728.4 ns 711.4 ns 751.3 ns 0.70 0.05 - - - -
Log2_uint \corerun-master\CoreRun.exe 1,021.8 ns 62.09 ns 71.51 ns 1,005.0 ns 926.7 ns 1,175.7 ns 1.00 0.00 - - - -
Log2_ulong \corerun-bsf-bsr\CoreRun.exe 735.5 ns 7.79 ns 7.29 ns 737.1 ns 720.3 ns 748.9 ns 0.42 0.01 - - - -
Log2_ulong \corerun-master\CoreRun.exe 1,743.7 ns 33.28 ns 34.18 ns 1,739.0 ns 1,684.3 ns 1,821.5 ns 1.00 0.00 - - - -
TrailingZeroCount_uint \corerun-bsf-bsr\CoreRun.exe 626.5 ns 9.07 ns 8.48 ns 625.5 ns 614.8 ns 640.9 ns 0.99 0.02 - - - -
TrailingZeroCount_uint \corerun-master\CoreRun.exe 633.5 ns 8.48 ns 7.93 ns 631.2 ns 622.1 ns 648.7 ns 1.00 0.00 - - - -
TrailingZeroCount_ulong \corerun-bsf-bsr\CoreRun.exe 636.7 ns 11.54 ns 10.79 ns 632.3 ns 621.0 ns 661.9 ns 1.01 0.02 - - - -
TrailingZeroCount_ulong \corerun-master\CoreRun.exe 630.5 ns 7.80 ns 7.29 ns 627.5 ns 620.2 ns 648.0 ns 1.00 0.00 - - - -

Note the nice little improvement to Log2 by replacing a subtraction with xor to reverse the index returned by LZCNT

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 4, 2020
@tannergooding tannergooding self-assigned this Apr 4, 2020
@tannergooding
Copy link
Member

I probably won't get to reviewing this before Monday, but assigning myself so I don't forget 😄

@tannergooding
Copy link
Member

Looks like something isn't quite hooked up right as X86Base.IsSupported is returning false for various tests.

@tannergooding
Copy link
Member

CC. @CarolEidt and @echesakovMSFT

Also CC. @davidwrighton to help validate the new IsSupported check is hooked up correctly

@@ -14818,6 +14819,8 @@ emitter::insExecutionCharacteristics emitter::getInsExecutionCharacteristics(ins
result.insLatency += PERFSCORE_LATENCY_2C;
break;

case INS_bsf:
case INS_bsr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we had a way to model Intel vs AMD here.

Instruction Intel AMD
BSF 1.00 4.00
BSR 1.00 5.00
POPCNT 1.00 0.50
LZCNT 1.00 0.50
TZCNT 1.00 0.50

// Arguments:
// node - The hardware intrinsic node
//
void CodeGen::genX86BaseIntrinsic(GenTreeHWIntrinsic* node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed for this PR, but we allowed HW_Category_Scalar intrinsics to be table driven in importation.
It would be nice to do the same here in codegen...

@saucecontrol
Copy link
Member Author

saucecontrol commented Apr 7, 2020

Looks like something isn't quite hooked up right as X86Base.IsSupported is returning false for various tests.

I saw that ArmBase is explicitly added with the Vector base ISAs, even though it looks like it should already be there from the VM side:

https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2313-L2320

https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/codeman.cpp#L1485-L1487

I couldn't see any cases where X86Base wasn't working, but maybe it needs the same treatment if you're seeing a failure.

@tannergooding
Copy link
Member

I couldn't see any cases where X86Base wasn't working, but maybe it needs the same treatment if you're seeing a failure.

Seeing failures in CI for x86.

@saucecontrol
Copy link
Member Author

Ah, I found one:

System.PlatformNotSupportedException : Operation is not supported on this platform.
      Stack Trace:
        /_/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs(34,0): at System.Runtime.Intrinsics.X86.X86Base.X64.BitScanReverse(UInt64 value)

So it looks like X86Base.X64.IsSupported is returning true when it shouldn't. I'll investigate

@tannergooding
Copy link
Member

We need the instruction sets added here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2205, in the same way as we do for ARM here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2316

It looks like all the other JIT bits exist (comparing to ArmBase and Sse.X64)

@saucecontrol
Copy link
Member Author

The way I read it, the explicit addition of ArmBase is redundant because it's passed in from the VM side in the same way the higher-level ISAs are. Since I've added X86Base unconditionally here:

https://github.com/dotnet/runtime/blob/47bc99ba0b45f8cf2144e590013c61ad911865db/src/coreclr/src/vm/codeman.cpp#L1302

It's always available unless removed based on the config knobs, just like SSE+ are.

Am I missing something there?

@tannergooding
Copy link
Member

The way I read it, the explicit addition of ArmBase is redundant because it's passed in from the VM side in the same way the higher-level ISAs are. Since I've added X86Base unconditionally here:

@davidwrighton, did the base checks get inverted? That is should https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.cpp#L2205 be:

if (!JitConfig.EnableHWIntrinsic())
{
    // Dummy ISAs for simplifying the JIT code
    instructionSetFlags.RemoveInstructionSet(InstructionSet_Vector128);
    instructionSetFlags.RemoveInstructionSet(InstructionSet_Vector256);
}

You are meant to be able to disable the "baseline" as well, so you can test your software fallback.

@saucecontrol
Copy link
Member Author

The Vector ISAs don't actually exist, so they're not passed into the JIT. Unless JitConfig.EnableHWIntrinsic is true, they never get added so you get the software fallbacks.

The real ISAs work the opposite way. The VM passes them in if the hardware supports them and the JIT disables them if the config says to.

X86Base is working as expected with that model. It reports IsSupported if COMPlus_EnableHWIntrinsic is on and not if it's not.

@saucecontrol
Copy link
Member Author

It's probably a good idea to make ArmBase and X86Base handled consistently in compiler.cpp. I actually found the explicit addition of ArmBase with the dummy ISAs confusing, so I'd be in favor of removing that if it's not actually necessary.

If I've misunderstood any of that, I guess we should add X86Base explicitly to make them match 🤷‍♂

@saucecontrol saucecontrol marked this pull request as ready for review April 7, 2020 22:23
@davidwrighton
Copy link
Member

@tannergooding @saucecontrol Vector64/128/256 are set by the JIT itself as Tanner noted.

One of the changes I did not realize I made was to make ArmBase set by the VM side. One of the issues is that the VM now attempts to pass a rationalized set of instruction set values to the JIT, and not enabling ArmBase, then requires the VM side to disable all of the other instruction sets. We could tweak that if we needed to but I need to know its necessary before I futz with it.

I would avoid marking the X86Base as being implied by support for SSE, as that means if it is disabled, then SSE will be disabled.

@davidwrighton
Copy link
Member

And yes, the current model doesn't make sense where we explicitly set the ArmBase instruction set in both the jit and the VM side of the fence.

@saucecontrol
Copy link
Member Author

saucecontrol commented Apr 7, 2020

I would avoid marking the X86Base as being implied by support for SSE, as that means if it is disabled, then SSE will be disabled.

There's presently no way to disable X86Base other than to disable HWIntrinsics entirely. Isn't that the same way ArmBase works? Or would this be a case where disabling HWIntrinsics also disables SIMD?

And yes, the current model doesn't make sense where we explicitly set the ArmBase instruction set in both the jit and the VM side of the fence.

Would removing it from the JIT side be ok for now, given the larger changes necessary to separate it on the VM side?

@tannergooding
Copy link
Member

tannergooding commented Apr 7, 2020

There's presently no way to disable X86Base other than to disable HWIntrinsics entirely. Isn't that the same way ArmBase works? Or would this be a case where disabling HWIntrinsics also disables SIMD?

That is the expectation.

The helper types and the "base" ISAs don't have their own COMPlus knob, instead they are controlled by the COMPlus_EnableHWIntrinsics knob.
They still participate in the hierarchy and so doing EnableHWIntrinsics=0 will also disable Sse and others (we can't very well use Sse intrinsics if the user has disabled HWIntrinsics in general, after all).

@tannergooding
Copy link
Member

I would avoid marking the X86Base as being implied by support for SSE, as that means if it is disabled, then SSE will be disabled.

That is exactly what we want. Disabling X86Base requires disabling HWIntrinsics in general, which should also disable SSE (only intrinsics, not general JIT support outside HWIntrinsics of course).

@tannergooding
Copy link
Member

I logged #35305 to track the emitter's handling of the 'w' and 's' encoding bits.

@saucecontrol
Copy link
Member Author

Rebased on master to bring in #35364, and updated to the new intrinsiclist table layout

@saucecontrol
Copy link
Member Author

Test failures related to Array.Sort gave me a scare for a second, but those were apparently #35438

@CarolEidt
Copy link
Contributor

ping @davidwrighton - could you have a look at the ISA handling here?

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I believe the new instruction set should be enabled by the existing pathways in crossgen2, so no additional crossgen2 change should be needed.

@saucecontrol
Copy link
Member Author

saucecontrol commented Apr 27, 2020

Rebased and re-ran ThunkGenerator to resolve conflicts with #35504 and #341

@CarolEidt CarolEidt merged commit d10a782 into dotnet:master Apr 29, 2020
@tannergooding
Copy link
Member

Thanks for the contribution @saucecontrol!

@saucecontrol saucecontrol deleted the bsf-bsr branch April 29, 2020 05:22
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 arch-x86 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants