Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add xarch blsi #66193

Merged
merged 3 commits into from
Mar 15, 2022
Merged

Add xarch blsi #66193

merged 3 commits into from
Mar 15, 2022

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Mar 4, 2022

This adds a lowering for the pattern AND(x, NEG(x)) to the ExtractLowestSetBit hwintrinsic.

The spmi replay is clean and there is only one asm diff:

; Assembly listing for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-;  V00 arg0         [V00,T00] (  6,  5.50)     int  ->  rsi         single-def
+;  V00 arg0         [V00,T00] (  5,  4.50)     int  ->  rsi         single-def
;* V01 loc0         [V01    ] (  0,  0   )     int  ->  zero-ref   
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  3,  2   )     int  ->  rcx        
;  V04 tmp2         [V04,T01] (  2,  4   )    bool  ->  rcx         "Inlining Arg"
;  V05 cse0         [V05,T03] (  3,  1.50)     ref  ->  rdx         "CSE - moderate"
;
; Lcl frame size = 32

G_M29069_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
       push     rsi
       sub      rsp, 32
       mov      esi, ecx
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 4
       je       SHORT G_M29069_IG04
						;; bbWeight=1    PerfScore 1.25
G_M29069_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 5
       sete     cl
       movzx    rcx, cl
       jmp      SHORT G_M29069_IG05
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      ecx, 1
						;; bbWeight=0.50 PerfScore 0.12
G_M29069_IG05:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       movzx    rcx, cl
       test     ecx, ecx
       jne      SHORT G_M29069_IG07
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG06:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      rcx, 0xD1FFAB1E      ; string handle
       mov      rdx, gword ptr [rcx]
       ; gcrRegs +[rdx]
       mov      rcx, rdx
       ; gcrRegs +[rcx]
       call     hackishModuleName:hackishMethodName()
       ; gcrRegs -[rcx rdx]
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG07:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
+      blsi     eax, esi
-      mov      eax, esi
-      neg      eax
-      and      eax, esi
       shl      eax, 28
+						;; bbWeight=1    PerfScore 1.00
-						;; bbWeight=1    PerfScore 1.25
G_M29069_IG08:        ; , epilog, nogc, extend
       add      rsp, 32
       pop      rsi
       ret      
						;; bbWeight=1    PerfScore 1.75
+; Total bytes of code 70, prolog size 5, PerfScore 17.63, instruction count 22, allocated bytes for code 70 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
-; Total bytes of code 71, prolog size 5, PerfScore 17.98, instruction count 24, allocated bytes for code 71 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; ============================================================

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x05
  CountOfUnwindCodes: 2
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)

The value is low but if it is ever used it is an improvement. I chose to open the PR even though the value is low so that even if this is closed anyone else ever wonders why blsi isn't used can see the results of implementing it.

/cc @dotnet/jit-contrib

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 4, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed community-contribution Indicates that the PR has been added by a community member labels Mar 4, 2022
@ghost
Copy link

ghost commented Mar 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This adds a lowering for the pattern AND(x, NEG(x)) to the ExtractLowestSetBit hwintrinsic.

The spmi replay is clean and there is only one asm diff:

; Assembly listing for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-;  V00 arg0         [V00,T00] (  6,  5.50)     int  ->  rsi         single-def
+;  V00 arg0         [V00,T00] (  5,  4.50)     int  ->  rsi         single-def
;* V01 loc0         [V01    ] (  0,  0   )     int  ->  zero-ref   
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  3,  2   )     int  ->  rcx        
;  V04 tmp2         [V04,T01] (  2,  4   )    bool  ->  rcx         "Inlining Arg"
;  V05 cse0         [V05,T03] (  3,  1.50)     ref  ->  rdx         "CSE - moderate"
;
; Lcl frame size = 32

G_M29069_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
       push     rsi
       sub      rsp, 32
       mov      esi, ecx
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 4
       je       SHORT G_M29069_IG04
						;; bbWeight=1    PerfScore 1.25
G_M29069_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 5
       sete     cl
       movzx    rcx, cl
       jmp      SHORT G_M29069_IG05
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      ecx, 1
						;; bbWeight=0.50 PerfScore 0.12
G_M29069_IG05:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       movzx    rcx, cl
       test     ecx, ecx
       jne      SHORT G_M29069_IG07
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG06:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      rcx, 0xD1FFAB1E      ; string handle
       mov      rdx, gword ptr [rcx]
       ; gcrRegs +[rdx]
       mov      rcx, rdx
       ; gcrRegs +[rcx]
       call     hackishModuleName:hackishMethodName()
       ; gcrRegs -[rcx rdx]
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG07:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
+      blsi     eax, esi
-      mov      eax, esi
-      neg      eax
-      and      eax, esi
       shl      eax, 28
+						;; bbWeight=1    PerfScore 1.00
-						;; bbWeight=1    PerfScore 1.25
G_M29069_IG08:        ; , epilog, nogc, extend
       add      rsp, 32
       pop      rsi
       ret      
						;; bbWeight=1    PerfScore 1.75
+; Total bytes of code 70, prolog size 5, PerfScore 17.63, instruction count 22, allocated bytes for code 70 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
-; Total bytes of code 71, prolog size 5, PerfScore 17.98, instruction count 24, allocated bytes for code 71 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; ============================================================

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x05
  CountOfUnwindCodes: 2
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)

The value is low but if it is ever used it is an improvement. I chose to open the PR even though the value is low so that if even if this is closed anyone else ever wonders why blsi isn't used can see the results of implementing it.

/cc @dotnet/jit-contrib

Author: Wraith2
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member

I'm ok with taking this even if it's not all that commonly hit, but it would be nice to have more coverage.

I'm curious what lead you to choose to work on this item. Was there an example that was an inspiration? Is there some obvious C# pattern that gets us to generating one of these? If so, can you add a test case or two? Or, are there patterns or upstream canonicalization we should be doing to make this more likely?

The diff above seems to come from somebody being clever:

int ct = (int)comparisonType;
return (CompareOptions)((ct & -ct) << 28); // neg and shl

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 9, 2022

I picked up the blsr issue which was up for grabs because it would improve perf slightly. After that having been looking at the bmi1 instructions I did andn because there were clearly some uses in the BCL. It felt like I was leaving a job unfinished to not look at the remaining instructions.

I'm not sure if I want to try and match the pattern for bestr and I'm not going to try tzcnt which left blsi' and blmsk, as you see blsionly has one use and I've got a branch withblmskbut that has no hits in the BCL. If this is merged there is a question of whether it's worth puttingBLMSK` in as well in case it's important for end users.

For each of blsi and blmsk I verified that they match the simple patterns that define them but there are no obvious c# patterns that generate those. Bit manipulation outside flags attributes is quite rare for most of the code we have. These things may be more useful in lower level scenarios than we see in this repro.

What sort of test coverage would you like to see?

@AndyAyersMS
Copy link
Member

What sort of test coverage would you like to see?

A simple standalone test that exercises the code, added to the jit test tree somewhere. The SPMI coverage comes from a benchmark run which means it is likely to be hit in regular testing, but it would be good to have something we know for sure will exercise the new code.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 9, 2022

Something similar to https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Intrinsics/BitOperationsPopCount.cs then? I could add similar projects and cover the 4 bmi1 instructions in it.

@AndyAyersMS
Copy link
Member

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 13, 2022

Test added, failures don't look related.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for adding those tests.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Mar 13, 2022

Alpine libraries test failure is possibly region related? cc @mangod9

Assert failure(PID 24 [0x00000018], Thread: 54 [0x0036]): global_regions_to_decommit[kind].get_num_free_regions() == num_regions_to_decommit_before + (size_t)num_regions_to_decommit[kind]
    File: /__w/1/s/src/coreclr/gc/gc.cpp Line: 12432
    Image: /root/helix/work/correlation/dotnet

Windows libraries test failure (perhaps novel?)

    System.Runtime.InteropServices.HandleCollectorTests.TestHandleCollector [FAIL]
      High limit handle did not trigger a GC
      Expected: True
      Actual:   Fals

Mono LLVM AOT failure is (#66556)

2022-03-13T03:43:00.3828182Z   /__w/1/s/artifacts/bin/mono/Linux.arm64.Release/cross/linux-arm64/opt: mono_aot_hWo4gC/temp.bc: error: Invalid record (Producer: 'LLVM11.1.0' Reader: 'LLVM 11.1.0')
2022-03-13T03:43:00.3846354Z   AOT of image /__w/1/s/artifacts/tests/coreclr/Linux.arm64.Release/JIT/Performance/CodeQuality/Burgers/Burgers/Burgers.dll failed.
2022-03-13T03:43:00.3853924Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/Linux.arm64.Release/JIT/Performance/CodeQuality/Burgers/Burgers/Burgers.dll
2022-03-13T03:43:00.3856091Z   AOTID 8C85F8F3-9F17-D6BC-372A-22815EB68F09
2022-03-13T03:43:00.3858270Z   Executing opt: "/__w/1/s/artifacts/bin/mono/Linux.arm64.Release/cross/linux-arm64/opt" -f -O2 -disable-tail-calls -place-safepoints -spp-all-backedges -mattr=crc,crypto -o "mono_aot_hWo4gC/temp.opt.bc" "mono_aot_hWo4gC/temp.bc"
2022-03-13T03:43:00.3899763Z /__w/1/s/src/mono/msbuild/aot-compile.proj(19,9): error MSB3073: The command "/__w/1/s/artifacts/bin/mono/Linux.arm64.Release/cross/linux-arm64/mono-aot-cross /__w/1/s/artifacts/tests/coreclr/Linux.arm64.Release/JIT/Performance/CodeQuality/Burgers/Burgers/Burgers.dll" exited with code 1.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 13, 2022

Any thoughts on whether I should push the blsmsk to complete the set? it has no hits in the runtime but it's pretty trivially implemented and may be of use to someone.

@AndyAyersMS
Copy link
Member

Any thoughts on whether I should push the blsmsk to complete the set? it has no hits in the runtime but it's pretty trivially implemented and may be of use to someone.

If it's that easy to do, sure.

@Wraith2 Wraith2 mentioned this pull request Mar 13, 2022
@mangod9
Copy link
Member

mangod9 commented Mar 14, 2022

Yeah the alpine failure is related to Regions. @PeterSolMS had fixed a similar issue recently in #66495, but perhaps more work is required. Believe it might be intermittent failure.

@AndyAyersMS AndyAyersMS merged commit 6bf873a into dotnet:main Mar 15, 2022
@AndyAyersMS
Copy link
Member

@Wraith2 thanks!

@Wraith2 Wraith2 deleted the add-xarch-blsi branch March 15, 2022 01:00
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
* implement blsi

* add bmi intrinsics test projects

* add using System for Console.
@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants