Add xarch `blsi` #66193

Wraith2 · 2022-03-04T13:44:46Z

This adds a lowering for the pattern AND(x, NEG(x)) to the ExtractLowestSetBit hwintrinsic.

The spmi replay is clean and there is only one asm diff:

; Assembly listing for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-;  V00 arg0         [V00,T00] (  6,  5.50)     int  ->  rsi         single-def
+;  V00 arg0         [V00,T00] (  5,  4.50)     int  ->  rsi         single-def
;* V01 loc0         [V01    ] (  0,  0   )     int  ->  zero-ref   
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  3,  2   )     int  ->  rcx        
;  V04 tmp2         [V04,T01] (  2,  4   )    bool  ->  rcx         "Inlining Arg"
;  V05 cse0         [V05,T03] (  3,  1.50)     ref  ->  rdx         "CSE - moderate"
;
; Lcl frame size = 32

G_M29069_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
       push     rsi
       sub      rsp, 32
       mov      esi, ecx
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 4
       je       SHORT G_M29069_IG04
						;; bbWeight=1    PerfScore 1.25
G_M29069_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 5
       sete     cl
       movzx    rcx, cl
       jmp      SHORT G_M29069_IG05
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      ecx, 1
						;; bbWeight=0.50 PerfScore 0.12
G_M29069_IG05:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       movzx    rcx, cl
       test     ecx, ecx
       jne      SHORT G_M29069_IG07
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG06:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      rcx, 0xD1FFAB1E      ; string handle
       mov      rdx, gword ptr [rcx]
       ; gcrRegs +[rdx]
       mov      rcx, rdx
       ; gcrRegs +[rcx]
       call     hackishModuleName:hackishMethodName()
       ; gcrRegs -[rcx rdx]
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG07:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
+      blsi     eax, esi
-      mov      eax, esi
-      neg      eax
-      and      eax, esi
       shl      eax, 28
+						;; bbWeight=1    PerfScore 1.00
-						;; bbWeight=1    PerfScore 1.25
G_M29069_IG08:        ; , epilog, nogc, extend
       add      rsp, 32
       pop      rsi
       ret      
						;; bbWeight=1    PerfScore 1.75
+; Total bytes of code 70, prolog size 5, PerfScore 17.63, instruction count 22, allocated bytes for code 70 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
-; Total bytes of code 71, prolog size 5, PerfScore 17.98, instruction count 24, allocated bytes for code 71 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; ============================================================

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x05
  CountOfUnwindCodes: 2
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)

The value is low but if it is ever used it is an improvement. I chose to open the PR even though the value is low so that even if this is closed anyone else ever wonders why blsi isn't used can see the results of implementing it.

/cc @dotnet/jit-contrib

ghost · 2022-03-04T13:44:54Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This adds a lowering for the pattern AND(x, NEG(x)) to the ExtractLowestSetBit hwintrinsic.

The spmi replay is clean and there is only one asm diff:

; Assembly listing for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-;  V00 arg0         [V00,T00] (  6,  5.50)     int  ->  rsi         single-def
+;  V00 arg0         [V00,T00] (  5,  4.50)     int  ->  rsi         single-def
;* V01 loc0         [V01    ] (  0,  0   )     int  ->  zero-ref   
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  3,  2   )     int  ->  rcx        
;  V04 tmp2         [V04,T01] (  2,  4   )    bool  ->  rcx         "Inlining Arg"
;  V05 cse0         [V05,T03] (  3,  1.50)     ref  ->  rdx         "CSE - moderate"
;
; Lcl frame size = 32

G_M29069_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
       push     rsi
       sub      rsp, 32
       mov      esi, ecx
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 4
       je       SHORT G_M29069_IG04
						;; bbWeight=1    PerfScore 1.25
G_M29069_IG03:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       cmp      esi, 5
       sete     cl
       movzx    rcx, cl
       jmp      SHORT G_M29069_IG05
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG04:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      ecx, 1
						;; bbWeight=0.50 PerfScore 0.12
G_M29069_IG05:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
       movzx    rcx, cl
       test     ecx, ecx
       jne      SHORT G_M29069_IG07
						;; bbWeight=1    PerfScore 1.50
G_M29069_IG06:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
       mov      rcx, 0xD1FFAB1E      ; string handle
       mov      rdx, gword ptr [rcx]
       ; gcrRegs +[rdx]
       mov      rcx, rdx
       ; gcrRegs +[rcx]
       call     hackishModuleName:hackishMethodName()
       ; gcrRegs -[rcx rdx]
       ; gcr arg pop 0
						;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG07:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
+      blsi     eax, esi
-      mov      eax, esi
-      neg      eax
-      and      eax, esi
       shl      eax, 28
+						;; bbWeight=1    PerfScore 1.00
-						;; bbWeight=1    PerfScore 1.25
G_M29069_IG08:        ; , epilog, nogc, extend
       add      rsp, 32
       pop      rsi
       ret      
						;; bbWeight=1    PerfScore 1.75
+; Total bytes of code 70, prolog size 5, PerfScore 17.63, instruction count 22, allocated bytes for code 70 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
-; Total bytes of code 71, prolog size 5, PerfScore 17.98, instruction count 24, allocated bytes for code 71 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; ============================================================

Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x05
  CountOfUnwindCodes: 2
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)

The value is low but if it is ever used it is an improvement. I chose to open the PR even though the value is low so that if even if this is closed anyone else ever wonders why blsi isn't used can see the results of implementing it.

/cc @dotnet/jit-contrib

Author:	Wraith2
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2022-03-09T21:12:24Z

I'm ok with taking this even if it's not all that commonly hit, but it would be nice to have more coverage.

I'm curious what lead you to choose to work on this item. Was there an example that was an inspiration? Is there some obvious C# pattern that gets us to generating one of these? If so, can you add a test case or two? Or, are there patterns or upstream canonicalization we should be doing to make this more likely?

The diff above seems to come from somebody being clever:

runtime/src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs

Lines 1040 to 1041 in a4b8893

    
           int ct = (int)comparisonType; 
        
           return (CompareOptions)((ct & -ct) << 28); // neg and shl

Wraith2 · 2022-03-09T21:30:35Z

I picked up the blsr issue which was up for grabs because it would improve perf slightly. After that having been looking at the bmi1 instructions I did andn because there were clearly some uses in the BCL. It felt like I was leaving a job unfinished to not look at the remaining instructions.

I'm not sure if I want to try and match the pattern for bestr and I'm not going to try tzcnt which left blsi' and blmsk, as you see blsionly has one use and I've got a branch withblmskbut that has no hits in the BCL. If this is merged there is a question of whether it's worth puttingBLMSK` in as well in case it's important for end users.

For each of blsi and blmsk I verified that they match the simple patterns that define them but there are no obvious c# patterns that generate those. Bit manipulation outside flags attributes is quite rare for most of the code we have. These things may be more useful in lower level scenarios than we see in this repro.

What sort of test coverage would you like to see?

AndyAyersMS · 2022-03-09T21:50:07Z

What sort of test coverage would you like to see?

A simple standalone test that exercises the code, added to the jit test tree somewhere. The SPMI coverage comes from a benchmark run which means it is likely to be hit in regular testing, but it would be good to have something we know for sure will exercise the new code.

Wraith2 · 2022-03-09T23:02:49Z

Something similar to https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Intrinsics/BitOperationsPopCount.cs then? I could add similar projects and cover the 4 bmi1 instructions in it.

AndyAyersMS · 2022-03-10T01:38:14Z

Something similar to https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Intrinsics/BitOperationsPopCount.cs then?

Yes, exactly.

Wraith2 · 2022-03-13T11:41:12Z

Test added, failures don't look related.

AndyAyersMS

Looks good. Thanks for adding those tests.

AndyAyersMS · 2022-03-13T20:33:08Z

Alpine libraries test failure is possibly region related? cc @mangod9

Assert failure(PID 24 [0x00000018], Thread: 54 [0x0036]): global_regions_to_decommit[kind].get_num_free_regions() == num_regions_to_decommit_before + (size_t)num_regions_to_decommit[kind]
    File: /__w/1/s/src/coreclr/gc/gc.cpp Line: 12432
    Image: /root/helix/work/correlation/dotnet

Windows libraries test failure (perhaps novel?)

    System.Runtime.InteropServices.HandleCollectorTests.TestHandleCollector [FAIL]
      High limit handle did not trigger a GC
      Expected: True
      Actual:   Fals

Mono LLVM AOT failure is (#66556)

2022-03-13T03:43:00.3828182Z   /__w/1/s/artifacts/bin/mono/Linux.arm64.Release/cross/linux-arm64/opt: mono_aot_hWo4gC/temp.bc: error: Invalid record (Producer: 'LLVM11.1.0' Reader: 'LLVM 11.1.0')
2022-03-13T03:43:00.3846354Z   AOT of image /__w/1/s/artifacts/tests/coreclr/Linux.arm64.Release/JIT/Performance/CodeQuality/Burgers/Burgers/Burgers.dll failed.
2022-03-13T03:43:00.3853924Z   Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/Linux.arm64.Release/JIT/Performance/CodeQuality/Burgers/Burgers/Burgers.dll
2022-03-13T03:43:00.3856091Z   AOTID 8C85F8F3-9F17-D6BC-372A-22815EB68F09
2022-03-13T03:43:00.3858270Z   Executing opt: "/__w/1/s/artifacts/bin/mono/Linux.arm64.Release/cross/linux-arm64/opt" -f -O2 -disable-tail-calls -place-safepoints -spp-all-backedges -mattr=crc,crypto -o "mono_aot_hWo4gC/temp.opt.bc" "mono_aot_hWo4gC/temp.bc"
2022-03-13T03:43:00.3899763Z /__w/1/s/src/mono/msbuild/aot-compile.proj(19,9): error MSB3073: The command "/__w/1/s/artifacts/bin/mono/Linux.arm64.Release/cross/linux-arm64/mono-aot-cross /__w/1/s/artifacts/tests/coreclr/Linux.arm64.Release/JIT/Performance/CodeQuality/Burgers/Burgers/Burgers.dll" exited with code 1.

Wraith2 · 2022-03-13T21:04:59Z

Any thoughts on whether I should push the blsmsk to complete the set? it has no hits in the runtime but it's pretty trivially implemented and may be of use to someone.

AndyAyersMS · 2022-03-13T21:19:59Z

Any thoughts on whether I should push the blsmsk to complete the set? it has no hits in the runtime but it's pretty trivially implemented and may be of use to someone.

If it's that easy to do, sure.

mangod9 · 2022-03-14T01:22:16Z

Yeah the alpine failure is related to Regions. @PeterSolMS had fixed a similar issue recently in #66495, but perhaps more work is required. Believe it might be intermittent failure.

AndyAyersMS · 2022-03-15T00:53:50Z

@Wraith2 thanks!

* implement blsi * add bmi intrinsics test projects * add using System for Console.

implement blsi

33003e2

ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 4, 2022

dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed community-contribution Indicates that the PR has been added by a community member labels Mar 4, 2022

deeprobin approved these changes Mar 9, 2022

View reviewed changes

Wraith2 added 2 commits March 12, 2022 02:05

add bmi intrinsics test projects

eebf872

add using System for Console.

3858284

AndyAyersMS approved these changes Mar 13, 2022

View reviewed changes

Wraith2 mentioned this pull request Mar 13, 2022

Add xarch blsmsk #66561

Merged

AndyAyersMS merged commit 6bf873a into dotnet:main Mar 15, 2022

Wraith2 deleted the add-xarch-blsi branch March 15, 2022 01:00

jakobbotsch mentioned this pull request Mar 16, 2022

JIT: Incorrect optimization of x & -x on x86 #66709

Closed

radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022

Add xarch blsi (dotnet#66193)

4481298

* implement blsi * add bmi intrinsics test projects * add using System for Console.

JulieLeeMSFT mentioned this pull request Apr 1, 2022

What's new in .NET 7 Preview 3 [WIP] dotnet/core#7108

Closed

ghost locked as resolved and limited conversation to collaborators Apr 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add xarch `blsi` #66193

Add xarch `blsi` #66193

Wraith2 commented Mar 4, 2022 •

edited

Loading

ghost commented Mar 4, 2022

AndyAyersMS commented Mar 9, 2022

Wraith2 commented Mar 9, 2022

AndyAyersMS commented Mar 9, 2022

Wraith2 commented Mar 9, 2022

AndyAyersMS commented Mar 10, 2022

Wraith2 commented Mar 13, 2022

AndyAyersMS left a comment

AndyAyersMS commented Mar 13, 2022 •

edited

Loading

Wraith2 commented Mar 13, 2022

AndyAyersMS commented Mar 13, 2022

mangod9 commented Mar 14, 2022

AndyAyersMS commented Mar 15, 2022

Add xarch blsi #66193

Add xarch blsi #66193

Conversation

Wraith2 commented Mar 4, 2022 • edited Loading

ghost commented Mar 4, 2022

AndyAyersMS commented Mar 9, 2022

Wraith2 commented Mar 9, 2022

AndyAyersMS commented Mar 9, 2022

Wraith2 commented Mar 9, 2022

AndyAyersMS commented Mar 10, 2022

Wraith2 commented Mar 13, 2022

AndyAyersMS left a comment

Choose a reason for hiding this comment

AndyAyersMS commented Mar 13, 2022 • edited Loading

Wraith2 commented Mar 13, 2022

AndyAyersMS commented Mar 13, 2022

mangod9 commented Mar 14, 2022

AndyAyersMS commented Mar 15, 2022

Add xarch `blsi` #66193

Add xarch `blsi` #66193

Wraith2 commented Mar 4, 2022 •

edited

Loading

AndyAyersMS commented Mar 13, 2022 •

edited

Loading