Add xarch blsi
#66193
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
This adds a lowering for the pattern AND(x, NEG(x)) to the ExtractLowestSetBit hwintrinsic.
The spmi replay is clean and there is only one asm diff:
; Assembly listing for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-; V00 arg0 [V00,T00] ( 6, 5.50) int -> rsi single-def
+; V00 arg0 [V00,T00] ( 5, 4.50) int -> rsi single-def
;* V01 loc0 [V01 ] ( 0, 0 ) int -> zero-ref
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T02] ( 3, 2 ) int -> rcx
; V04 tmp2 [V04,T01] ( 2, 4 ) bool -> rcx "Inlining Arg"
; V05 cse0 [V05,T03] ( 3, 1.50) ref -> rdx "CSE - moderate"
;
; Lcl frame size = 32
G_M29069_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rsi
sub rsp, 32
mov esi, ecx
;; bbWeight=1 PerfScore 1.50
G_M29069_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
cmp esi, 4
je SHORT G_M29069_IG04
;; bbWeight=1 PerfScore 1.25
G_M29069_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
cmp esi, 5
sete cl
movzx rcx, cl
jmp SHORT G_M29069_IG05
;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG04: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
mov ecx, 1
;; bbWeight=0.50 PerfScore 0.12
G_M29069_IG05: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
movzx rcx, cl
test ecx, ecx
jne SHORT G_M29069_IG07
;; bbWeight=1 PerfScore 1.50
G_M29069_IG06: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
mov rcx, 0xD1FFAB1E ; string handle
mov rdx, gword ptr [rcx]
; gcrRegs +[rdx]
mov rcx, rdx
; gcrRegs +[rcx]
call hackishModuleName:hackishMethodName()
; gcrRegs -[rcx rdx]
; gcr arg pop 0
;; bbWeight=0.50 PerfScore 1.75
G_M29069_IG07: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
+ blsi eax, esi
- mov eax, esi
- neg eax
- and eax, esi
shl eax, 28
+ ;; bbWeight=1 PerfScore 1.00
- ;; bbWeight=1 PerfScore 1.25
G_M29069_IG08: ; , epilog, nogc, extend
add rsp, 32
pop rsi
ret
;; bbWeight=1 PerfScore 1.75
+; Total bytes of code 70, prolog size 5, PerfScore 17.63, instruction count 22, allocated bytes for code 70 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
-; Total bytes of code 71, prolog size 5, PerfScore 17.98, instruction count 24, allocated bytes for code 71 (MethodHash=20958e72) for method System.String:GetCompareOptionsFromOrdinalStringComparison(int):int
; ============================================================
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0xd1ffab1e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x05
CountOfUnwindCodes: 2
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 3 * 8 + 8 = 32 = 0x20
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
The value is low, but if it is ever used it is an improvement. I chose to open the PR even though the value is low so that, even if this is closed, anyone else who ever wonders why blsi isn't used can see the results of implementing it. /cc @dotnet/jit-contrib
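The diff above replaces a mov/neg/and sequence with a single blsi. As a rough illustration of why that is safe, here is a minimal C sketch of the same identity, x AND (negated x) isolating the lowest set bit, next to a naive reference. The function names are hypothetical and this is not the JIT's lowering code; it only shows the arithmetic the pattern relies on.

```c
#include <stdint.h>

/* Hypothetical helper: the AND(x, NEG(x)) shape that the lowering matches.
   Unsigned negation (0 - x) is well defined in C, so this is safe for all x. */
static uint32_t lowest_set_bit(uint32_t x) {
    return x & (0u - x);
}

/* Hypothetical reference: scan from bit 0 upward and return the first set bit.
   Returns 0 when no bit is set, which matches the result of x & -x for x == 0. */
static uint32_t lowest_set_bit_ref(uint32_t x) {
    for (uint32_t bit = 1u; bit != 0u; bit <<= 1) {
        if (x & bit) {
            return bit;
        }
    }
    return 0u;
}
```

For example, 12 is binary 1100, so both functions return 4; for an input of 0 both return 0.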
|
I'm ok with taking this even if it's not all that commonly hit, but it would be nice to have more coverage. I'm curious what led you to choose to work on this item. Was there an example that was an inspiration? Is there some obvious C# pattern that gets us to generating one of these? If so, can you add a test case or two? Or, are there patterns or upstream canonicalization we should be doing to make this more likely? The diff above seems to come from somebody being clever: runtime/src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs Lines 1040 to 1041 in a4b8893
|
I picked up the I'm not sure if I want to try and match the pattern for For each of What sort of test coverage would you like to see? |
A simple standalone test that exercises the code, added to the jit test tree somewhere. The SPMI coverage comes from a benchmark run which means it is likely to be hit in regular testing, but it would be good to have something we know for sure will exercise the new code. |
Something similar to https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Intrinsics/BitOperationsPopCount.cs then? I could add similar projects and cover the 4 bmi1 instructions in it. |
Yes, exactly. |
Test added, failures don't look related. |
Looks good. Thanks for adding those tests.
Alpine libraries test failure is possibly region related? cc @mangod9
Windows libraries test failure (perhaps novel?)
Mono LLVM AOT failure is (#66556)
|
Any thoughts on whether I should push the |
If it's that easy to do, sure. |
Yeah, the Alpine failure is related to Regions. @PeterSolMS fixed a similar issue recently in #66495, but perhaps more work is required. I believe it might be an intermittent failure. |
@Wraith2 thanks! |
This adds a lowering for the pattern AND(x, NEG(x)) to the ExtractLowestSetBit hwintrinsic.
The spmi replay is clean and there is only one asm diff (shown above).
The value is low, but if it is ever used it is an improvement. I chose to open the PR even though the value is low so that, even if this is closed, anyone else who ever wonders why blsi isn't used can see the results of implementing it.
/cc @dotnet/jit-contrib