
Allow shuffle and other hwintrinsics that require a constant to stay intrinsic if the operand becomes constant later #102827

Merged
merged 12 commits into from
Jun 1, 2024


@tannergooding (Member) commented May 29, 2024

This builds on #102702 and updates the intrinsics that require a constant operand so that they check again, during rationalization, whether the required input is constant. If that input has since become a constant, we introduce the appropriate hwintrinsic node; otherwise, we continue rewriting the node into a call.

This resolves #11062, resolves #11138, and resolves #9989
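The late re-check described above can be pictured with a toy model. This is a minimal sketch, not the JIT's implementation (the JIT is written in C++ and operates on its own IR). Everything here is hypothetical: the immediate operand is modeled as either an int (a known constant) or a string (a local not yet known to be constant), Rationalize and PropagateConstants are illustrative names, and the 0..255 range check stands in for whatever validity checks the real intrinsic requires.

```csharp
using System;
using System.Collections.Generic;

const string Intrinsic = "Sse2.ShiftRightLogical128BitLane";

// At import time, the shift amount is still the local `size`, not a constant.
object imm = "size";
Console.WriteLine(Rationalize(Intrinsic, imm)); // call Sse2.ShiftRightLogical128BitLane

// Later optimizations (modeled here as a single propagation step) discover size == 1.
imm = PropagateConstants(imm, new Dictionary<string, int> { ["size"] = 1 });
Console.WriteLine(Rationalize(Intrinsic, imm)); // hwintrinsic Sse2.ShiftRightLogical128BitLane, imm=1

// Hypothetical late re-check (standing in for rationalization): produce the hardware
// intrinsic form if the immediate is now a constant in imm8 range, else keep the call.
static string Rationalize(string intrinsic, object immOperand) =>
    immOperand is int value && value is >= 0 and <= 255
        ? $"hwintrinsic {intrinsic}, imm={value}"
        : $"call {intrinsic}";

// Hypothetical toy constant propagation: replace a local name with its known constant.
static object PropagateConstants(object operand, IReadOnlyDictionary<string, int> knownConstants) =>
    operand is string local && knownConstants.TryGetValue(local, out int value) ? value : operand;
```

Mirroring Scenario 1, the first check falls back to a call because size is still a local; once propagation has replaced it with 1, the same check yields the intrinsic form, which is what lets the After listing emit vpsrldq directly.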


Scenario 1

static byte Test1()
{
    Vector128<byte> v = Vector128<byte>.Zero;
    byte size = 1;
    v = Sse2.ShiftRightLogical128BitLane(v, size);
    return Sse41.Extract(v, 0);
}

Before

; Method Program:Test1():ubyte (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       sub      rsp, 72

G_M000_IG02:                ;; offset=0x0004
       vxorps   xmm0, xmm0, xmm0
       vmovaps  xmmword ptr [rsp+0x20], xmm0
       lea      rdx, [rsp+0x20]
       lea      rcx, [rsp+0x30]
       mov      r8d, 1
       call     [System.Runtime.Intrinsics.X86.Sse2:ShiftRightLogical128BitLane(System.Runtime.Intrinsics.Vector128`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector128`1[ubyte]]
       vmovaps  xmm0, xmmword ptr [rsp+0x30]
       vpextrb  eax, xmm0, 0

G_M000_IG03:                ;; offset=0x0030
       add      rsp, 72
       ret      
; Total bytes of code: 53

After

; Method Program:Test1():ubyte (FullOpts)
G_M11031_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M11031_IG02:  ;; offset=0x0000
       vxorps   xmm0, xmm0, xmm0
       vpsrldq  xmm0, xmm0, 1
       vpextrb  eax, xmm0, 0
						;; size=15 bbWeight=1 PerfScore 4.33

G_M11031_IG03:  ;; offset=0x000F
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 16

Scenario 2

static Vector128<float> Test2()
{
    var x = Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);
    var y = Vector128.Create(3, 2, 1, 0);
    return Vector128.Shuffle(x + x, y);
}

Before

; Method Program:Test2():System.Runtime.Intrinsics.Vector128`1[float] (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 64
       mov      rbx, rcx

G_M000_IG02:                ;; offset=0x0008
       vmovups  xmm0, xmmword ptr [reloc @RWD00]
       vmovaps  xmmword ptr [rsp+0x30], xmm0
       vmovups  xmm0, xmmword ptr [reloc @RWD16]
       vmovaps  xmmword ptr [rsp+0x20], xmm0
       lea      rdx, [rsp+0x20]
       lea      r8, [rsp+0x30]
       mov      rcx, rbx
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       mov      rax, rbx

G_M000_IG03:                ;; offset=0x003A
       add      rsp, 64
       pop      rbx
       ret      
RWD00  	dq	0000000200000003h, 0000000000000001h
RWD16  	dq	4080000040000000h, 4100000040C00000h
; Total bytes of code: 64

After

; Method Program:Test2():System.Runtime.Intrinsics.Vector128`1[float] (FullOpts)
G_M49909_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M49909_IG02:  ;; offset=0x0000
       vpermilps xmm0, xmmword ptr [reloc @RWD00], 27
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=17 bbWeight=1 PerfScore 4.25

G_M49909_IG03:  ;; offset=0x0011
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
RWD00  	dq	4080000040000000h, 4100000040C00000h
; Total bytes of code: 18
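In the After listing, the constant indices from Vector128.Create(3, 2, 1, 0) end up as the vpermilps immediate 27. A minimal sketch of that packing, assuming the usual imm8 layout for vpermilps over four 32-bit lanes (two bits per result lane, lane 0 in the low bits). EncodeShuffleImm8 is a hypothetical helper, not a JIT or BCL API, and it rejects indices outside 0..3 rather than modeling how Vector128.Shuffle treats them.

```csharp
using System;

// Scenario 2's indices: result lane 0 takes source element 3, lane 1 takes 2, and so on.
Console.WriteLine(EncodeShuffleImm8(3, 2, 1, 0)); // 27

// Hypothetical helper: packs four 2-bit lane selectors into an imm8.
// Bits [1:0] select the source element for result lane 0, bits [3:2] for lane 1, etc.
static byte EncodeShuffleImm8(int i0, int i1, int i2, int i3)
{
    foreach (int i in new[] { i0, i1, i2, i3 })
    {
        if (i is < 0 or > 3)
            throw new ArgumentOutOfRangeException(nameof(i), "each index must be in 0..3");
    }
    return (byte)(i0 | (i1 << 2) | (i2 << 4) | (i3 << 6));
}
```

Worked out: 3 | (2 << 2) | (1 << 4) | (0 << 6) = 3 + 8 + 16 + 0 = 27, i.e. 0b00011011. This packing is only possible when the index vector is a constant, which is why the non-constant case has to stay a call.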

dotnet-issue-labeler bot added the area-CodeGen-coreclr label May 29, 2024
@tannergooding force-pushed the rationalize-hwintrinsic branch 2 times, most recently from 1f3d6d5 to 670404b, May 29, 2024 18:16
@tannergooding force-pushed the rationalize-hwintrinsic branch 4 times, most recently from 03cf761 to e96a244, June 1, 2024 00:23
@tannergooding marked this pull request as ready for review June 1, 2024 18:18
@tannergooding (Member, Author)

CC @dotnet/jit-contrib, this should be ready for review.

MinOpts size regressions are expected but could be improved in a separate PR. Specifically, we don’t need to spill pass-by-value args, and in some cases we could avoid allocating a new local for the return buffer, since we know the intrinsic APIs aren’t going to do anything “bad” here.

FullOpts improvements are mostly in tests, with a few in production code, because we can now identify many more constants and thus emit the actual intrinsic.

As per the top post, this should unblock the PR that moves a large chunk of the xplat implementation into managed code, allowing us to remove a significant amount of complexity from the JIT.

@TIHan (Contributor) left a comment


LGTM.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI