
Implement Vector.Ceiling / Vector.Floor #31993

Merged: 12 commits merged into dotnet:master from Gnbrkm41:vectorceilfloor on Mar 16, 2020

Conversation

@Gnbrkm41 (Contributor) commented Feb 9, 2020

Fixes #20509

Utilises ROUNDPS/ROUNDPD on x64, FRINTP/FRINTM on ARM64.
Example Code:

using System;
using System.Numerics;

public static class Program
{
    private static void Main()
    {
        Vector<float> vec = new Vector<float>(4.5f);
        vec = Vector.Ceiling(vec);
        Console.WriteLine(vec[0]);
    }
}

Assembly generated: (W10 Pro Insiders 19559 x64, Intel i7-8700)

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=001FH, bbWeight=1    PerfScore 7.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0009: 000000 push     rbp
IN000a: 000001 sub      rsp, 80
IN000b: 000005 vzeroupper
IN000c: 000008 lea      rbp, [rsp+50H]
IN000d: 00000D xor      rax, rax
IN000e: 00000F mov      qword ptr [V00 rbp-30H], rax
IN000f: 000013 mov      qword ptr [V00+0x8 rbp-28H], rax
IN0010: 000017 mov      qword ptr [V00+0x10 rbp-20H], rax
IN0011: 00001B mov      qword ptr [V00+0x18 rbp-18H], rax

G_M27646_IG02:        ; offs=00001FH, size=0029H, bbWeight=1    PerfScore 21.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0001: 00001F vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000028 vmovupd  ymmword ptr[V00 rbp-30H], ymm0
IN0003: 00002D vmovupd  ymm0, ymmword ptr[V00 rbp-30H]
IN0004: 000032 vroundps ymm0, ymm0, 10
IN0005: 000038 vmovupd  ymmword ptr[V00 rbp-30H], ymm0
IN0006: 00003D vmovss   xmm0, dword ptr [rbp-30H]
IN0007: 000042 call     System.Console:WriteLine(float)
IN0008: 000047 nop

G_M27646_IG03:        ; offs=000048H, size=0009H, bbWeight=1    PerfScore 3.00, epilog, nogc, extend

IN0012: 000048 vzeroupper
IN0013: 00004B lea      rsp, [rbp]
IN0014: 00004F pop      rbp
IN0015: 000050 ret

cc @tannergooding, @danmosemsft

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

Currently, if COMPlus_EnableSse41=0, the assertion at lsraxarch.cpp L1975 (compiler->getSIMDSupportLevel() >= SIMD_SSE4_Supported) fails. I need to make the JIT not treat the call as an intrinsic when SSE4.1 isn't available...

@Gnbrkm41 changed the title from "Implement Vector.Ceiling / Vector.Floor" to "[WIP] Implement Vector.Ceiling / Vector.Floor" on Feb 9, 2020
@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

Any clues on what ##[error]Failed to build "CoreCLR component". is about for *nix builds? 🙄

@EgorBo (Member) commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using the System.Runtime.Intrinsics APIs in C# instead, so Mono would get these methods accelerated for free 🙂 cc @tannergooding

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C#

That would be very, very nice (as I'm struggling with the JIT intrinsics part), but I don't know whether there has been any progress on it; #952 has been implemented by dotnet/coreclr#27401, so it should be doable at this point.

#define ROUNDPS_TO_NEAREST_IMM 0b1000
#define ROUNDPS_TOWARD_NEGATIVE_INFINITY_IMM 0b1001
#define ROUNDPS_TOWARD_POSITIVE_INFINITY_IMM 0b1010
#define ROUNDPS_TOWARD_ZERO_IMM 0b1011
Member

I guess the magic numbers used elsewhere can now be replaced with these named constants.

@Gnbrkm41 (Contributor, Author) Feb 9, 2020

Where do you think those macros should go in order for the constants to be shared across simdcodegenxarch.cpp and codegenxarch.cpp? As someone who hasn't done a lot of C++, I'm not sure whether they can go in codegen.h or whether there's a better place.

Contributor

I think probably instrsxarch.h would be a good place - other thoughts @dotnet/jit-contrib ?
In any event, I think it's something that can be deferred if you'd prefer (and perhaps open an issue)

@echesakov (Contributor) Mar 12, 2020

I think probably instrsxarch.h would be a good place - other thoughts @dotnet/jit-contrib ?
In any event, I think it's something that can be deferred if you'd prefer (and perhaps open an issue)

I would put the constants into emitxarch.h or codegen.h (under TARGET_XARCH) - they are supposed to be used together with an emitter or codegen function.

In my opinion, instrsxarch.h and other instrs*.h should be used for defining instructions' encodings only.

Contributor

@echesakovMSFT - I generally agree that instrs*.h should only be used for instruction encodings, but these are constants that are specific to the encodings. I don't think they belong in codegen.h, but emitxarch.h makes a lot of sense.

Contributor Author

I think I'll defer this for now. Not only do I not fully understand the codebase yet, I also can't tell whether including another header (if we were to move the constants to emitxarch.h) just for them is a good idea, as I lack experience with C++.

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

Changed to a managed implementation using HWIntrinsics, and it looks quite promising:

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=0007H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0005: 000000 sub      rsp, 40
IN0006: 000004 vzeroupper

G_M27646_IG02:        ; offs=000007H, size=0015H, bbWeight=1    PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call     System.Console:WriteLine(float)
IN0004: 00001B nop

G_M27646_IG03:        ; offs=00001CH, size=0008H, bbWeight=1    PerfScore 2.25, epilog, nogc, extend

IN0007: 00001C vzeroupper
IN0008: 00001F add      rsp, 40
IN0009: 000023 ret

Here's what the JIT intrinsics version looks like on a checked build:

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=0007H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0005: 000000 sub      rsp, 40
IN0006: 000004 vzeroupper

G_M27646_IG02:        ; offs=000007H, size=0015H, bbWeight=1    PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call     System.Console:WriteLine(float)
IN0004: 00001B nop

G_M27646_IG03:        ; offs=00001CH, size=0008H, bbWeight=1    PerfScore 2.25, epilog, nogc, extend

IN0007: 00001C vzeroupper
IN0008: 00001F add      rsp, 40
IN0009: 000023 ret

So it actually ends up generating identical code.
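
For illustration only, here is a minimal sketch of what a managed HWIntrinsics-based Ceiling for Vector<float> could look like; it is an assumption about the approach, not the exact code in this PR. It assumes AVX hardware where Vector<float> is 256 bits wide, uses the Vector<T>/Vector256<T> casts added by dotnet/coreclr#27401, and calls Avx.Ceiling (which emits VROUNDPS). The VectorMathSketch class name is made up for the example.

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class VectorMathSketch
{
    // Hypothetical managed Ceiling for Vector<float>. Assumes AVX hardware where
    // Vector<float> is 256 bits wide; otherwise falls back to per-element rounding.
    public static Vector<float> Ceiling(Vector<float> value)
    {
        if (Avx.IsSupported && Vector<float>.Count == Vector256<float>.Count)
        {
            // Reinterpret Vector<float> as Vector256<float>, round toward +infinity
            // (VROUNDPS), then reinterpret back.
            Vector256<float> v = Vector256.AsVector256(value);
            return Vector256.AsVector(Avx.Ceiling(v));
        }

        // Software fallback: round each element individually.
        float[] result = new float[Vector<float>.Count];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = MathF.Ceiling(value[i]);
        }
        return new Vector<float>(result);
    }
}

On AVX hardware this boils down to the same single vroundps instruction seen in the listings above; without SSE4.1/AVX it falls back to per-element MathF.Ceiling.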

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

One concern I have regarding the managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet (most? At least FRINTP/FRINTM do not appear to have been exposed). Otherwise, IMO managed implementations are much better due to the readability and the complexity avoided by not having this in the JIT layer.

@tannergooding (Member)

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C# instead. So Mono will get these methods accelerated for free

I had previously attempted this for some methods in dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various optimizations getting missed because of this.
I know there have been a few improvements to inlining since then, but they would still need to give sign-off before I would feel comfortable giving the go-ahead.

One concern I have regarding managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet.

The ARM intrinsics are still a WIP and we haven't started the general porting of existing x86 code paths to have ARM equivalents yet. It is fine, IMO, to just accelerate x86 right now provided we have a tracking issue to also accelerate ARM64 once support for the relevant instructions is added.

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.

@EgorBo (Member) commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.

simply call Math.Ceiling/Math.Floor for each component?

I had previously attempted this for some methods in: dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various getting missed because of this.

I meant only for new members (for a start).

@tannergooding (Member)

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1?

It should likely just be done in software. I would not feel comfortable codifying the SSE2 fallback for vectorized Ceil/Floor in the JIT.

The scalar implementation of Ceil/Floor for x64 Windows is essentially:

For pre-SSE4.1 hardware, you would need to essentially take this algorithm and vectorize it (which, at first glance, should just involve a couple of masks and or'ing the results of the success/failure code paths together).

That being said, I'm also not particularly concerned about that code path. SSE4.1 came out in ~2007 and so is nearly 13 years old. Even the x86 emulation layer in Windows 10 on ARM supports these instructions. Additionally, for S.P.Corelib with R2R and some AOT scenarios there is support for the SSE4.1 code path via a runtime check against a cached CPUID value; so the non-SSE4.1 codepath won't be hit in that scenario either. (Happy to hear input from others if there is some use-case/scenario I am missing).
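
For what it's worth, here is a rough sketch of the mask-and-combine style SSE2 fallback described above, for Floor over a single Vector128<float>. This is not code from this PR, only an assumption about how one might vectorize the scalar algorithm. It is valid only for finite inputs whose magnitude is below 2^23 (where float-to-int truncation is exact); a real implementation would also have to blend in the original value for large magnitudes, NaNs, and infinities. The Sse2FloorSketch name is just for the example.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Sse2FloorSketch
{
    public static Vector128<float> Floor(Vector128<float> value)
    {
        // Truncate toward zero via a float -> int -> float round trip (CVTTPS2DQ / CVTDQ2PS).
        Vector128<float> truncated =
            Sse2.ConvertToVector128Single(Sse2.ConvertToVector128Int32WithTruncation(value));

        // Where truncation rounded up (truncated > value), subtract 1 to round toward negative infinity.
        Vector128<float> needsAdjust = Sse.CompareGreaterThan(truncated, value); // all-ones per lane needing adjustment
        Vector128<float> correction = Sse.And(needsAdjust, Vector128.Create(1.0f));
        return Sse.Subtract(truncated, correction);
    }
}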

@tannergooding (Member)

I meant only for new members (for a start).

I think it's the same in either scenario. There are many benefits to having the implementations in managed code, including lowering the entry barrier, faster iteration, improved codegen in several scenarios, and easier sharing with other runtimes, but there are some drawbacks with the current JIT that might make it unacceptable, even if only done for a few methods initially.

We'd need sign-off from the people I tagged above before we could start taking that route.

@Gnbrkm41 changed the title from "[WIP] Implement Vector.Ceiling / Vector.Floor" to "Implement Vector.Ceiling / Vector.Floor" on Feb 9, 2020
@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

The Formatting Linux x64 job is failing, but the generated artifact is 0 bytes? 👀

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

...I definitely remember Visual Studio complaining about the #endif missing something at the end, but it looks like it shouldn't be there. 😳

@tannergooding (Member)

@Gnbrkm41, could you please clarify if you are blocked by anything on this?

@Gnbrkm41 (Contributor, Author) commented Mar 6, 2020

I don't think so. It is fully implemented (at least I think so), has tests, and the tests seem to run fine. The new APIs have XML comments on them as well.

The only thing that would be nice is addressing #31993 (comment) (replacing the magic numbers elsewhere with the constants defined here), but that isn't strictly necessary.

@tannergooding (Member) left a comment

LGTM

@tannergooding (Member)

CC. @CarolEidt, @echesakovMSFT, @dotnet/jit-contrib

@CarolEidt (Contributor) left a comment

LGTM - thanks!


@CarolEidt (Contributor)

Regarding implementing the Vector<T> intrinsics in terms of the hardware intrinsics, I think that's something we should tackle separately - not necessarily only for new intrinsics, but starting with a small set and evaluating the behavior in terms of the impact on the amount of IR generated and on code quality in the presence of inlining and wrapping, etc.

@tannergooding (Member)

@Gnbrkm41, would you be able to rebase this onto, or merge it with, dotnet/master? It's been a while since the last commit and I'd like to validate this against the latest bits before merging.

@Gnbrkm41 (Contributor, Author)

@tannergooding - Done, now we wait 😄

@echesakov (Contributor) left a comment

LGTM

@stephentoub stephentoub reopened this Mar 16, 2020
@Gnbrkm41 (Contributor, Author)

@stephentoub, was there a need to re-run the tests? I am happy to rebase this and run tests again if that is better.

@stephentoub (Member)

was there a need to re-run the tests?

They hadn't been run in four days. I just wanted to make sure nothing had crept in since then that would cause merging this to break the build.

@Gnbrkm41 (Contributor, Author)

With no changes made on this branch, I'm not sure whether merely re-running the tests will actually make any difference (not counting any changes in the testing environments, of course). The branch is ~70 commits behind the upstream master branch.

I just rebased this branch on master. Hopefully that'll make things safer :^).

@stephentoub (Member)

With no changes made on this branch, I'm not sure if merely re-running tests will actually make any difference?

That's why I closed it and re-opened it (rather than retriggering the tests in DevOps). CI will effectively rebase it on the latest master in that case.

@Gnbrkm41 (Contributor, Author) commented Mar 16, 2020

CI will effectively rebase it on the latest master in that case.

Oooooh, I did not know that. That makes sense. Thanks!

@tannergooding (Member)

Thanks for the contribution @Gnbrkm41!

@tannergooding tannergooding merged commit e37df8b into dotnet:master Mar 16, 2020
@Gnbrkm41 Gnbrkm41 deleted the vectorceilfloor branch March 18, 2020 16:23
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Successfully merging this pull request may close these issues.

API Proposal: Ceil, Floor for Vector<T>
8 participants