
Implement Vector.Ceiling / Vector.Floor #31993

Merged: 12 commits merged into dotnet:master from Gnbrkm41:vectorceilfloor on Mar 16, 2020

Conversation

@Gnbrkm41 (Contributor) commented Feb 9, 2020

Fixes #20509

Utilises ROUNDPS/ROUNDPD on x64, FRINTP/FRINTM on ARM64.
Example Code:

using System;
using System.Numerics;

public static class Program
{
    private static void Main()
    {
        Vector<float> vec = new Vector<float>(4.5f);
        vec = Vector.Ceiling(vec);
        Console.WriteLine(vec[0]);
    }
}

Assembly generated: (W10 Pro Insiders 19559 x64, Intel i7-8700)

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=001FH, bbWeight=1    PerfScore 7.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0009: 000000 push     rbp
IN000a: 000001 sub      rsp, 80
IN000b: 000005 vzeroupper
IN000c: 000008 lea      rbp, [rsp+50H]
IN000d: 00000D xor      rax, rax
IN000e: 00000F mov      qword ptr [V00 rbp-30H], rax
IN000f: 000013 mov      qword ptr [V00+0x8 rbp-28H], rax
IN0010: 000017 mov      qword ptr [V00+0x10 rbp-20H], rax
IN0011: 00001B mov      qword ptr [V00+0x18 rbp-18H], rax

G_M27646_IG02:        ; offs=00001FH, size=0029H, bbWeight=1    PerfScore 21.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0001: 00001F vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000028 vmovupd  ymmword ptr[V00 rbp-30H], ymm0
IN0003: 00002D vmovupd  ymm0, ymmword ptr[V00 rbp-30H]
IN0004: 000032 vroundps ymm0, ymm0, 10
IN0005: 000038 vmovupd  ymmword ptr[V00 rbp-30H], ymm0
IN0006: 00003D vmovss   xmm0, dword ptr [rbp-30H]
IN0007: 000042 call     System.Console:WriteLine(float)
IN0008: 000047 nop

G_M27646_IG03:        ; offs=000048H, size=0009H, bbWeight=1    PerfScore 3.00, epilog, nogc, extend

IN0012: 000048 vzeroupper
IN0013: 00004B lea      rsp, [rbp]
IN0014: 00004F pop      rbp
IN0015: 000050 ret

cc @tannergooding, @danmosemsft

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

Currently, if COMPlus_EnableSse41=0, the assertion at lsraxarch.cpp L1975 (compiler->getSIMDSupportLevel() >= SIMD_SSE4_Supported) fails. I need to make the JIT not treat the call as an intrinsic when SSE4.1 isn't available...

@Gnbrkm41 changed the title from "Implement Vector.Ceiling / Vector.Floor" to "[WIP] Implement Vector.Ceiling / Vector.Floor" on Feb 9, 2020
@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

Any clues on what ##[error]Failed to build "CoreCLR component". is about for *nix builds? 🙄

@EgorBo (Member) commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using the System.Runtime.Intrinsics APIs in C# instead, so Mono would get these methods accelerated for free 🙂 cc @tannergooding

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C#

That would be very, very nice (as I'm struggling with the JIT intrinsics part), but I don't know whether there has been any progress on it; #952 has been implemented by dotnet/coreclr#27401, so it should be doable at this point.

#define ROUNDPS_TO_NEAREST_IMM 0b1000
#define ROUNDPS_TOWARD_NEGATIVE_INFINITY_IMM 0b1001
#define ROUNDPS_TOWARD_POSITIVE_INFINITY_IMM 0b1010
#define ROUNDPS_TOWARD_ZERO_IMM 0b1011
Member

I guess the magic numbers used elsewhere can now be replaced with these named constants.

@Gnbrkm41 (Contributor, Author) Feb 9, 2020

Where do you think those macros should go in order for the constants to be shared across simdcodegenxarch.cpp and codegenxarch.cpp? As someone who hasn't done a lot of C++, I'm not sure whether they can go in codegen.h or whether there's a better place.

Contributor

I think probably instrsxarch.h would be a good place - other thoughts @dotnet/jit-contrib ?
In any event, I think it's something that can be deferred if you'd prefer (and perhaps open an issue)

@echesakov (Contributor) Mar 12, 2020

I think probably instrsxarch.h would be a good place - other thoughts @dotnet/jit-contrib ?
In any event, I think it's something that can be deferred if you'd prefer (and perhaps open an issue)

I would put the constants into emitxarch.h or codegen.h (under TARGET_XARCH) - they are supposed to be used together with an emitter or codegen function.

In my opinion, instrsxarch.h and other instrs*.h should be used for defining instructions' encodings only.

Contributor

@echesakovMSFT - I generally agree that instrs*.h should only be used for instruction encodings, but these are constants that are specific to the encodings. I don't think they belong in codegen.h, but emitxarch.h makes a lot of sense.

Contributor Author

I think I'll defer this for now. Not only do I not fully understand the codebase yet, I also can't tell whether including another header (if we were to move the constants to emitxarch.h) just for them is a good idea, as I lack experience with C++.

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

Changed to a managed implementation using HWIntrinsics, and it looks quite promising:

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=0007H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0005: 000000 sub      rsp, 40
IN0006: 000004 vzeroupper

G_M27646_IG02:        ; offs=000007H, size=0015H, bbWeight=1    PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call     System.Console:WriteLine(float)
IN0004: 00001B nop

G_M27646_IG03:        ; offs=00001CH, size=0008H, bbWeight=1    PerfScore 2.25, epilog, nogc, extend

IN0007: 00001C vzeroupper
IN0008: 00001F add      rsp, 40
IN0009: 000023 ret

Here's what the JIT intrinsics version looks like on a checked build:

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=0007H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0005: 000000 sub      rsp, 40
IN0006: 000004 vzeroupper

G_M27646_IG02:        ; offs=000007H, size=0015H, bbWeight=1    PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call     System.Console:WriteLine(float)
IN0004: 00001B nop

G_M27646_IG03:        ; offs=00001CH, size=0008H, bbWeight=1    PerfScore 2.25, epilog, nogc, extend

IN0007: 00001C vzeroupper
IN0008: 00001F add      rsp, 40
IN0009: 000023 ret

So it actually ends up generating identical code.
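
For illustration only, here is a minimal sketch of what a managed HWIntrinsics-based Ceiling for Vector<float> could look like; it is an assumption about the approach, not the exact code in this PR. It assumes AVX hardware where Vector<float> is 256 bits wide, uses the Vector<T>/Vector256<T> casts added by dotnet/coreclr#27401, and calls Avx.Ceiling (which emits VROUNDPS). The VectorMathSketch class name is made up for the example.

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class VectorMathSketch
{
    // Hypothetical managed Ceiling for Vector<float>. Assumes AVX hardware where
    // Vector<float> is 256 bits wide; otherwise falls back to per-element rounding.
    public static Vector<float> Ceiling(Vector<float> value)
    {
        if (Avx.IsSupported && Vector<float>.Count == Vector256<float>.Count)
        {
            // Reinterpret Vector<float> as Vector256<float>, round toward +infinity
            // (VROUNDPS), then reinterpret back.
            Vector256<float> v = Vector256.AsVector256(value);
            return Vector256.AsVector(Avx.Ceiling(v));
        }

        // Software fallback: round each element individually.
        float[] result = new float[Vector<float>.Count];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = MathF.Ceiling(value[i]);
        }
        return new Vector<float>(result);
    }
}

On AVX hardware this boils down to the same single vroundps instruction seen in the listings above; without SSE4.1/AVX it falls back to per-element MathF.Ceiling.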

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

One concern I have regarding the managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet (most? At least FRINTP/FRINTM do not appear to have been exposed). Otherwise, IMO managed implementations are much better due to the readability and the complexity avoided by not having this in the JIT layer.

@tannergooding (Member)

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C# instead. So Mono will get these methods accelerated for free

I had previously attempted this for some methods in dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various optimizations getting missed because of this.
I know there have been a few improvements to inlining since then, but they would still need to give sign-off before I would feel comfortable giving the go-ahead.

One concern I have regarding managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet.

The ARM intrinsics are still a WIP and we haven't started the general porting of existing x86 code paths to have ARM equivalents yet. It is fine, IMO, to just accelerate x86 right now provided we have a tracking issue to also accelerate ARM64 once support for the relevant instructions is added.

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.

@EgorBo (Member) commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.

simply call Math.Ceiling/Math.Floor for each component?

I had previously attempted this for some methods in: dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various getting missed because of this.

I meant only for new members (for a start).

@tannergooding (Member)

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1?

It should likely just be done in software. I would not feel comfortable codifying the SSE2 fallback for vectorized Ceil/Floor in the JIT.

The scalar implementation of Ceil/Floor for x64 Windows is essentially:

For pre-SSE4.1 hardware, you would need to essentially take this algorithm and vectorize it (which, at first glance, should just involve a couple of masks and or'ing the results of the success/failure code paths together).

That being said, I'm also not particularly concerned about that code path. SSE4.1 came out in ~2007 and so is nearly 13 years old. Even the x86 emulation layer in Windows 10 on ARM supports these instructions. Additionally, for S.P.Corelib with R2R and some AOT scenarios there is support for the SSE4.1 code path via a runtime check against a cached CPUID value; so the non-SSE4.1 codepath won't be hit in that scenario either. (Happy to hear input from others if there is some use-case/scenario I am missing).
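
For what it's worth, here is a rough sketch of the mask-and-combine style SSE2 fallback described above, for Floor over a single Vector128<float>. This is not code from this PR, only an assumption about how one might vectorize the scalar algorithm. It is valid only for finite inputs whose magnitude is below 2^23 (where float-to-int truncation is exact); a real implementation would also have to blend in the original value for large magnitudes, NaNs, and infinities. The Sse2FloorSketch name is just for the example.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Sse2FloorSketch
{
    public static Vector128<float> Floor(Vector128<float> value)
    {
        // Truncate toward zero via a float -> int -> float round trip (CVTTPS2DQ / CVTDQ2PS).
        Vector128<float> truncated =
            Sse2.ConvertToVector128Single(Sse2.ConvertToVector128Int32WithTruncation(value));

        // Where truncation rounded up (truncated > value), subtract 1 to round toward negative infinity.
        Vector128<float> needsAdjust = Sse.CompareGreaterThan(truncated, value); // all-ones per lane needing adjustment
        Vector128<float> correction = Sse.And(needsAdjust, Vector128.Create(1.0f));
        return Sse.Subtract(truncated, correction);
    }
}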

@tannergooding (Member)

I meant only for new members (for a start).

I think it's the same in either scenario. There are many benefits to having the implementations in managed code, including lowering the entry barrier, faster iteration, improved codegen in several scenarios, and easier sharing with other runtimes, but there are some drawbacks with the current JIT that might make it unacceptable, even if only done for a few methods initially.

We'd need sign-off from the people I tagged above before we could start taking that route.

@Gnbrkm41 changed the title from "[WIP] Implement Vector.Ceiling / Vector.Floor" to "Implement Vector.Ceiling / Vector.Floor" on Feb 9, 2020
@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

The Formatting Linux x64 job is failing, but the generated artifact is 0 bytes? 👀

@Gnbrkm41 (Contributor, Author) commented Feb 9, 2020

...I definitely remember Visual Studio complaining about the #endif missing something at the end, but it looks like it shouldn't be there. 😳

@tannergooding (Member)

@Gnbrkm41, could you please clarify if you are blocked by anything on this?

@Gnbrkm41 (Contributor, Author) commented Mar 6, 2020

I don't think so. It is fully implemented (at least I think so), has tests, and the tests seem to run fine. The new APIs have XML comments on them as well.

The only thing that would be nice is addressing #31993 (comment) (replacing the magic numbers elsewhere with the constants defined here), but that isn't strictly necessary.

@tannergooding (Member) left a comment

LGTM

@tannergooding (Member)

CC. @CarolEidt, @echesakovMSFT, @dotnet/jit-contrib

@CarolEidt (Contributor) left a comment

LGTM - thanks!


@CarolEidt (Contributor)

Regarding implementing the Vector<T> intrinsics in terms of the hardware intrinsics, I think that's something we should tackle separately - not necessarily only for new intrinsics, but starting with a small set and evaluating the behavior in terms of the impact on the amount of IR generated and on code quality in the presence of inlining and wrapping, etc.

@tannergooding (Member)

@Gnbrkm41, would you be able to rebase this onto, or merge it with, dotnet/master? It's been a while since the last commit and I'd like to validate this against the latest bits before merging.

@Gnbrkm41 (Contributor, Author)

@tannergooding - Done, now we wait 😄

@echesakov (Contributor) left a comment

LGTM

@stephentoub stephentoub reopened this Mar 16, 2020
@Gnbrkm41 (Contributor, Author)

@stephentoub, was there a need to re-run the tests? I am happy to rebase this and run tests again if that is better.

@stephentoub (Member)

was there a need to re-run the tests?

They hadn't been run in four days. I just wanted to make sure nothing had crept in since then that would cause merging this to break the build.

@Gnbrkm41 (Contributor, Author)

With no changes made on this branch, I'm not sure whether merely re-running the tests will actually make any difference (not counting any changes in the testing environments, of course). The branch is ~70 commits behind the upstream master branch.

I just rebased this branch on master. Hopefully that'll make things safer :^).

@stephentoub (Member)

With no changes made on this branch, I'm not sure if merely re-running tests will actually make any difference?

That's why I closed it and re-opened it (rather than retriggering the tests in DevOps). CI will effectively rebase it on the latest master in that case.

@Gnbrkm41 (Contributor, Author) commented Mar 16, 2020

CI will effectively rebase it on the latest master in that case.

Oooooh, I did not know that. That makes sense. Thanks!

@tannergooding (Member)

Thanks for the contribution @Gnbrkm41!

@tannergooding tannergooding merged commit e37df8b into dotnet:master Mar 16, 2020
@Gnbrkm41 Gnbrkm41 deleted the vectorceilfloor branch March 18, 2020 16:23
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Successfully merging this pull request may close these issues.

API Proposal: Ceil, Floor for Vector<T>
8 participants