Implement Vector.Ceiling / Vector.Floor #31993
Currently, if |
Any clues on what |
I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C# instead. So Mono will get these methods accelerated for free 🙂 cc @tannergooding
That would be very, very nice (as I'm struggling to implement the JIT intrinsics part) but I don't know if there has been any progress on it; #952 has been implemented by dotnet/coreclr#27401 so it should be doable at this point.
#define ROUNDPS_TO_NEAREST_IMM 0b1000;
#define ROUNDPS_TOWARD_NEGATIVE_INFINITY_IMM 0b1001;
#define ROUNDPS_TOWARD_POSITIVE_INFINITY_IMM 0b1010;
#define ROUNDPS_TOWARD_ZERO_IMM 0b1011;
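The four immediates above follow the ROUNDPS/ROUNDPD imm8 layout: bits 1:0 select the rounding mode, bit 2 chooses whether the MXCSR rounding mode overrides it, and bit 3 suppresses the precision (inexact) exception. A minimal C++ sketch that decodes them follows; the `k...` constants, `RoundMode`, `decode_mode`, `uses_mxcsr_mode`, and `suppresses_inexact` are hypothetical names for illustration, not identifiers from the JIT.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the macros under review (names here are illustrative).
constexpr std::uint8_t kRoundToNearest    = 0b1000;
constexpr std::uint8_t kRoundTowardNegInf = 0b1001;
constexpr std::uint8_t kRoundTowardPosInf = 0b1010;
constexpr std::uint8_t kRoundTowardZero   = 0b1011;

// Hypothetical helper type: the two low bits of imm8.
enum class RoundMode : std::uint8_t { Nearest = 0, Down = 1, Up = 2, Truncate = 3 };

// Bits 1:0: rounding mode encoded in the immediate.
constexpr RoundMode decode_mode(std::uint8_t imm8) {
    return static_cast<RoundMode>(imm8 & 0b0011);
}

// Bit 2: when set, the mode comes from MXCSR instead of the immediate.
constexpr bool uses_mxcsr_mode(std::uint8_t imm8) {
    return (imm8 & 0b0100) != 0;
}

// Bit 3: when set, the precision (inexact) exception is not signalled.
constexpr bool suppresses_inexact(std::uint8_t imm8) {
    return (imm8 & 0b1000) != 0;
}

// Compile-time checks of the decoding.
static_assert(decode_mode(kRoundTowardNegInf) == RoundMode::Down, "floor");
static_assert(decode_mode(kRoundTowardPosInf) == RoundMode::Up, "ceiling");
static_assert(suppresses_inexact(kRoundToNearest), "bit 3 is set in all four");
```

Read this way, an immediate of `10` (0b1010) on `vroundps` means round toward positive infinity with the inexact exception suppressed, which is what Ceiling needs.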
I guess now these magic numbers can be replaced with these named constants
Where do you think those macros should go, so that the constants can be shared across simdcodegenxarch.cpp and codegenxarch.cpp? As someone who hasn't done a lot of C++, I'm not sure if they can go in codegen.h or if there's a better place.
I think probably instrsxarch.h would be a good place - other thoughts @dotnet/jit-contrib ?
In any event, I think it's something that can be deferred if you'd prefer (and perhaps open an issue)
I would put the constants into emitxarch.h or codegen.h (under TARGET_XARCH) - they are supposed to be used together with an emitter or codegen function.
In my opinion, instrsxarch.h and other instrs*.h should be used for defining instructions' encodings only.
@echesakovMSFT - I generally agree that instrs*.h should only be used for instruction encodings, but these are constants that are specific to the encodings. I don't think they belong in codegen.h, but emitxarch.h makes a lot of sense.
I think I'll defer this for now. Not only do I not fully understand the codebase yet, but I also cannot tell whether including another header (if we were to move it to emitxarch) just for that constant is a good idea, as I lack experience with C++.
Changed to managed implementation w/ HWIntrinsics, and it looks quite promising:
*************** After end code gen, before unwindEmit()
G_M27646_IG01: ; func=00, offs=000000H, size=0007H, bbWeight=1 PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0005: 000000 sub rsp, 40
IN0006: 000004 vzeroupper
G_M27646_IG02: ; offs=000007H, size=0015H, bbWeight=1 PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call System.Console:WriteLine(float)
IN0004: 00001B nop
G_M27646_IG03: ; offs=00001CH, size=0008H, bbWeight=1 PerfScore 2.25, epilog, nogc, extend
IN0007: 00001C vzeroupper
IN0008: 00001F add rsp, 40
IN0009: 000023 ret
Here's what the JIT intrinsics version looks like on a checked build:
*************** After end code gen, before unwindEmit()
G_M27646_IG01: ; func=00, offs=000000H, size=0007H, bbWeight=1 PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0005: 000000 sub rsp, 40
IN0006: 000004 vzeroupper
G_M27646_IG02: ; offs=000007H, size=0015H, bbWeight=1 PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call System.Console:WriteLine(float)
IN0004: 00001B nop
G_M27646_IG03: ; offs=00001CH, size=0008H, bbWeight=1 PerfScore 2.25, epilog, nogc, extend
IN0007: 00001C vzeroupper
IN0008: 00001F add rsp, 40
IN0009: 000023 ret
So it actually ends up generating identical code.
One concern I have regarding the managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet. (Most? At least FRINTP/FRINTM do not appear to have been exposed.) Otherwise, IMO managed implementations are far better for readability, and they avoid the complexity of having this in the JIT layer.
I had previously attempted this for some methods in: dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various getting missed because of this.
The ARM intrinsics are still a WIP and we haven't started the general porting of existing x86 code paths to have ARM equivalents yet. It is fine, IMO, to just accelerate x86 right now provided we have a tracking issue to also accelerate ARM64 once support for the relevant instructions is added.
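The "accelerate x86 now, add ARM64 later" approach amounts to a guarded fast path with a portable fallback. The sketch below is a C++ analogue, not the managed code from this PR: it assumes GCC or Clang on x86 (for `__builtin_cpu_supports` and the `target` attribute) and uses scalar `std::ceil` everywhere else. `ceil4`, `ceil4_sse41`, `ceil4_scalar`, and `CEIL4_HAS_X86_PATH` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumption: only GCC/Clang on x86 get the accelerated path in this sketch.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define CEIL4_HAS_X86_PATH 1
#endif

#ifdef CEIL4_HAS_X86_PATH
// SSE4.1 is enabled for this one function only, so the rest of the
// translation unit still runs on older hardware.
__attribute__((target("sse4.1")))
static void ceil4_sse41(const float* in, float* out) {
    __m128 v = _mm_loadu_ps(in);
    // 0x02 | 0x08 == 10: round toward +infinity, inexact exception suppressed.
    __m128 r = _mm_round_ps(v, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    _mm_storeu_ps(out, r);
}
#endif

// Portable software path, used on other architectures and pre-SSE4.1 CPUs.
static void ceil4_scalar(const float* in, float* out) {
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = std::ceil(in[i]);
    }
}

// Rounds four floats toward +infinity, picking the path at run time.
void ceil4(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
#ifdef CEIL4_HAS_X86_PATH
    if (__builtin_cpu_supports("sse4.1")) {
        ceil4_sse41(in, out);
        return;
    }
#endif
    ceil4_scalar(in, out);
}
```

An ARM64 path (for example one built on FRINTP) could later be added as another guarded branch without touching callers.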
If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.
simply call
I meant only for new members (for a start). |
It should likely just be done in software. I would not feel comfortable codifying the SSE2 fallback for vectorized The scalar implementation of
For pre-SSE4.1 hardware, you would need to essentially take this algorithm and vectorize it (which, at first glance, should just involve a couple of masks and or'ing the results of the success/failure code paths together). That being said, I'm also not particularly concerned about that code path. SSE4.1 came out in ~2007 and so is nearly 13 years old. Even the x86 emulation layer in Windows 10 on ARM supports these instructions. Additionally, for S.P.Corelib with R2R and some AOT scenarios there is support for the SSE4.1 code path via a runtime check against a cached CPUID value; so the non-SSE4.1 codepath won't be hit in that scenario either. (Happy to hear input from others if there is some use-case/scenario I am missing). |
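One common way to build Floor without `ROUNDPS` is truncate-and-correct: convert to integer with truncation, convert back, subtract 1 in lanes where the result overshot, and use a mask to pass through lanes that are too large to convert (or are NaN). This is not the scalar algorithm referred to above, only a sketch of the mask-and-OR shape. It is written lane by lane in portable C++ so that each step corresponds to the SSE2 operation named in its comment. `Vec4`, `blend`, `mask_of`, and `floor_fallback` are hypothetical names, and the sign of zero is not preserved (for example, `-0.0f` yields `+0.0f`).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Vec4 = std::array<float, 4>;

// Bitwise select: (a & mask) | (b & ~mask), the ANDPS/ANDNPS/ORPS idiom.
static float blend(std::uint32_t mask, float a, float b) {
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    const std::uint32_t ur = (ua & mask) | (ub & ~mask);
    float r;
    std::memcpy(&r, &ur, sizeof r);
    return r;
}

// All-ones or all-zeros lane mask, as a packed compare would produce.
static std::uint32_t mask_of(bool condition) {
    return condition ? 0xFFFFFFFFu : 0u;
}

Vec4 floor_fallback(const Vec4& x) {
    Vec4 out{};
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float v = x[i];
        // CMPLTPS on |v|: false for NaN, infinities, and values >= 2^23,
        // which have no fractional bits and are passed through unchanged.
        const std::uint32_t in_range = mask_of(std::fabs(v) < 8388608.0f);
        // Zero out-of-range lanes so the integer conversion is well defined.
        const float safe = blend(in_range, v, 0.0f);
        // CVTTPS2DQ then CVTDQ2PS: truncate toward zero and convert back.
        const float t = static_cast<float>(static_cast<std::int32_t>(safe));
        // CMPLTPS: truncation rounds negative fractions up, so correct by 1.
        const std::uint32_t overshot = mask_of(t > safe);
        const float adjusted = t - blend(overshot, 1.0f, 0.0f);
        // OR the two code paths together under the range mask.
        out[i] = blend(in_range, adjusted, v);
    }
    return out;
}
```

A Ceiling variant would flip the comparison to `t < safe` and add 1 instead of subtracting.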
I think it's the same in either scenario. There are many benefits to having the implementations in managed code, including lowering the entry barrier, faster iteration, improved codegen in several scenarios, and easier sharing with other runtimes, but there are some drawbacks with the current JIT that might make it unacceptable, even if only done for a few methods initially. We'd need sign-off from the people I tagged above before we could start taking that route.
Formatting Linux x64 is failing, but the generated artifact is 0 bytes? 👀
....I definitely remember Visual Studio complaining about the |
@Gnbrkm41, could you please clarify if you are blocked by anything on this?
I don't think so. It is fully implemented (at least I think so), has tests, and the tests seem to run fine. The new APIs also have XML comments. The only thing that would be nice is addressing #31993 (comment) (replacing the magic numbers elsewhere with the constants defined here), but that isn't strictly necessary.
LGTM
CC. @CarolEidt, @echesakovMSFT, @dotnet/jit-contrib
LGTM - thanks!
Regarding implementing the |
@Gnbrkm41, would you be able to rebase onto or merge this with dotnet/master? It's been a while since the last commit and I'd like to validate this against the latest bits before merging.
0d2fe6f to a83bb51
@tannergooding - Done, now we wait 😄
LGTM
@stephentoub, was there a need to re-run the tests? I am happy to rebase this and run tests again if that is better.
They hadn't been run in four days. I just wanted to make sure nothing crept in since that would cause merging this to break the build.
* Also fix the old path for the new structure
* Oops, again
* Per review suggestion Co-Authored-By: Egor Chesakov <egor.chesakov@microsoft.com>
3781d1d to 3f2cd3e
With no changes made on this branch, I'm not sure if merely re-running tests will actually make any difference? (Not including any changes in the testing environments, of course.) The branch is ~70 commits behind the upstream master branch. I just rebased this branch on master. Hopefully that'll make things safer :^).
That's why I closed it and re-opened it (rather than retriggering the tests in DevOps). CI will effectively rebase it on the latest master in that case.
Oooooh, I did not know that. That makes sense. Thanks!
Thanks for the contribution @Gnbrkm41!
Fixes #20509
Utilises ROUNDPS/ROUNDPD on x64, FRINTP/FRINTM on ARM64.
Example Code:
Assembly generated: (W10 Pro Insiders 19559 x64, Intel i7-8700)
cc @tannergooding, @danmosemsft