-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Some test fixes for the x86 HWIntrinsics #18734
Conversation
FYI. @fiigii, @eerhardt, @CarolEidt |
…sertScalar HWIntrinsic test templates
@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic @dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic @dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic |
@@ -241,6 +241,10 @@ public static class Avx | |||
/// </summary> | |||
public static byte Extract(Vector256<byte> value, byte index) | |||
{ | |||
if (!IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe we want any methods to work when IsSupported
is false, even helpers.
We should also investigate if this is really better than (the below is what all three major native compilers do):
vpextrb ; When index is into the lower 128-bits
or
vextractf128 ; When index is into the upper 128-bits
vpextrb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also investigate if this is really better than
Actually, RyuJIT also generates the above code as well as native compilers. The managed implementation here is just for the non-constant fallback. So, I believe this is really better than calling into one (lower 128-bits) or two (upper 128-bits) large jump-tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. I'm not so sure here. I always forget that the jit implementation exists (when I see the managed implementation here) and this is one of the smaller jump tables that we could generate (considering we only need to handle 32 cases max)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will say that, as a min bar, we should add a comment indicating that this is the software fallback.
However, I think that having this "different" software fallback goes against the principle of ensuring the helper indirect invocations execute the same instructions that the hardware accelerated path uses.
As it is now, we need to ensure that the software behavior is semantically equivalent to the hardware instructions that would have been executed (and it has already had to have been fixed a number of times due to minor issues).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I think that having this "different" software fallback goes against the principle of ensuring the helper indirect invocations execute the same instructions that the hardware accelerated path uses.
I agree; I think there's a lot of value in the uniformity of the current non-constant approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, some native compilers (i.e., clang) also optimize the non-const Extract/Insert to the memory operations. It would be better to get some data when we unify the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii, I think the point is that the value of uniformity far outweighs any slight perf increase (if one exists) for the non-constant case (especially considering that the non-constant
case already takes a perf hit from the call indirection/etc and is expected to be an edge case anyways).
For the constant case, all three compilers definitely emit the SIMD instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I just meant the non-const situations.
@@ -523,6 +539,11 @@ public static Vector256<uint> Insert(Vector256<uint> value, uint data, byte inde | |||
/// </summary> | |||
public static Vector256<long> Insert(Vector256<long> value, long data, byte index) | |||
{ | |||
if (IntPtr.Size != 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have some long/ulong
helpers work on 32-bit and others not (disabled here since these were the only 64-bit helpers that were "working").
We should have a discussion on whether we:
- Always support 64-bit overloads (when
IsSupported
is true)- This will require emulation on 32-bit
- There is one instruction that I know of (
movq
) that deals with scalars and works on 32-bit - The vector intrinsics don't directly deal with 64-bit registers so
Vector128<long/ulong>
tends to work
- Sometimes supports 64-bit overloads (depending on the underlying instruction and if it is a helper or not)
- If we go this route, we should probably come up with a partitioning scheme for telling users these aren't supported on 32-bit
- Never support 64-bit overloads on 32-bit OS
CC. @eerhardt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have some long/ulong helpers work on 32-bit and others not
Agree. I think we should disable these two helpers for 32-bit platforms, native compilers also do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So with this change, all long/ulong
operations will throw PNSE when running in a 32-bit process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eerhardt, not quite.
There are two basic types of operations:
- Those that take/return only
Vector128<long>
orVector128<ulong>
- These are fine in both 32-bit and 64-bit mode, as they operation on 128-bit registers (which exist in both 32-bit and 64-bit modes), on 64-bit memory (Scalar
Vector128<T>
) , or on 128/256-bit memory (PackedVector128<T>
and PackedVector256<T>
)
- These are fine in both 32-bit and 64-bit mode, as they operation on 128-bit registers (which exist in both 32-bit and 64-bit modes), on 64-bit memory (Scalar
- Those that take/return
long
orulong
directly, which falls into two-subcategories- Those that require a 64-bit register and cannot work on 32-bit
Sse2.StoreNonTemporal
andSse2.ConvertToInt64(Vector128<double>)
are examples that don't work, as they require at least one 64-bit register
- Those that can operate directly on 64-bit memory, and don't require a register
- These sometimes work on 32-bit machines
ConvertToInt64(Vector128<long>)
is an example that, as it uses a special encoding of themovq
instructionConvertScalarToVector128Double(Vector128<double>, long)
is an example of one that supports taking just a 64-bit memory address, but is still not encodable on a 32-bit machine
- Those that require a 64-bit register and cannot work on 32-bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that we shouldn't do anything tantamount to emulation when executing on a 32-bit machine. That said, I'm wondering if there are suggestions as to how these limitations/exclusions are reflected in the usage model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arm32/64 is differentiating by the Arm
and Arm.Arm64
namespaces (however, this is also partially due to these being entirely different architectures, rather than hierarchical, like x86/x64).
Exposing a new class would probably work, but would also break the make Sse2 inherit from Sse
proposal since we don't have multiple inheritance or interfaces that can declare static methods.
Duplicating all the members under a separate x64
namespace would also work, but that is a lot of duplicated metadata for relatively few methods.
I had also thought of just updating the reference assembly to have a x86/x64 specific version; but we are exposing all architectures from a single package so the cross-architecture story is nicer (and so you don't need #ifdef
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relying on an analyzer/documentation to help notify users of these incompatibilities seems like it might the easiest way to do this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
throw new PlatformNotSupportedException(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as with Extract
, we should determine if this is "better" than what all three native compilers do, which is:
vpinsrb ; When index is into the lower 128-bits
vinsertf128
or
vextractf128
vpinsrb ; When index is into the upper 128-bits
vinsertf128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the comment to Extract
, the The managed implementation here is just the non-constant fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
I know this isn't part of this change, but do we want to do a range check on Refers to: src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Avx.cs:552 in c3945e5. [](commit_id = c3945e5, deletion_comment = False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though I only looked at the test changes in the x86/Shared directory - I assume the rest are all auto-generated.
Merging to unblock other PRs. I've logged issues for the two pending discussion points: |
Avx.MaskLoad
tests (temporarily removed due to the API changes made in Improve Intel hardware intrinsic APIs #17637)try/catch
blocks from theExtractScalar
andInsertScalar
testsPNSE
should be thrown is in theUnsupportedScenario
test, which is the only test run whenIsSupported
is false. TheUnsupportedScenario
itself catches the PNSE and sets the test to pass if the exception was caughtAvx.Extract
methods to throw PNSE whenIsSupported
is falseAvx
test failures forCOMPlus_FeatureSIMD=0
andCOMPlus_EnableAVX=0