Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Some test fixes for the x86 HWIntrinsics #18734

Merged
merged 3 commits into from
Jul 2, 2018
Merged

Some test fixes for the x86 HWIntrinsics #18734

merged 3 commits into from
Jul 2, 2018

Conversation

tannergooding
Copy link
Member

@tannergooding tannergooding commented Jun 30, 2018

  • This adds back the Avx.MaskLoad tests (temporarily removed due to the API changes made in Improve Intel hardware intrinsic APIs #17637)
  • This removes unnecessary try/catch blocks from the ExtractScalar and InsertScalar tests
    • The only time PNSE should be thrown is in the UnsupportedScenario test, which is the only test run when IsSupported is false. The UnsupportedScenario itself catches the PNSE and sets the test to pass if the exception was caught
  • This fixes the Avx.Extract methods to throw PNSE when IsSupported is false
    • This is the source of the current Avx test failures for COMPlus_FeatureSIMD=0 and COMPlus_EnableAVX=0

@tannergooding
Copy link
Member Author

FYI. @fiigii, @eerhardt, @CarolEidt

@tannergooding
Copy link
Member Author

tannergooding commented Jun 30, 2018

I am working on fixing up the Sse41.Insert float tests that were removed with due to the API changes made in #17637 as well.

Edit: These are now handled by #18735

@tannergooding
Copy link
Member Author

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic

@dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

@@ -241,6 +241,10 @@ public static class Avx
/// </summary>
public static byte Extract(Vector256<byte> value, byte index)
{
if (!IsSupported)
Copy link
Member Author

@tannergooding tannergooding Jul 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we want any methods to work when IsSupported is false, even helpers.

We should also investigate if this is really better than (the below is what all three major native compilers do):

vpextrb        ; When index is into the lower 128-bits

or

vextractf128   ; When index is into the upper 128-bits
vpextrb

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also investigate if this is really better than

Actually, RyuJIT also generates the above code as well as native compilers. The managed implementation here is just for the non-constant fallback. So, I believe this is really better than calling into one (lower 128-bits) or two (upper 128-bits) large jump-tables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I'm not so sure here. I always forget that the jit implementation exists (when I see the managed implementation here) and this is one of the smaller jump tables that we could generate (considering we only need to handle 32 cases max)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say that, as a min bar, we should add a comment indicating that this is the software fallback.

However, I think that having this "different" software fallback goes against the principle of ensuring the helper indirect invocations execute the same instructions that the hardware accelerated path uses.

As it is now, we need to ensure that the software behavior is semantically equivalent to the hardware instructions that would have been executed (and it has already had to have been fixed a number of times due to minor issues).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I think that having this "different" software fallback goes against the principle of ensuring the helper indirect invocations execute the same instructions that the hardware accelerated path uses.

I agree; I think there's a lot of value in the uniformity of the current non-constant approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, some native compilers (i.e., clang) also optimize the non-const Extract/Insert to the memory operations. It would be better to get some data when we unify the implementation.

Copy link
Member Author

@tannergooding tannergooding Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fiigii, I think the point is that the value of uniformity far outweighs any slight perf increase (if one exists) for the non-constant case (especially considering that the non-constant case already takes a perf hit from the call indirection/etc and is expected to be an edge case anyways).

For the constant case, all three compilers definitely emit the SIMD instructions.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I just meant the non-const situations.

@@ -523,6 +539,11 @@ public static Vector256<uint> Insert(Vector256<uint> value, uint data, byte inde
/// </summary>
public static Vector256<long> Insert(Vector256<long> value, long data, byte index)
{
if (IntPtr.Size != 8)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have some long/ulong helpers work on 32-bit and others not (disabled here since these were the only 64-bit helpers that were "working").

We should have a discussion on whether we:

  • Always support 64-bit overloads (when IsSupported is true)
    • This will require emulation on 32-bit
    • There is one instruction that I know of (movq) that deals with scalars and works on 32-bit
    • The vector intrinsics don't directly deal with 64-bit registers so Vector128<long/ulong> tends to work
  • Sometimes supports 64-bit overloads (depending on the underlying instruction and if it is a helper or not)
    • If we go this route, we should probably come up with a partitioning scheme for telling users these aren't supported on 32-bit
  • Never support 64-bit overloads on 32-bit OS

CC. @eerhardt

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have some long/ulong helpers work on 32-bit and others not

Agree. I think we should disable these two helpers for 32-bit platforms, native compilers also do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So with this change, all long/ulong operations will throw PNSE when running in a 32-bit process?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eerhardt, not quite.

There are two basic types of operations:

  • Those that take/return only Vector128<long> or Vector128<ulong>
    • These are fine in both 32-bit and 64-bit mode, as they operation on 128-bit registers (which exist in both 32-bit and 64-bit modes), on 64-bit memory (Scalar Vector128<T>) , or on 128/256-bit memory (Packed Vector128<T> and Packed Vector256<T>)
  • Those that take/return long or ulong directly, which falls into two-subcategories
    • Those that require a 64-bit register and cannot work on 32-bit
      • Sse2.StoreNonTemporal and Sse2.ConvertToInt64(Vector128<double>) are examples that don't work, as they require at least one 64-bit register
    • Those that can operate directly on 64-bit memory, and don't require a register
      • These sometimes work on 32-bit machines
      • ConvertToInt64(Vector128<long>) is an example that, as it uses a special encoding of the movq instruction
      • ConvertScalarToVector128Double(Vector128<double>, long) is an example of one that supports taking just a 64-bit memory address, but is still not encodable on a 32-bit machine

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we shouldn't do anything tantamount to emulation when executing on a 32-bit machine. That said, I'm wondering if there are suggestions as to how these limitations/exclusions are reflected in the usage model?

Copy link
Member Author

@tannergooding tannergooding Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm32/64 is differentiating by the Arm and Arm.Arm64 namespaces (however, this is also partially due to these being entirely different architectures, rather than hierarchical, like x86/x64).

Exposing a new class would probably work, but would also break the make Sse2 inherit from Sse proposal since we don't have multiple inheritance or interfaces that can declare static methods.

Duplicating all the members under a separate x64 namespace would also work, but that is a lot of duplicated metadata for relatively few methods.

I had also thought of just updating the reference assembly to have a x86/x64 specific version; but we are exposing all architectures from a single package so the cross-architecture story is nicer (and so you don't need #ifdef)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on an analyzer/documentation to help notify users of these incompatibilities seems like it might the easiest way to do this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
throw new PlatformNotSupportedException();
}

Copy link
Member Author

@tannergooding tannergooding Jul 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as with Extract, we should determine if this is "better" than what all three native compilers do, which is:

vpinsrb        ; When index is into the lower 128-bits
vinsertf128

or

vextractf128
vpinsrb        ; When index is into the upper 128-bits
vinsertf128

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the comment to Extract, the The managed implementation here is just the non-constant fallback.

Copy link

@fiigii fiigii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@eerhardt
Copy link
Member

eerhardt commented Jul 2, 2018

            buffer[index] = data;

I know this isn't part of this change, but do we want to do a range check on index here? It would seem we could get into a bad situation where we are corrupting the stack if someone passed in index > 4.


Refers to: src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Avx.cs:552 in c3945e5. [](commit_id = c3945e5, deletion_comment = False)

@eerhardt
Copy link
Member

eerhardt commented Jul 2, 2018

            buffer[index] = data;

Oh, nevermind. As soon as I hit Enter I read the index &= 0x3 line.


In reply to: 401844633 [](ancestors = 401844633)


Refers to: src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Avx.cs:552 in c3945e5. [](commit_id = c3945e5, deletion_comment = False)

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I only looked at the test changes in the x86/Shared directory - I assume the rest are all auto-generated.

@tannergooding
Copy link
Member Author

Merging to unblock other PRs.

I've logged issues for the two pending discussion points:

@tannergooding tannergooding merged commit e524fbe into dotnet:master Jul 2, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants