Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Implement "IsSupported" for all ISA classes of Intel hardware intrinsics #14020

Merged
merged 2 commits into from
Sep 29, 2017

Conversation

fiigii
Copy link

@fiigii fiigii commented Sep 15, 2017

This PR is implementing IsSupported properties for all ISA classes in System.Runtime.Intrinsics.X86 under JIT compilation (not work with AOT).

Close #13982

@fiigii fiigii changed the title [WIP] Implement "IsSuppoored" for all ISA classes of Intel hardware intrinsics [WIP] Implement "IsSupported" for all ISA classes of Intel hardware intrinsics Sep 15, 2017
@fiigii
Copy link
Author

fiigii commented Sep 15, 2017

@@ -6,7 +6,7 @@ namespace System.Runtime.CompilerServices
{
// Calls to methods or references to fields marked with this attribute may be replaced at
// some call sites with jit intrinsic expansions.
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Constructor | AttributeTargets.Field, Inherited = false)]
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Constructor | AttributeTargets.Field | AttributeTargets.Property, Inherited = false)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not make sense on property. It has to be on the method implementing the property, like:

public static bool IsSupported
{
    [Intrinsic]
    get { return false; }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.

@@ -13,6 +14,7 @@ namespace System.Runtime.Intrinsics.X86
[CLSCompliant(false)]
public static class Aes
{
[Intrinsic]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since every method in this namespace will be marked as intrinsic eventually, I am wondering whether we should rather special case System.Runtime.Intrinsics.* (here)[https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L5086] and set the flag for all methods under this namespaces.

https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L5088

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should rather special case System.Runtime.Intrinsics.* ... and set the flag for all methods under this namespaces.

Doesn't this reduce intrinsic recognition overhead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, each [Intrinsic] adds size to System.Private.CoreLib. It is hard to tell what is faster, and it is not that important because the result is cached. Special casing the namespace will be definitely smaller (rought estimate 5-10 kB of disk footprint).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Could we change IntrinsicsAttribute to be able to apply on Class and Struct (or larger scopes)? Then specially treat System.Runtime.Intrinsics.*?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot apply the attribute on large scopes. It would have to be on each class or struct. It does not look like a particularly great option to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not look like a particularly great option to me.

How about considering this optimization later (in a new issue)? Or you prefer to address it in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do it now, it would save you from editing like dozen of .cs files. But if you prefer to do it separately - it is fine with me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do it now, it would save you from editing like dozen of .cs files.

Good point 😄. But let's consider the "optional expand" issue at first, that is more important.


// Instruction set flags fo// Instruction set flags for Intel hardware intrinsics
#ifdef _TARGET_XARCH_
opts.compSupportsAES = false;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a single "flags" variable should be used instead of all these variables?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, will change to use a 64-bit flag variable.

InstructionSet lookupHWIntrinsicISA(const char* className);
NamedIntrinsic lookupHWIntrinsic(const char* methodName, InstructionSet isa);
InstructionSet isaOfHWIntrinsic(NamedIntrinsic intrinsic);
GenTreePtr impX86HWIntrinsic(NamedIntrinsic intrinsic, CORINFO_METHOD_HANDLE method);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenTree* instead of GenTreePtr

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do. Shall we always use GenTree* instead of GenTreePtr?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested this in other PRs and it seems that there is universal agreement about using GenTree* instead of GenTreePtr.

There is no GenTreeCastPtr, no BasicBlockPtr, no CompilerPtr. It's just GenTreePtr, an oddball.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

return result;
}

#ifdef _TARGET_XARCH_
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could all this new code be moved to a separate file? importerxarch.cpp perhaps? This file is already large as is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will separate this new code from importer.cpp to a new file. How about hwintrinsicxarch.pp?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about hwintrinsicxarch.pp?

I assume you mean .cpp. Sounds reasonable to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 😄

@@ -2261,7 +2261,7 @@ void LinearScan::TreeNodeInfoInitSIMD(GenTreeSIMD* simdTree)
{
bool needFloatTemp;
if (varTypeIsSmallInt(simdTree->gtSIMDBaseType) &&
(compiler->getSIMDInstructionSet() == InstructionSet_AVX))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this change part of a different PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #13879

@fiigii
Copy link
Author

fiigii commented Sep 15, 2017

@mellinoe @jkotas Tests failed on

The type or namespace name 'Intrinsics' does not exist in the namespace 'System.Runtime' (are you missing an assembly reference?) [D:\j\workspace\debug_windows---17180f1b\tests\src\JIT\HardwareIntrinsics\IsSupported.csproj

Should I set the assembly reference somewhere?

@mellinoe
Copy link

mellinoe commented Sep 15, 2017

@fiigii That is a good question. The test dependencies are defined here (I think -- it might be in the test_runtime.csproj next to it). It's been a little while since I changed that stuff in coreclr.

But currently, System.Runtime.Intrinsics.x86 is actually not being packaged up at all in corefx. So there's nothing for us to reference to get it here. We probably need to create a standalone package containing the assembly. I can work on that.

@fiigii
Copy link
Author

fiigii commented Sep 15, 2017

We probably need to create a standalone package containing the assembly. I can work on that.

Thank you for the help 😄

@fiigii
Copy link
Author

fiigii commented Sep 20, 2017

Update

  • Moved hardware intrinsic related code to a new file "hwintrinsicxarch.cpp".
  • Removed [Intrinsics] from hardware intrinsic definition, and specially treat System.Runtime.Intrinsics.X86 namespace in intrinsic recognition.
  • Changed ISA variables to a flag variable.
  • Hardcoded the version of System.Runtime.Intrinsics.X86 package to resolve the building error. (Any better solution?).
  • Revoked changes from [RyuJIT] Correct names of AVX that stand for AVX2 #13879. I carefully retain the name of InstructionSet_AVX in legacy SIMD code. I will address the codegen conflict in #14065 rather than just name conflict.

@fiigii
Copy link
Author

fiigii commented Sep 20, 2017

cc @jkotas @mikedn @russellhadley PTAL

@fiigii
Copy link
Author

fiigii commented Sep 20, 2017

These intrinsics cannot be expanded under debug build. @jkotas , do you have any suggestion to address the "optional expansion" issue?

LPCUTF8 className;
LPCUTF8 nameSpace;

HRESULT hrns = GetMDImport()->GetNameOfTypeDef(bmtInternal->pType->GetTypeDefToken(), &className, &nameSpace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to do this check once per type, not once per each method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it can be under defined(_TARGET_X86_) || defined(_TARGET_AMD64_) ifdef.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do.

@jkotas
Copy link
Member

jkotas commented Sep 21, 2017

My proposal would be:

  • Implement the instrinsics by recursively calling itself, e.g. public static Vector256<float> Add(Vector256<float> left, Vector256<float> right) => Add(left, right).
  • When the JIT sees intrinsic recursively calling itself, it should force expand it.

This plan will:

  • Allows the JIT to no expand the intrinsics with optimizations off. This is important for both throughput and resiliency against bugs (when there is a fatal error during JITing, the JIT tried to compile the method again with optimizations off).
  • Makes the intrinsics work even when the optimizations are off (they will be calls).
  • Makes the intrinsics callable by reflection. This is important for parts of the system that interpret the IL.

@jkotas
Copy link
Member

jkotas commented Sep 21, 2017

Also, we will need to disable saving of compiled intrinsic method bodies into NGen images. The IsSupported intrinsics at least. There needs to be a fresh body generated at runtime based on the actual processor capabilities.

@fiigii
Copy link
Author

fiigii commented Sep 21, 2017

When the JIT sees intrinsic recursively calling itself, it should force expand it.

This solution makes sense to me, but it will complicate the intrinsic implementation for other hardware platforms. Originally, I planned to let x86 intrinsics back to C# code on ARM, which always returns false from IsSupported and throws exceptions from others.

There needs to be a fresh body generated at runtime based on the actual processor capabilities.

Don't you mean that we can compile intrinsics and insert into AOT-compiled images at runtime?

@jkotas
Copy link
Member

jkotas commented Sep 21, 2017

I planned to let x86 intrinsics back to C# code on ARM

You can still do that by having two files with the intrinsics implementations - one throwing (included in non-xarch CoreLib builds) and one recursive (included in xarch CoreLib builds).

Don't you mean that we can compile intrinsics and insert into AOT-compiled images at runtime?

Yes, you can say it that way.

@fiigii
Copy link
Author

fiigii commented Sep 21, 2017

You can still do that by having two files with the intrinsics implementations - one throwing (included in non-xarch CoreLib builds) and one recursive (included in xarch CoreLib builds).

This solution sounds pretty cool 😎 , will do.

Yes, you can say it that way.

This AOT solution would be perfect if the start time regression is acceptable. I want to implement all the intrinsics in JIT at first and enable them for AOT later (the AOT work may need your help 😄 ).

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this approach seems reasonable.
I have two main concerns:

  • It appears that this breaks existing code that checks the current instruction set, and
  • This increases the cost of setting the instruction set. That may not be a significant issue, but as we add intrinsics and increase the string comparisons we are doing to identify them, I'm concerned about the impact of JIT throughput. Could you please measure before & after?

src/jit/instr.h Outdated
// Linear order end
// Instruction sets have the linear order only in above area.
// Values of InstructionSet not in this area cannot be compared
// (i.e. compiler->getSIMDInstructionSet() >= InstructionSet_SSE3_4).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the values can no longer be compared at all. Indeed, this change would seem to break the existing code that does comparisons, since if any of the higher bits are set, it will be true that compiler->getSIMDInstructionSet() >= InstructionSet_AVX2, which could lead to problems. I don't think you can make this change without changing all the code that depends on the values of compiler->getSIMDInstructionSet() being in linear order.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the values can no longer be compared at all.

I agree, and this is a temporary solution. Actually, only SIMD intrinsic (System.Numerics.Vectors) code is comparing instruction set values, and the new hardware intrinsic implementation never depends upon the order. In this PR, SIMD still works well because getSIMDInstructionSet only returns SSE2, SSE3_4, or AVX. I will refactor this "ISA comparison" in https://github.com/dotnet/coreclr/issues/14065

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and this is a temporary solution.

Before merging this, then, you need to add a comment indicating what the longer term solution is, and how it will interplay with getFloatingPointInstructionSet() (and getSIMDInstructionSet()).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will do.

}
if (Aes.IsSupported && Bmi1.IsSupported && Bmi2.IsSupported && Fma.IsSupported &&
Lzcnt.IsSupported && Popcnt.IsSupported && Pclmulqdq.IsSupported)
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation is off here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

else
{
result = false;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is difficult to do any real verification of the result here, it might be useful to also add a test that the size of, say, Vector is 128, unless Avx2.IsSupported, in which case it should be 256.

@fiigii
Copy link
Author

fiigii commented Sep 23, 2017

It appears that this breaks existing code that checks the current instruction set, and

@CarolEidt No, it does not. Currently, the new hardware intrinsic implementation does not overlap with the existing SIMD code at all. There are indeed some naming and potential codegen conflicts that we discussed in #13879, but do not impact the current codegen. I will solve all the conflicts in https://github.com/dotnet/coreclr/issues/14065 .

This increases the cost of setting the instruction set. That may not be a significant issue, but as we add intrinsics and increase the string comparisons we are doing to identify them, I'm concerned about the impact of JIT throughput. Could you please measure before & after?

I think we are discussing this in https://github.com/dotnet/corefx/issues/23519

@AndyAyersMS
Copy link
Member

It would be nice to expand the tabular scheme you have introduced for the hardware intrinsics to include the non-hardware ones too (right now there's just the one, but there are two more in pending PRs, and likely more to come).

In my opinion this can wait for a subsequent change where we perhaps also auto-generate the recognizer. But we should do this before opening the floodgates on the hardware intrinsic set.

I wonder what the best scheme is for handling prejitted code.

The basic option is to have the target-relevant IsSupported checks to remain calls and skip prejitting for those methods. I don't know if we're going to allow mixed IAS target intrinsics in method bodies (seems like we could but might not want to) -- if so then when prejitting the off-target IsSupporteds must evaluate to false.

We could go beyond that as follows.

We could fail prejitting (or the equivalent) when one of these intrinsics is seen in the importer either as a root method or a callee. Hence any method that contains a target-appropriate IsSupported call will also get jitted and potentially be tailored to the target ISA. This has the drawback that if A calls B and B is an inline candidate with an IsSupported call, we might prejit A, and B will not end up getting inlined.

A third option is for the jit to remember if it saw any of these intrinsics during importation for a method during prejit codegen, and report this back to the VM at the end of jittting, or perhaps fail but only after inlining once all the intrinsics have been seen (or let the failure to import B as an inlinee cascade up and cause A to fail too). Then neither A or B is prejitted and at jit-time B can get inlined into A.

@jkotas
Copy link
Member

jkotas commented Sep 23, 2017

We could fail prejitting (or the equivalent) when one of these intrinsics is seen in the importer either as a root method or a callee

I think this is what we should do to start. It is what's done for the existing Vector<T> SIMD methods as well.

@fiigii fiigii changed the title [WIP] Implement "IsSupported" for all ISA classes of Intel hardware intrinsics Implement "IsSupported" for all ISA classes of Intel hardware intrinsics Sep 27, 2017
@fiigii
Copy link
Author

fiigii commented Sep 27, 2017

@jkotas @CarolEidt PTAL

@fiigii
Copy link
Author

fiigii commented Sep 27, 2017

@jkotas does the latest change look good to you? That passed the tests on debug and release build both.

LPCUTF8 x86className;
LPCUTF8 x86nameSpace;
isX86IntrinsicType = false;
HRESULT hrns = GetMDImport()->GetNameOfTypeDef(bmtInternal->pType->GetTypeDefToken(), &x86className, &x86nameSpace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically just use hr in situations like this

@@ -2956,6 +2956,11 @@ class MethodTableBuilder
, AllocMemTracker *pamTracker
);

#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
private:
bool isX86IntrinsicType;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be called isHardwareIntrinsic and it should not be under ifdef - we will want the same bool for other platform intrinsics, and it should be in say bmtProperties to follow the style of this type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be in say bmtProperties to follow the style of this type.

Don't you mean that isHardwareIntrinsic should be a field of bmtProperties struct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - it is what I meant

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the JIT changes.
I would still like to see some cross-validation of Vector size and the value returned for Avx2.IsSupported(), but I won't insist on it.

@@ -3324,7 +3324,7 @@ GenTreePtr Compiler::impIntrinsic(GenTreePtr newobjThis,
//
// We disable the inlining of instrinsics for MinOpts.
//
if (!mustExpand && (opts.compDbgCode || opts.MinOpts()))
if (!mustExpand && (opts.compDbgCode || opts.MinOpts()) && !gtIsRecursiveCall(method))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment explaining the purpose of this extra check.

Also, while you're in the neighborhood, can you fix the misspelled word in the existing comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, will do.

@fiigii fiigii force-pushed the issupported branch 2 times, most recently from 9c09635 to 79f13b1 Compare September 28, 2017 05:44
@fiigii
Copy link
Author

fiigii commented Sep 28, 2017

Addressed all the feedback, thank you all 🎓.

@@ -5090,10 +5111,15 @@ MethodTableBuilder::InitNewMethodDesc(
NULL,
NULL);

if (hr == S_OK)
if (hr == S_OK
#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please delete this ifdef here? It is not necessary and it will be become a problem for enabling the HW intrinsics on more platforms later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
LPCUTF8 x86className;
LPCUTF8 x86nameSpace;
bmtProp->fIsHardwareIntrinsic = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary - all bmtProp fields are zero initialized.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#if defined(_TARGET_X86_) || defined(_TARGET_AMD64_)
// All the funtions in System.Runtime.Intrinsics.X86 are hardware intrinsics.
// We specially treat them here to reduce the disk footprint of mscorlib.
if (GetModule()->IsSystem())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, would it possible to more this check to MethodTableBuilder::CheckForSystemTypes ? We do fetch the name there already - it is better to do it just once.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes more sense, will try.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas In BuildMethodTableThrowing, CheckForSystemTypes is called after AllocAndInitMethodDescs(); that sets every MethodDesc to isJitIntrinsic or not. Would it possible to chenge these call sites order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, reordering these methods would likely require more elaborate refactoring - not something to do as part of this change.

I think it is ok to leave it as is. Maybe reduce the number of cases where we fetch the name by adding && !bmtGenerics->HasInstantiation() condition in addition to the IsSystem condition.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -22,6 +22,9 @@
<PackageReference Include="System.Security.Permissions">
<Version>$(CoreFxPackageVersion)</Version>
</PackageReference>
<PackageReference Include="System.Runtime.Intrinsics.X86">
<Version>4.5.0-preview1-25718-03</Version>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mellinoe can I set a more flexible package version here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just use this like the one above:

<Version>$(CoreFxPackageVersion)</Version>

That property will get automatically updated every so often and the package will get upgraded.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last week this property did not work, but today it works fine... Thank you!

@fiigii
Copy link
Author

fiigii commented Sep 28, 2017

@jkotas could you please merge this PR if it looks good?

@jkotas
Copy link
Member

jkotas commented Sep 28, 2017

I will merge once the CI is green. Thank you!

@fiigii
Copy link
Author

fiigii commented Sep 28, 2017

Thanks!

@fiigii
Copy link
Author

fiigii commented Sep 29, 2017

@dotnet-bot test Windows_NT x86 Checked Build and Test

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants