Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardware instruction set support for crossgen2 #33274

Merged
merged 48 commits into from
Apr 3, 2020

Conversation

davidwrighton
Copy link
Member

@davidwrighton davidwrighton commented Mar 6, 2020

  • Add support for the --instruction-set parameter as described in Allow specifying hardware support level for crossgen2 #226 .
    • NOTE: As the abi for Vector parameters is not yet stable, support for the --instruction-set parameter is only enabled if --inputbubble is also enabled. Parallel work to stabilize the abi is in progress, but is not complete.
  • Add concept of baseline instruction set support to R2R file format
    • Can be applied at a per method level or at the entire R2R file level
  • Refactor how support for hardware intrinsics beyond SSE2 support are handled in crossgen2
    • Add feature to the JIT to detect which hardware features are actually used
    • Tell the JIT unconditionally that SSE42+Lzcnt+Popcnt+Pclmulqdq are supported
    • But if support beyond the --instruction-set specified baseline is used, notate the method with a per-method instruction set support fixup.
    • This enables usage of many intrinsics in corelib with greater efficiency than today
    • This enables usage of SSE42 and below intrinsics safely in non-CoreLib code. Use of higher level intrinsics in non CoreLib code will generate code which does not use the higher level intrinsic, and note that the method's code should not be used in the presence of hardware which does support greater CPU capabilities.
      • In the future a logical enhancement of this work would be to generate multiple bodies of code to handle these more complex cases.
    • In combination with the --instruction-set argument, if Avx2 is enabled, then the logic gracefully adds a dependency on Avx2 capability and Vector<T> becomes useable by crossgen'd code.
      - As a tool to control the behavior of crossgen2 and test some of the behavior around hardware intrinsics, add a command line parameter --compile-aggressive-optimization-methods which changes the default R2R compiler suppression of compiling code marked with the MethodImplOption.AggressiveOptimization.

Work items remaining
[x] Finalize on internal JIT api to query/specify the instruction sets used
[x] Reduce the plethora of different representations of instruction set in the product
[ ] Add ReadyToRunRejectedPrecompiledCode_InstructionSetMismatch flag to ETW Method event.
[ ] Adjust SIMDSupportLevel to be dependent on Sse4.1 instead of 4.2 Also, change handling of intrinsic that generates pcmpgtq instruction to query on support for Sse42 instead of SIMDSupportLevel, so that change is legal to make.
[x] Update JITEEVersionIdentifier
[x] BotR chapter on using hardware intrinsics in AOT code which covers both crossgen1, crossgen2, and runtime behaviors, as well as how the JIT internal logic is supposed to work.

@davidwrighton
Copy link
Member Author

@dotnet/crossgen-contrib This PR is a work in progress, and I haven't actually tested a large swath of code through this. Please review for general structure if you have the time, but I don't expect this to be the exact state of what I check in.
@CarolEidt @tannergooding - Carol/Tanner, could you take a look at how I detect usage of instructions in the Jit and tell me if this is a viable approach. In particular, note the new api compSupportsOptional, which needs to be used extremely carefully. In particular, it can only be used when a false result will not generate code that will not run correctly if the result would have been true at runtime.
@AntonLapounov As discussed, I have a new model for how to make corelib intrinsics use safe instead of the previous, risky strategy of trying to identify a magic set of problematic intrinsics. Please take a look.
@richlander This PR establishes the logic for specifying hardware capabilities to crossgen2. In addition to allowing specification of Avx2 I've reworked how intrinsics can be used in most R2R code (especially SSE42 and below intrinsics). I've left out any ability to control the set of speculatively available instruction sets, as there are complications with making Avx2 speculatively available.

@@ -28,10 +28,12 @@ public static JitConfigProvider Instance
private CorJitFlag[] _jitFlags;
private Dictionary<string, string> _config = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
private object _keepAlive; // Keeps callback delegates alive
private InstructionSetSupport _instructionSetSupport;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JitConfigProvider is a process-wide configuration (due to RyuJIT hosting limitations and no other reason, really). Putting more things in it than strictly necessary ties our hands when we e.g. want to add multi-versioning later.

I would add a InstructionSetSupport GetInstructionSetSupport(MethodWithGCInfo) method on Compilation and have CorInfoImpl call into that when computing the flags in getJitFlags. It would always return the same thing right now, but it will be an obvious point to update if we ever add multi-versioning. I like to leave obvious points like that because it avoids "creative" solutions like adding policies to JitConfigProvider.

(JitConfigProvider stores a bunch CORJIT_FLAGS because I didn't clean it up when we were moving it around in the past few months. I should have, to remove the precedent. It should only store the COMPlus flags for RyuJIT.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) Yeah, agreed, I'll admit I ran out of energy for threading this all around the compiler and wanted to see if it all worked.

@@ -0,0 +1,350 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this file to tools\Common\Compiler?


namespace ILCompiler
{
public class InstructionSetSupport : ICompilationRootProvider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this is in Common, it will trigger the rule that says "stuff in Common shouldn't reference the ReadyToRun namespace".

Instead of implementing ICompilationRootProvider, could you move the piece of logic in that method to the Compilation constructor where we can just _dependencyGraph.AddRoot it? It will also avoid having to expose NodeFactory on IRootingServiceProvider.

@@ -342,6 +344,30 @@ enum ReadyToRunHelper
READYTORUN_HELPER_StackProbe = 0x111,
};

enum ReadyToRunInstructionSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a note about keeping this in sync (generally speaking) with https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/instr.h#L295?

I'd imagine we wouldn't want to add support for a new ISA in one place and forget about adding it in the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, there is a problem here. There are a thicket of different ways in which various closely related portions of the system treat these instruction sets. I currently count 5 different representations (The string one I introduced throughout crossgen2, the ReadyToRunInstructionSet enum, the CORJIT_Flags, the jit internal instruction set enum, and the managed hardware intrinsics classes.) At the moment, I have logic which mostly enforces that the managed hardware intrinsics and the string names are kept in sync, the logic mapping from jit internal instruction sets to strings is setup to produce noway asserts if there isn't a mapping, and the logic mapping string names to CORJIT flags, produces errors if there is no mapping. Thus there is a decent amount of enforcement, but its really messy, and its likely that any actual attempt to add a new instruction set will be really frustrating as a developer finds more and more enums and such to update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fundamentally necessary that these be distinct? Why can't they be shared in some way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All constants and tables used in the ReadyToRun file format are described in the ReadyToRun file format headers. (And nothing else is.) By policy, the common portions of the compiler do not use any ReadyToRun enums or data structures. (Fortunately, the translation between R2R enum and the strings in crossgen is straightforward and transparent.)

More complex is the handling of CORJIT_Flags, and the jit internal instruction set enums. I'd be willing to code up consolidating them on the R2R enum, by removing all of the CORJIT_Flags that are instruction set specific, and either using the R2R enum internally in the JIT, or by making sure the R2R enum values match up with the jit internal enum. (This would involve removing some enum values that have already been defined, under the theory that they cannot be used without feeding them through the rest of the system in any case.) I'd be happy to do this, as the current morass of translations is extremely confusing and risky.

// Answer the question: Is a particular ISA supported?
// Use this api when asking the question so that future
// ISA questions can be asked correctly.
bool compSupportsNoReporting(InstructionSet isa) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this one asks "is the ISA supported" without changing any flags, etc
compSupports asks but also marks the ISA as being used
compSupportsOptional asks and marks it as used if it is considered part of the baseline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite.

compSupportsNoReporting asks is the ISA supported without reporting anything to the VM side. Its extremely dangerous to use, and really only valuable for setting up the set of instruction sets useable internally in the jit. Auditing usage of this is quite easy. If the result of this has any direct impact on codegen, it is a bug.

compSupportsOptional asks is the ISA supported. If it returns true, it will report the instruction set usage to the VM, but if false, it does no reporting. This function is only to be used when returning false will not cause the JIT to generate code that will not work correctly if the code is used on a machine which does support the instruction set. Use of this function should be audited with significant care, as incorrect usage of the return value may cause use of illegal instructions or other incorrect behavior at runtime. Unfortunately, I couldn't see a way to usefully do CPU feature usage detection without implementing an api like this, or more significant restructuring of the jit codebase.

compSupports asks is the ISA supported. The fact that the jit asked such a question is recorded and the generated code will not be used on hardware which does not exactly match the result of the call. For instance, if compSupports(Avx) is called, then if it returns false, the generated code will never be executed on hardware which supports Avx. This is a safe api to use at all times. Over-usage of this api may cause unnecessary prevention of functions from being used at runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could come up with names that make their usage more clear as to the intended usage 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm notoriously bad at naming (the names make sense to me...), so I'd welcome suggestions, but really I'd rather someone from the JIT team chime in on that, as they are the ones that will have to maintain them going forward, and I want it to make sense to them. @dotnet/jit-contrib

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with @tannergooding - these are extremely confusing, and the idea of having anything that's "extremely dangerous to use" in the JIT just sounds like a bad idea. I think we really need to think this through carefully and figure out how to add the appropriate validation.

This statement is quite confusing with the triple negation: "This function is only to be used when returning false will not cause the JIT to generate code that will not work correctly if the code is used on a machine which does support the instruction set." I had to look through the code before I understood what it meant. AFAICT the only reason for this is to prevent mixing code with different ideas of the size of Vector<T>.

How about:

  • bool compCanUse(InstructionSet isa) - this tells the JIT that it's allowed to use a given ISA, but reports no usage to the VM.
  • bool compSetUsed(InstructionSet isa) - this checks whether the JIT is allowed to use a given ISA and if so, tells the VM that it is doing so.
  • void compDisallow(InstructionSet isa) - this tells the VM not to use this code on a machine that supports the given isa. This would handle the Vector<T> case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting scenario I thought of that this wouldn't cover is if you did a Sse41 check for Math.Round, the JIT said "supported with check", and we didn't emit it (we treat it as "unsupported"); but some other part of the method had its own Sse41.IsSupported check and did emit an Sse41 instruction; then we would miss out on that information (which is that "some path" could have benefited from this).

This wouldn't be an issue if we treat it as "supported", but it is something to consider. Likewise, I think we could potentially miss cases with the other proposal if we only rely on compExactlyDependsOn and compOpportunisticallyDependsOn. Such as if some code path emits an Sse41 instruction without the appropriate check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding - I don't understand how the above captures the scenario where the JIT asked if XXX is supported, the answer was "no", but the code is still valid even if the answer had been "yes" (versus the case where it would be incorrect code for the "yes"/supported case).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt, Could you give a small example of such a scenario in psuedo code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding, @CarolEidt, also from the discussion, I am entirely unconvinced that the compSupports SupportedWithCheck return value is actually useful. It might be, but only as a possible optimization we could use in the future if we find that a true/false result is insufficient.

A scenario where there is a check for an instruction set, and the result was false, but the code should still be used is the Math.Round case, or Vector4.Dot case. More generally, while these cases are somewhat rare, they do exist. In our codebase, they can be seen in the spots where I used compSupportsOptional in my change, or where the code uses a construct like getSIMDSupportLevel() and has a viable fallback path that isn't that inefficient. Remember that R2R code isn't really required to be best in class performance, it is instead intended to be pretty good, and sometimes non-ideal output is ok if it works.

A pseudo code example might look like..

float ComplexMathThing(float a, float b, float c)
{
    return MathF.Round(/*Some very complex expression involving a, b, and c*/ );
}

For all of those sorts of uses, I suspect we'd need to have a compSupportsLike function that doesn't record that an instruction is expected to be used. I suspect it be used in exactly the same locations and way as the compSupportsOptional function that started this whole discussion. Thus I believe the actual impact of following Tanner's model of how to think about the problem is just that we might get a better assert message, and it might be easier to explain the model, but the practical effect is much the same.

I still prefer the approach that I outlined about annotating the questions that are asked, but Tanner's approach has the benefit of allowing us to add an assert to the product that verifies that an instruction set was queried for before it was used by the jit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am entirely unconvinced that the compSupports SupportedWithCheck return value is actually useful

I might be inclined to agree. The initial reasoning around this was from the perspective of S.P.Corelib in that we could trust all code paths were "equivalent", and therefore IsSupported could be a cached CPUID check.
But, if we are saying we can't trust user code and therefore we must reJIT if the method could target AVX and the actual hardware supports AVX, then it isn't important.

It might be impactful to some AOT scenarios, but not without some external switch allowing users to opt into that behavior.

@@ -263,6 +263,38 @@ private void PublishCode()
_methodCodeNode.InitializeDebugVarInfos(_debugVarInfos);
#if READYTORUN
_methodCodeNode.InitializeInliningInfo(_inlinedMethods.ToArray());

// Detect cases where the instruction set support used is a superset of the baseline instruction set specification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this saying it will detect AVX as a superset of SSE2 or it only detects instruction sets that are a superset and share the same encoding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its more subtle than that.

The default baseline instruction set is Sse, Sse2. However, the jit compiler is told to generate code assuming that it may use Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Lzcnt, Popcnt, and Pclmqdq. But if the compiler actually uses any of Sse3, Ssse3, Sse41, Sse42, Lzcnt, Popcnt, or Pclmqdq it will notify the crossgen2 managed code. That notification feeds into this logic and if so, we will then mark the method such that the generated code can only be used if the actual hardware supports the instruction set. That's all well and good, and considering that just about all hardware does support these instruction sets, we get crossgen R2R to generate ideal code for all these cases.

The less optimal case is where code attempts to use an Avx, or Avx2 feature, or such. In thoses cases, we will generate code as if the processor does NOT support the feature. In corelib, the idea is that we will audit our code reviews such that such a decision does not make the logic wrong (just slow) (This code auditing is already necessary for our current crossgen1 intrinsics support for corelib), but in the general case we cannot assume that the code will behave correctly if it runs on hardware that does support Avx, Avx2, etc. In that case, what this logic does is record the decision that Avx use was attempted and forbidden. Then the generated code will not be used at runtime, and the JIT will be invoked, if the hardware actually supports Avx/Avx2.

In addition, the --instruction-set argument allows adjusting the baseline instruction set. For instance, if the compiler is run with --instruction-set Sse42, --instruction-set Popcnt --instruction-set Lzcnt, --instruction-set Pclmqdq, then this logic will never kick in, and do anything, as the generated code will never exceed that of the baseline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this is merged, I'd really like to see copious documentation of the behavior. It is potentially quite confusing, and we need to be sure that the precise behavior is discoverable and understandable. Also, we should make sure that the JITDUMP indicates the information it has communicated to the VM, and that the log files at execution time include information about when and why methods are rejected for reasons such as this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that this needs documentation somewhere, the existing scheme is very nearly undescribed, and this is a substantial difference (and when this feature is complete, I don't expect to change the crossgen1 model, and this more advanced handling of intrinsics and such will remain restricted to crossgen2. I expect I'll put a document in the BoTR about this.

Adding JITDUMP data should be easy.

For logging, what sort of thing are you looking for? That's a feature all by itself, as we don't currently have any reasonable form of logging around R2R load behavior for functions. I could add an event, or possibly a STRESSLOG entry or an addition the the debug only LOG. @CarolEidt @tannergooding I'd like some feedback here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a method SetReadyToRunRejectedPrecompiledCode on PrepareCodeConfig that we already use to note when the precompiled R2R version of a method is not usable because of changes in dependent assemblies. This makes it into the ETL so jit event clients can see how many precompiled methods ended up not being usable.

Presumably you could extend this concept.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's convenient. I wasn't aware of that bit, but it should be fairly straightforward to add one bit that indicates that it was dropped due to an instruction set mismatch, and from there looking at the r2rdump output to figure out which exact instruction set was at fault should be straightforward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the SetReadyToRunRejectedPrecompiledCode bit would be set for this condition as well. That bit is not tied to changes in dependent assemblies, its tied to changes in type layout, which are most likely triggered by changes in dependent assemblies, but there are other ways it could be different.

// For ReadyToRun, this list needs to match up with the behavior of FilterNamedIntrinsicMethodAttribs
// In particular, that this list of supported hardware will not generate non-SSE2 safe instruction
// sequences when paired with the behavior in FilterNamedIntrinsicMethodAttribs
// For ReadyToRun we set these hardware features as enabled always, as the overwhelming majority
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, this is what the x86 emulator for Windows 10 on ARM reports (for both my Samsung GalaxyBook 2 and my Microsoft Surface Pro X):
Capture2

Notably, SSE4.2, POPCNT, and LZCNT aren't supported.

  • SSE4.2 and POPCNT were introduced in November 2008, so 11 years old
  • LZCNT is newer than both and is June 2013, so 6 years old
    • AMD introduced LZCNT and POPCNT in 2007 (this was part of their ABM instruction set and uses the same CPUID bits, respectively, that Intel later used)
    • LZCNT, for Intel, is actually newer than AVX (2011) and same time frame as AVX2 (2013)
    • BMI1/BMI2 are the same time frame as AVX2 (2013, Haswell)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh darn, I had thought that SSE4.2 was supported on the emulator, the general strategy here is optimized based on supporting a set that is extremely widely available. As the Arm emulation system only supports SSE4.1, what are your thoughts around changing the SIMDSupportLevel of SIMD_SSE4_Supported to only require SSE4.1? As far as I can tell, the only difference that the jit takes is around support for the pcmpgtq instruction. If we change line 215 in simdcodegenxarch.cpp to check for compSupports(Sse42) instead of using getSIMDSupportLevel() >= SIMD_SSE4_Supported, and changing getSIMDSupportLevel to only require Sse41, I think that would be the only change needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One detail is that supporting a few extra hardware instruction sets isn't problematic, and causes only minimal jit behavior at runtime, but the difference between Sse41 and Sse42 support is very significant as the jit has the concept of simdsupport level which is quite impactful.

Copy link
Member

@tannergooding tannergooding Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be fine, the biggest improvements (with regards to the S.N.Vector types) are likely to come from SSE4.1 anyways, but I would appreciate hearing @CarolEidt's thoughts as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't feel strongly about this, but it seems reasonable to me. One wonders if there are any plans to update the emulator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it interesting that QEMU's CPU emulator has exactly the same restriction to only supporting SSE4.1. I don't intend to put the change into this PR (as it is too large already), but will prepare something separate when this PR is complete.

@dotnet dotnet deleted a comment from azure-pipelines bot Mar 6, 2020
@dotnet dotnet deleted a comment from azure-pipelines bot Mar 6, 2020
@dotnet dotnet deleted a comment from azure-pipelines bot Mar 6, 2020
@CarolEidt
Copy link
Contributor

I think that @AndyAyersMS is going in the right direction with compExactlyDependsOn and compOpportunisticallyDependsOn. To me those don't really imply that they will change the state of the method rather than just a query, but I can't come up with anything better, and I think that they convey the basic difference between the two cases.

@davidwrighton
Copy link
Member Author

@davidwrighton davidwrighton changed the title [WIP] Hardware instruction set support for crossgen2 Hardware instruction set support for crossgen2 Mar 25, 2020
@davidwrighton davidwrighton marked this pull request as ready for review March 25, 2020 01:13
Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial feedback -- need more time to look at the rest of the jit changes.

docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved
docs/design/coreclr/botr/vectors-and-intrinsics.md Outdated Show resolved Hide resolved

The JIT receives flags which instruct it on what instruction sets are valid to use at, and the jit interface api s`notifyInstructionSetUsage(isa, bool supportBehaviorRequired)`.

The notifyInstructionSetUsage api is used to notify the AOT compiler infrastructure that the code may only execute if the runtime environment of the code is exactly the same as the boolean parameter indicates it should be. For instance, if `notifyInstructionSetUsage(Avx, false)` is used, then the code generated must not be used if the `Avx` instruction set is useable. Similarly `notifyInstructionSetUsage(Avx, true)` will indicate that the code may only be used if the `Avx` instruction set is available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The notifyInstructionSetUsage api is used to notify the AOT compiler infrastructure that the code may only execute if the runtime environment of the code is exactly the same as the boolean parameter indicates it should be. For instance, if `notifyInstructionSetUsage(Avx, false)` is used, then the code generated must not be used if the `Avx` instruction set is useable. Similarly `notifyInstructionSetUsage(Avx, true)` will indicate that the code may only be used if the `Avx` instruction set is available.
The notifyInstructionSetUsage api is used to notify the AOT compiler infrastructure that the code may only execute if the runtime environment of the code is exactly the same as the boolean parameter indicates it should be. For instance, if `notifyInstructionSetUsage(Avx, false)` is used, then the code generated must not be used if the `Avx` instruction set is available. Similarly `notifyInstructionSetUsage(Avx, true)` will indicate that the code may only be used if the `Avx` instruction set is available.

Most hardware intrinsics support is tied to the use of various Vector apis. There are 4 major api surfaces that are supported by the runtime

- The fixed length float vectors. `Vector2`, `Vector3`, and `Vector4`. These vector types represent a struct of floats of various lengths. For type layout, ABI and, interop purposes they are represented in exactly the same way as a structure with an appropriate number of floats in it. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between various machines or environment variable settings read at startup of the process. The `T` type variable may be any of the primitive types, and allows use of integer or double data within a vector. The length and alignment of `Vector<T>` is unknown to the developer, and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The T type variable may be any of the primitive types

Maybe this could be reworded? It isn't allowed to be char or bool for example. Only the 8 basic integer types (signed and unsigned int8, int16, int32, and int64) and the 2 basic floating-point types (float32 and float64)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length and alignment of Vector<T> is unknown to the developer, and Vector<T> may not exist in any interop signature

This isn't quite true, it is known at Runtime by checking Vector<T>.Count which specifies it is a struct of exactly Count T fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tannergooding's comments, and would reword the last phrase (after the comma) as: although they are only hardware accelerated on platforms where the Vector.IsHardwareAccelerated property returns true.


private static int UseAvx2(uint value)
{
// THIS IS A BUG!!!!!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth documenting that this is a bug because PopCount might have been rejitted to know Avx2.IsSupported but UseAvx2 might have not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, is this "ok" to do if UseAvx2 is marked AggressiveOptimization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't think that this would be ok even if it is marked AggressiveOptimization - I don't think we should ever rely for correctness on the behavior of optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Carol here. At some point we may change the policy on AggressiveOptimization, and that shouldn't break the correctness of applications. We do parse another attribute, the Ssytem.Runtime.BypassReadyToRunAttribute, which can be used to simply declare that a method shouldn't be ready to runned. This attribute isn't actually in the BCL, the application must define it itself, but it can be used to safely forbid a method from being precompiled. I've updated the document, and added a comment that this pattern IS acceptable when applied to a subset of the x86 and x64 instruction sets, as we have special treatment for those instruction sets in the aot compiler when compiling CoreLib.

@tannergooding
Copy link
Member

Reviewed the botr section on vectors-and-intrinsics and overall LGTM 👍

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did one more pass.

@CarolEidt please look over this PR again when you get a chance.

bool isaSupported = (opts.compSupportsISA & (1ULL << isa)) != 0;
if ((opts.compSupportsISAReported & isaBit) == 0)
{
notifyInstructionSetUsage(isa, isaSupported);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a jitdump here? Eg

Suggested change
notifyInstructionSetUsage(isa, isaSupported);
JITDUMP("Notifying runtime that codegen for this method is %susable if runtime HW supports %s\n",
isaSupported ? "" : "NOT ", InstructionSetToString(isa));
notifyInstructionSetUsage(isa, isaSupported);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notifyInstructionSetUsage does a jitdump internally. (All calls to the api are dumped)

simdBaseType = TYP_UNKNOWN;
}
#endif // TARGET_XARCH

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems odd that we didn't need this before but need this now -- is it just for the R2R case because you modified largestEnregisterableStructSize ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this only happens in the R2R case, and only because of my trickery in largestEnregisterableStructSize. It has the nice property of making the getBaseTypeAndSizeOfSIMDType safe to call at all times, which I thought was a nice upgrade.

@@ -1550,7 +1550,7 @@ var_types Compiler::impNormStructType(CORINFO_CLASS_HANDLE structHnd, var_types*
{
unsigned originalSize = info.compCompHnd->getClassSize(structHnd);

if ((originalSize >= minSIMDStructBytes()) && (originalSize <= maxSIMDStructBytes()))
if ((originalSize >= minSIMDStructBytes()) && (originalSize <= largestEnregisterableStructSize()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes seem fairly subtle... would be easy enough to think the upper check should still be maxSIMDStructBytes().

I think we may need to make it clearer why maxSIMDStructBytes and largestEnregisterableStructSize differ and which one should be called when.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this it would make more sense to me to build a helper function which encapsulates the call to minSIMDStructBytes and largestEnregisterableStructSize. Something like structSizeMightRepresentSIMDType(size_t structSize) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that might make the intention clearer.

- Change command line to take the same parameters as the mono aot compiler
- Properly handle references to the Vector128 and Vector256 apis
- Add structSizeMightRepresentSIMDType instead of using largestEnresterableStructSize directly
- Leaves a good spot for a comment
Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only reviewed the document so far; will review the code next.

Most hardware intrinsics support is tied to the use of various Vector apis. There are 4 major api surfaces that are supported by the runtime

- The fixed length float vectors. `Vector2`, `Vector3`, and `Vector4`. These vector types represent a struct of floats of various lengths. For type layout, ABI and, interop purposes they are represented in exactly the same way as a structure with an appropriate number of floats in it. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between various machines or environment variable settings read at startup of the process. The `T` type variable may be any of the primitive types, and allows use of integer or double data within a vector. The length and alignment of `Vector<T>` is unknown to the developer, and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tannergooding's comments, and would reword the last phrase (after the comma) as: although they are only hardware accelerated on platforms where the Vector.IsHardwareAccelerated property returns true.


- The fixed length float vectors. `Vector2`, `Vector3`, and `Vector4`. These vector types represent a struct of floats of various lengths. For type layout, ABI and, interop purposes they are represented in exactly the same way as a structure with an appropriate number of floats in it. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between various machines or environment variable settings read at startup of the process. The `T` type variable may be any of the primitive types, and allows use of integer or double data within a vector. The length and alignment of `Vector<T>` is unknown to the developer, and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- `Vector64<T>`, `Vector128<T>`, and `Vector256<T>` represent fixed-sized vectors that closely resemble the fixed- sized vectors available in C++. These structures can be used in any code that runs, but very few features are supported. These vector types define very few direct methods; most operations other than creation use the processor specific hardware intrinsics apis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest reword starting with "very few features" as "very few features are supported directly on these types, other than creation. They are primarily used in the processor specific hardware intrinsic apis."


1. Usage of `Vector2`, `Vector3`, `Vector4`, and `Vector<T>`. For these, its always safe to just use the types. The jit will generate code that is as optimal as it can for the logic, and will do so unconditionally.
2. Usage of `Vector64<T>`, `Vector128<T>`, and `Vector256<T>`. These types may be used unconditionally, but are only truly useful when also using the platform specific hardware intrinsics apis.
3. Usage of platform intrinsics apis. All usage of these apis should be wrapped in an `IsSupported` check of the appropriate kind. Then, within the `IsSupported` check, other platform specific apis may be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing as it tends to gloss over the fact that the developer must ensure that there are IsSupported checks for all apis being used.


Hardware intrinsics have dramatic impacts on codegen, and the codegen of these hardware intrinsics is dependent on the ISA available for the target machine when the code is compiled.

If the code is compiled at runtime by the JIT in a just-in-time manner, then the JIT will generate the best code it can based on the current processor's ISA when compiling tier 1 code, and generate functionally equivalent code when compiling tier 0 code. MethodImplOptions.AggressiveOptimization may be used to bypass compilation of tier 0 code and always produce tier 1 code for the method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tends to imply that the compiler will not recognize the intrinsics when compiling tier 0 code, but that's not the case. The recognition of intrinsics is independent of tier.


If the code is compiled at runtime by the JIT in a just-in-time manner, then the JIT will generate the best code it can based on the current processor's ISA when compiling tier 1 code, and generate functionally equivalent code when compiling tier 0 code. MethodImplOptions.AggressiveOptimization may be used to bypass compilation of tier 0 code and always produce tier 1 code for the method.

For ahead of time compilation, the situation is far more complex. This is due to the following principles of how our ahead of time compilation model works.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to use "ahead-of-time" to make it clear that this term is basically an adjective applied to compilation. In fact, I'd suggest to define it early, along with the acronym, and then use AOT in the rest of the document.

#### Characteristics which result from rules
The rules here provide the following characteristics.
- Some platform specific hardware intrinsics can be used in CoreLib without encountering a startup time penalty
- Use of platform specific hardware intrinsics causing runtime jit can be mitigated by careful choice of which intrinsics are used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing, perhaps a little elaboration would be useful.

The rules here provide the following characteristics.
- Some platform specific hardware intrinsics can be used in CoreLib without encountering a startup time penalty
- Use of platform specific hardware intrinsics causing runtime jit can be mitigated by careful choice of which intrinsics are used.
- Use of `Vector<T>` causes runtime jit and startup time concerns. Current analysis indicates this is acceptable, but it is a perennial concern for applications with tight startup time requirements.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after "concerns" add "because it is never precompiled".


private static int UseAvx2(uint value)
{
// THIS IS A BUG!!!!!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't think that this would be ok even if it is marked AggressiveOptimization - I don't think we should ever rely for correctness on the behavior of optimization.

### Code written in other assemblies (both first and third party)

#### Crossgen implementation rules
- Any code which uses an intrinsic out of the `System.Runtime.Intrinsics.Arm` or `System.Runtime.Intrinsics.X86` namespace will not be compiled ahead of time. (See code which throws a TypeLoadException using `IDS_EE_HWINTRINSIC_NGEN_DISALLOWED`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest "out of the" => "from the"

- The baseline instruction set which defaults to (Sse, Sse2), but may be adjusted via compiler option.
- The optimistic instruction set which defaults to (Sse3, Ssse3, Sse41, Sse42, Popcnt, Pclmulqdq, and Lzcnt).

Code will be compiled using the optimistic instruction set to drive compilation, but any use of an instruction set beyond the baseline instruction set will be recorded, or any failure of use of an instruction set that would affect semantic behavior of code. If the baseline instruction set includes `Avx2` then the size and characteristics of of `Vector<T>` is known. Any other decisions about ABI may also be encoded. For instance, it is likely that the ABI of `Vector256<T>` will vary based on the presence/absence of `Avx` support.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "will be recorded" belongs after the "or any failure" clause.

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question and a comment; otherwise I think this is getting close.

case READYTORUN_INSTRUCTION_Popcnt: return InstructionSet_POPCNT;
#endif // TARGET_AMD64
#ifdef TARGET_X86
case READYTORUN_INSTRUCTION_Sse: return InstructionSet_SSE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may have addressed this in an earlier discussion, but I believe that these need to be the same as the TARGET_AMD64 ones - is there a reason they can't share code here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are autogenerated, and it would add a fair amount of complexity to have the autogenerator write them out once. In this particular part of the code, they don't actually have to be the same (that's restricted to a small bit of logic that is in crossgen2, which have explicit asserts for the details that must remain the same.) At the same time, the way they are authored in the InstructionSetDesc.txt, the autogenerator will happen to keep things the same and as long as new code follows the pattern in there, stuff will remain the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I recall that now. Seems reasonable.

@@ -8095,6 +8087,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
else
{
assert(getSIMDSupportLevel() >= SIMD_SSE2_Supported);
compExactlyDependsOn(InstructionSet_AVX2); // Record that AVX2 isn't supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really feels uncomfortable. Might it be worth adding a compExactlyForbids? I know we've been over this before, but this looks like an assertion that it is present, not an assertion that it is not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't been over this particular usage before, as it might have been obscured by some of the other changes you were looking at. I share your discomfort here, Perhaps

// Ensure that code will not execute if an instruction set is useable. Call only
// if the instruction set has previously reported as unuseable, but when
// that that status has not yet been recorded to the AOT compiler
void compVerifyInstructionSetUnuseable(CORINFO_InstructionSet isa)
{
    // use compExactlyDependsOn to ensure that the exact state of the useability of isa is recorded
    bool isaUseable = compExactlyDependsOn(isa);
    // Assert that the useability of the isa is false. If true, this function should never be called.
    assert(!isaUseable);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidwrighton - that sounds like a great alternative; wordy but certainly a context where a little wordiness is probably justified.

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just a couple of minor rewording suggestions on the doc.

# Intrinsics apis
Most hardware intrinsics support is tied to the use of various Vector apis. There are 4 major api surfaces that are supported by the runtime

- The fixed length float vectors. `Vector2`, `Vector3`, and `Vector4`. These vector types represent a struct of floats of various lengths. For type layout, ABI and, interop purposes they are represented in exactly the same way as a structure with an appropriate number of floats in it. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between various machines or environment variable settings read at startup of the process. The `T` type variable may be any of the primitive types, and allows use of integer or double data within a vector. The length and alignment of `Vector<T>` is unknown to the developer, and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations.
- `Vector64<T>`, `Vector128<T>`, and `Vector256<T>` represent fixed-sized vectors that closely resemble the fixed- sized vectors available in C++. These structures can be used in any code that runs, but very few features are supported. These vector types define very few direct methods; most operations other than creation use the processor specific hardware intrinsics apis.
- The variable length `Vector<T>`. This represents vector data of runtime-determined length. In any given process the length of a `Vector<T>` is the same in all methods, but this length may differ between various machines or environment variable settings read at startup of the process. The `T` type variable may be the following types (`System.Byte`, `System.SByte`, `System.Int16`, `System.UInt16`, `System.Int32`, `System.UInt32`, `System.Int64`, `System.UInt64`, `System.Single`, and `System.Double`), and allows use of integer or double data within a vector. The length and alignment of `Vector<T>` is unknown to the developer at compile time (although discoverable at runtime by using the `Vector<T>.Count` api), and `Vector<T>` may not exist in any interop signature. Operations on these vector types are supported on all architectures and platforms, although some architectures may optimize various operations if the `Vector<T>.IsHardwareAccelerated` api.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a "returns true at the end, I think.

- Use of platform specific hardware intrinsics causing runtime jit can be mitigated by careful choice of which intrinsics are used.
- Use of `Vector<T>` causes runtime jit and startup time concerns. Current analysis indicates this is acceptable, but it is a perennial concern for applications with tight startup time requirements.
- Ahead of time generated code which could take advantage of more advanced hardware support experiences a performance penalty until rejitted. (If a customer chooses to disable tiered compilation, then customer code may always run slowly).
- Some use of platform specific hardware intrinsics will force the compiler to be unable to AOT compile the code. However, if care is taken to only use intrinsics from the Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Popcnt, Pclmulqdq, or Lzcnt instruction sets, then the code may be AOT compiled. Preventing AOT compilation may cause a startup time penalty for important scenarios.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some use => Some uses?

@davidwrighton davidwrighton merged commit 5ac25ac into dotnet:master Apr 3, 2020
@CarolEidt
Copy link
Contributor

Congratulations @davidwrighton on getting this completed!

@@ -2938,9 +2930,27 @@ private uint getJitFlags(ref CORJIT_FLAGS flags, uint sizeInBytes)
return (uint)sizeof(CORJIT_FLAGS);
}


InstructionSetFlags _actualInstructionSetSupported;
InstructionSetFlags _actualInstructionSetUnsupported;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidwrighton I was digging through this over the weekend and this will always have the default value of 0. Should we actually set it somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh. Yes. Just below the !supportEnabled path should be setting this.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
@davidwrighton davidwrighton deleted the instruction_set_support branch April 20, 2021 17:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants