API Proposal: Add Intel hardware intrinsic functions and namespace #23057
Overall I love this proposal. I do have a few questions/comments:
I think this can be a property, as it is in Does this take the type of T into account? For example, will What about formats that may be supported on some processors, but not others? As an example, lets say there is instruction set X which only supports
My concern here is the target layering for each component. I would hope that Would it be better here to either have the conversion operators on
I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.
For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.
I didn't see any examples of scalar floating-point APIs ( |
Looks good and in line with the suggestions I have made. The only thing that probably does not resonate with me (maybe because we deal with pointers on a regular basis in our codebase) is having to use |
Going to tag @jaredpar here explicitly. We should get a formal proposal up. I think that we can do this without language support (@jaredpar, tell me if I'm crazy here) if the compiler can recognize something like Having a new recognized keyword ( |
Thanks for posting this @fiigii. I'm very eager to hear everyone's thoughts on the design.
One thing that came up in a recent discussion is that some immediate operands have stricter constraints than just "must be constant". The examples given use a
EDIT: This warning is emitted in a VC++ project if you use the above code. Now, this may not be a problem for this particular example (I'm not familiar with its exact semantics), but it's something to keep in mind. There were also other, more esoteric examples given, like an immediate operand which must be a power of two, or which satisfies some other obscure relation to the other operands. These constraints will be much more difficult, most likely impossible, to enforce at the C# level. The "const" enforcement feels more reasonable and achievable, and seems to cover most instances of the problem.
I'll echo what @tannergooding said -- I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls. |
💭 Most of my initial thoughts go to the use of pointers in a few places. Knowing what we know about ref structs and ❓ In the following code, would the generic method actually be expanded to each of the forms allowed by the processor, or would it be defined in code as a generic?
|
❓ If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions? If we choose the former, would it make sense to rename |
Personally, I am fine with the unsafe code. I don't believe this is meant to be a feature that app designers use and is instead meant to be something framework designers use to squeeze extra performance and also to simplify overhead on the JIT. People using intrinsics are likely already doing a bunch of unsafe things already and this just makes it more explicit.
The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed. I am of the opinion that all of these methods should be declared as This will help ensure the consumer is aware of the underlying platforms they are targeting and that they are writing code that is "suited" to the underlying hardware (running vectorized algorithms on hardware without vectorization support can cause performance degradation). |
These are the raw CPU platform intrinsics e.g. Assuming the detection is branch eliminated; it should be easy to build a library on top that then does software fallbacks, which can be iterated on (either coreclr/corefx or 3rd party) |
I am not against unsafe code. However, given the choice between safe code and unsafe code that perform the same, I would choose the former.
The biggest advantage of this is that the runtime can avoid shipping software fallback code that never needs to execute. The biggest disadvantage of this is that test environments for the various possibilities are not easy to come by. Fallbacks provide a functionality safety net in case something gets missed. |
@sharwell, what possibilities are you envisioning? The way these are currently proposed, the user would write:

public static double Cos(double x)
{
if (x86.FMA3.IsSupported)
{
// Do FMA3
}
else if (x86.SSE2.IsSupported)
{
// Do SSE2
}
else if (Arm.Neon.IsSupported)
{
// Do ARM
}
else
{
// Do software fallback
}
}

Under this, the only way a user is faulted is if they write a bad algorithm or if they forget to provide any kind of software fallback (and an analyzer to detect this should be fairly trivial). |
I would rephrase @tannergooding thought into: "running vectorized algorithms on hardware without vectorization support will with utmost certainty cause performance degradation." |
@tannergooding We defined an individual class for each instruction set (except SSE and SSE2) but put certain small classes into the same source file:

// Other.cs
namespace System.Runtime.CompilerServices.Intrinsics.X86
{
public static class LZCNT
{
......
}
public static class POPCNT
{
......
}
public static class BMI1
{
.....
}
public static class BMI2
{
......
}
public static class PCLMULQDQ
{
......
}
public static class AES
{
......
}
} |
I don't think this needs to be true all the time. In some cases, the AOT can drop the check altogether, depending on the target operating system (Win8 and above require SSE and SSE2 support, for example). In other cases, the AOT can/should drop the check from each method and should instead aggregate them into a single check at the highest entry point. Ideally, the AOT would run CPUID once during startup and cache the results as globals (honestly, if the AOT didn't do this, I would log a bug). The |
The implication would be from a usage perspective: For Jit we could be quite granular on the checks as they are no-cost branch eliminated. For AOT we'd need to be quite coarse on the checks and perform them at algorithm or library level, to offset the cost of CPUID; which may push it much higher than intended e.g. you wouldn't use a vectorized IndexOf unless your strings were huge, because CPUID would dominate. Probably could still cache on AOT at startup, so it would set the property; it wouldn't branch eliminate, but would be fairly low cost? |
@tannergooding @mellinoe The current design intent of class

public static class SSE
{
// __m128 _mm_add_ps (__m128 a, __m128 b)
public static Vector128<float> Add(Vector128<float> left, Vector128<float> right);
}
public static class SSE2
{
// __m128i _mm_add_epi8 (__m128i a, __m128i b)
public static Vector128<byte> Add(Vector128<byte> left, Vector128<byte> right);
public static Vector128<sbyte> Add(Vector128<sbyte> left, Vector128<sbyte> right);
// __m128i _mm_add_epi16 (__m128i a, __m128i b)
public static Vector128<short> Add(Vector128<short> left, Vector128<short> right);
public static Vector128<ushort> Add(Vector128<ushort> left, Vector128<ushort> right);
// __m128i _mm_add_epi32 (__m128i a, __m128i b)
public static Vector128<int> Add(Vector128<int> left, Vector128<int> right);
public static Vector128<uint> Add(Vector128<uint> left, Vector128<uint> right);
// __m128i _mm_add_epi64 (__m128i a, __m128i b)
public static Vector128<long> Add(Vector128<long> left, Vector128<long> right);
public static Vector128<ulong> Add(Vector128<ulong> left, Vector128<ulong> right);
// __m128d _mm_add_pd (__m128d a, __m128d b)
public static Vector128<double> Add(Vector128<double> left, Vector128<double> right);
}

Comparing to Although the current design (class SSE2 including SSE and SSE2 intrinsics) hurts API consistency, there is a trade-off between design consistency and user experience, which is worth discussing. |
Rather than |
Very excited we are finally seeing a proposal for this. My initial thoughts below. AVX-512 is missing, probably since it is not that widespread yet, but I think it would be good to at least give this some thought and how to structure these, because the AVX-512 feature set is very fragmented. In this case I would assume we need to have a class for each set (see https://en.wikipedia.org/wiki/AVX-512):

public static class AVX512F {} // Foundation
public static class AVX512CD {} // Conflict Detection
public static class AVX512ER {} // Exponential and Reciprocal
public static class AVX512PF {} // Prefetch Instructions
public static class AVX512BW {} // Byte and Word
public static class AVX512DQ {} // Doubleword and Quadword
public static class AVX512VL {} // Vector Length
public static class AVX512IFMA {} // Integer Fused Multiply Add (Future)
public static class AVX512VBMI {} // Vector Byte Manipulation Instructions (Future)
public static class AVX5124VNNIW {} // Vector Neural Network Instructions Word variable precision (Future)
public static class AVX5124FMAPS {} // Fused Multiply Accumulation Packed Single precision (Future)

and add a Some of these can have a huge impact for deep learning, sorting etc. Regarding

[Intrinsic]
public static unsafe Vector256<sbyte> Load(sbyte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> LoadSByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> Load(ref sbyte mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(byte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> LoadByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(ref byte mem) { throw new NotImplementedException(); }
// Etc.

The most important thing here is if |
It's great we are discussing a concrete proposal right now. 😄
Intrinsics

Namespace and assembly

I would propose to move intrinsics to a separate namespace located relatively high in the hierarchy, and each platform's specific code into a separate assembly.
The reason for the above division is the large API area for every instruction set, i.e. AVX-512 will be represented by more than 2,000 intrinsics in the MSVC compiler. The same will soon be true for ARM SVE (see below). The size of the assembly, due to the string content alone, won't be small.

Register sizes (currently XMM, YMM, ZMM - 128, 256, 512 bits in x86)

Current implementations support a limited set of register sizes:
However, ARM recently published: ARM SVE - Scalable Vector Extensions see: The Scalable Vector Extension (SVE), for ARMv8-A published on 31 March 2017 with status Non-Confidential Beta. This specification is quite important as it introduces new register sizes - altogether there are 16 register sizes which are multiples of 128 bits. Details are on page 21 of the specification (table is below).
It would be necessary to design an API capable of supporting, in the near future, 16 different register sizes and several thousands (or tens of thousands) of opcodes/functions (counting overloads). Predictions that we would not have 2048-bit SIMD instructions within a couple of years seem to have been falsified, to everyone's surprise, by ARM this year. Looking at history (ARM published the public beta of the ARMv8 ISA on 04 September 2013, and the first processor implementing it was available to users globally in October 2014 - the Samsung Galaxy Note 4), I would expect the first silicon with SVE extensions to be available in 2018. This would most probably be very close in time to the public availability of .NET SIMD intrinsics. I would like to propose:

Vectors

Implement basic Vectors supporting all register sizes in System.CoreLib.Private.

namespace System.Numerics
{
[StructLayout(LayoutKind.Explicit)]
public unsafe struct Register128
{
[FieldOffset(0)]
public fixed byte Data[16];
.....
// accessors for other types
}
// ....
[StructLayout(LayoutKind.Explicit)]
public unsafe struct Register2048
{
[FieldOffset(0)]
public fixed byte Data[256];
.....
// accessors for other types
}
public struct Vector<T, R> where T : struct where R : struct
{
}
public struct Vector128<T> : Vector<T, Register128>
{
}
// ....
public struct Vector2048<T> : Vector<T, Register2048>
{
}
}

System.Numerics

All safe APIs would be exposed via Vector and VectorXXX structures and implemented with support of intrinsics.

System.Intrinsics

All vector APIs will use System.Numerics.VectorXXX.

public static Vector128<byte> MultiplyHigh<Vector128<byte>>(Vector128<byte> value1, Vector128<byte> value2);
public static Vector128<byte> MultiplyLow<Vector128<byte>>(Vector128<byte> value1, Vector128<byte> value2);

Intrinsics APIs will be placed in separate classes according to the functionality-detection patterns provided by processors. In the case of the x86 ISA this would be a one-to-one correspondence between CPUID detection and supported functions. This would allow an easy-to-understand programming pattern where one uses functions from a given group in a way consistent with platform support. The main reason for that kind of division is a requirement set by silicon manufacturers to use instructions only if they are detected in hardware. This allows, for example, shipping a processor with a support matrix comprising SSE3 but not SSSE3, or comprising PCLMULQDQ and SHA but not AESNI. This direct class-to-hardware-support-detection correspondence is the only safe way of having IsHardwareSupported detection and being compliant with Intel/AMD instruction usage restrictions. Otherwise the kernel will have to catch the #UD exception for us 😸

Mapping APIs to C/C++ intrinsics or to ISA opcodes

Intrinsics usually abstract ISA opcodes in a 1-to-1 way, however there are some intrinsics which map to several instructions. I would prefer to abstract opcodes (using nice names) and implement multi-opcode intrinsics as functions on VectorXxx. |
The best place would be System.Numerics.VectorXxx<T> |
Is the platform agnostic |
@jkotas I had the same thought, how do those tie in with Or could we add |
Combining SSE and SSE2 into a single class looks like a good trade-off for a simpler Like @redknightlois and @nietras, I'm also concerned about the Load/Store API.

[Intrinsic]
public static extern unsafe Vector256<T> Load<T>(void* mem) where T : struct;
[Intrinsic]
public static extern unsafe Vector256<sbyte> Load(sbyte* mem);
[Intrinsic]
public static extern Vector256<sbyte> Load(ref sbyte mem);
[Intrinsic]
public static extern unsafe Vector256<byte> Load(byte* mem);
[Intrinsic]
public static extern Vector256<byte> Load(ref byte mem);
// Etc.

Looking forward to using |
@4creators, I am vehemently against moving this feature higher in the hierarchy. For starters, the runtime itself has to support any and all intrinsics (including the strings to identify them, etc) regardless of where we put them in the hierarchy. If the runtime doesn't support them, then you can't use them. I also want to be able to consume these intrinsics from all layers of the stack, including |
I do not think that intrinsics should abstract anything - instead a simple add can be created on Vector128 - Vector2048. On the other hand, it would be openly against Intel usage recommendations. |
@tannergooding Agree that it has to be available from System.Private.CoreLib. However, that doesn't mean it has to be low in the hierarchy. No one will ship a runtime (vm, gc, jit) which supports all intrinsics for all architectures. The division line goes through the ISA plane - x86, Arm, Power. There is no reason to ship ARM intrinsics on an x86 runtime. Having it in a separate platform assembly in coreclr which could be referenced (circularly) by System.Private.CoreLib could be a solution (I think that is a bit better than ifdefing everything). |
@fiigii, why does separating these out mean we lose the generic signature? The way I see it, we have two options:
I see no reason why we can't have That being said, I personally prefer the enforced form that requires additional APIs to be listed. Not only does this help enforce that the user is passing the right things down to the API, but it also decreases the number of checks the JIT must do.
|
I could get behind this. The caveats being that:
|
@jkotas, I think the primary difference is that |
Circular references are non-starter. The existing solution for this problem is to have a subset required by CoreLib in CoreLib as internal, and the full blown (duplicate) implementation in separate assembly. Though, it is questionable whether this duplication in the sake of layering is really worth it. Another thought about naming. The runtime/codegen has many intrinsics today all over the place, for example methods on Should the namespace name be more specific to capture what actually goes into it, say |
@nietras I am working on building the intrinsic API source code into |
Update: replace |
Hi all, the intrinsic API source code has been submitted as PR dotnet/corefx#23489. |
Where are the scalar versions? They are needed to implement trigonometric operations, for example... |
@fanoI This proposal does not include the hardware instructions that have been covered by current RyuJIT codegen. |
Are sqrtss and sqrtsd, which are for example required to implement Math.Sqrt(), generated by the current RyuJIT codegen? I had understood they were part of this proposal too... How would you do this otherwise: https://dtosoftware.wordpress.com/2013/01/07/fast-sin-and-cos-functions/ ? |
@fiigii, I thought they were discussed briefly during the API review? In either case, if they are not part of this proposal, I would like to know so I can submit a new proposal covering them. Providing scalar instructions will be required for implementing certain algorithms (such as implementing |
Doesn't RyuJIT already generate It also looks like |
@saucecontrol, RyuJIT treats both However, the remaining |
@fiigii If you think about it, it would actually make sense to eventually get rid of most of what RyuJIT has to do itself. If the surface is covered, you don't need to handle case-by-case with all the complexity inside the JIT code. |
Created a proposal explicitly covering the scalar overloads: https://github.com/dotnet/corefx/issues/23519 |
This API looks really great. There is one thing I would like to see changed, however: could we support a software fallback? I know I'm very late to this discussion, but I'd like to make the case for this. I can see the momentum in this discussion is for not having any software fallback, but I don't see any in-depth discussion of the pros and cons of this. For pros, I see someone mention that any code that runs a would-be software fallback mode would be a performance bug, and it would be better to crash in that scenario for easier debugging. That's certainly true, but I would argue that this is not a very .NET way of doing things. Many aspects of .NET have graceful performance degradation in place of throwing exceptions, and I'm sure it's written somewhere that this is part of the .NET philosophy. Better for code to run slower than to crash outright, unless the programmer specifies this is what they want to happen. This is something I like about the old Vector API. I think the argument for no software fallback is partly based on the fact that the audience for this API is pretty low-level developers who are used to using SIMD extensions from C++ and assembler and whatnot, and having the code crash outright when the real instruction sets are not available is a more comfortable development environment for them. And while I believe this will be true for 98% of developers who use this API, I don't think we should forget the more typical .NET developer and assume they would never want to explore this stuff to see if it could benefit them. In general, I think it's a mistake to design an API like this and assume only a certain type of developer will want to use it, especially something baked into .NET. Here are some of the pros I believe a software fallback would provide:
In general, I think a software fallback would provide little if any disadvantage to developers who feel they would not benefit from it, but at the same time would make the API much more accessible to regular .NET developers. I don't expect this to change given how late I am to this, but I thought I'd at least put my thoughts on the record here. |
I agree that having software fallback capability would be nice. However, given that it is just a nice-to-have feature and can also be implemented by individual developers on a need-to-have basis, or as a third-party library, I think it should be placed towards the bottom of the to-do list. I would rather see that energy being directed towards having full AVX-512 support which is already available on server-grade CPUs for a while and on its way to consumer CPUs. |
Ping on AVX512 news? |
We still have some ISAs to implement before the already accepted APIs are finished - some AVX2 intrinsics and the whole of AES, BMI1, BMI2, FMA, PCLMULQDQ. My expectation is that after this work is finished and the implementation has stabilized we will start working on AVX512. However, in the meantime we still have a lot to do with the Arm64 implementations. @fiigii could probably provide more info on future plans. |
The current thinking around the implementation of hardware intrinsics is that we provide low-level intrinsics which allow assembly-like programming, plus several helper intrinsics which should make a developer's life easier. An implementation which provides more abstraction and software fallback is partially available already in The above, however, is the personal view of a community member. |
After finishing these APIs (e.g., AVX2, FMA, etc.), I think we have to investigate more potential performance issues (e.g., calling convention, data alignment) before we move to the next step, because these issues may blow up with wider SIMD architectures. Meanwhile, I would prefer to improve/refactor the JIT backend (emitter, codegen, etc.) implementation before extending it to AVX-512. Yes, we definitely need to extend this plan to AVX-512 in the future, but for now it is better to focus on enhancing the 128/256-bit intrinsics. |
Personally I don't see software fallback as worth spending developer effort on, as consumers can easily implement a software fallback themselves if they want one, and besides, it works better at the algorithm level than at the individual-intrinsic level. Actually implementing all the dozens of intrinsics that exist out there for all targeted platforms is not something consumers can do themselves, so I personally would give that higher priority. Great stuff by the way, I'm very much looking forward to having all these intrinsics available. |
Minor API enhancement suggestion from my side: Add The implementation could be something that looks like The reason for this proposal is - when the generic type argument is known beforehand (e.g. concrete type like |
Q. Is there any chance this functionality will ever be available in .NET Standard? I maintain a code library/nuget that would benefit from using these hardware intrinsics, but it currently targets .NET Standard to provide good portability. Ideally I'd like to continue to offer portability, but also improved performance if the runtime platform/environment provides these intrinsics. Right now it seems my choice is either speed or portability, but not both - is this likely to change in the future? |
@colgreen This was discussed in https://github.com/dotnet/corefx/issues/24346. I recommend moving the discussion there. |
Where can we find documentation on how to use intrinsics? I can see that the |
@aaronfranke I don't know how much you know already, but it is my understanding that the IntrinsicAttribute is applied to a few methods where the compiler is supposed to replace the usual instructions with specially generated instructions. This is only possible since you distribute IL code to run on different platforms, and if a platform has the popcnt instruction, it is handled as a special case. My example is not a real one, but you can probably find real examples over at CoreCLR if you search for usages of IntrinsicAttribute. |
https://devblogs.microsoft.com/dotnet/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/ describes good examples of real hardware intrinsic uses. |
If you already have SIMD or low-level programming experience in C/C++, the comment of API source would be sufficient. |
This proposal adds intrinsics that allow programmers to use managed code (C#) to leverage Intel® SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, LZCNT, POPCNT, BMI1/2, PCLMULQDQ, and AES instructions.
Rationale and Proposed API
Vector Types
Currently, .NET provides System.Numerics.Vector<T> and related intrinsic functions as a cross-platform SIMD interface that automatically matches proper hardware support at JIT-compile time (e.g. Vector<T> is 128 bits in size on SSE2 machines or 256 bits on AVX2 machines). However, there is no way to simultaneously use different sizes of Vector<T>, which limits the flexibility of SIMD intrinsics. For example, on AVX2 machines, XMM registers are not accessible from Vector<T>, but certain instructions have to work on XMM registers (i.e. SSE4.2). Consequently, this proposal introduces Vector128<T> and Vector256<T> in a new namespace System.Runtime.Intrinsics.

This namespace is platform agnostic, and other hardware could provide intrinsics that operate over these types. For instance, Vector128<T> could be implemented as an abstraction of XMM registers on SSE-capable processors or as an abstraction of Q registers on NEON-capable processors. Meanwhile, other types may be added in the future to support newer SIMD architectures (i.e. adding 512-bit vector and mask vector types for AVX-512).

Intrinsic Functions
The current design of System.Numerics.Vector abstracts away the specifics of processor details. While this approach works well in many cases, developers may not be able to take full advantage of the underlying hardware. Intrinsic functions allow developers to access the full capability of the processors on which their programs run.

One of the design goals of the intrinsics APIs is to provide one-to-one correspondence to Intel C/C++ intrinsics. That way, programmers already familiar with C/C++ intrinsics can easily leverage their existing skills. Another advantage of this approach is that we leverage the existing body of documentation and sample code written for C/C++ intrinsics.

Intrinsic functions that manipulate Vector128/256<T> will be placed in a platform-specific namespace System.Runtime.Intrinsics.X86. Intrinsic APIs will be separated into several static classes based on the instruction sets they belong to. Some of the intrinsics benefit from C# generics and get simpler APIs:

Each instruction set class contains an IsSupported property which indicates whether the underlying hardware supports the instruction set. Programmers use these properties to ensure that their code can run on any hardware via platform-specific code paths. For JIT compilation, the results of capability checking are JIT-time constants, so dead code paths for the current platform will be eliminated by the JIT compiler (conditional constant propagation). For AOT compilation, the compiler/runtime executes the CPUID checking to identify the corresponding instruction sets. Additionally, the intrinsics do not provide software fallbacks, and calling the intrinsics on machines that have no corresponding instruction sets will cause a PlatformNotSupportedException at runtime. Consequently, we always recommend developers provide a software fallback to keep the program portable. A common pattern of platform-specific code paths with a software fallback looks like below.

The scope of this API proposal is not limited to SIMD (vector) intrinsics; it also includes scalar intrinsics that operate over scalar types (e.g. int, short, long, or float) from the instruction sets mentioned above. As an example, the following code segment shows Crc32 intrinsic functions from the Sse42 class.

Intended Audience
The intrinsics APIs bring the power and flexibility of accessing hardware instructions directly from C# programs. However, this power and flexibility means that developers have to be cognizant of how these APIs are used. In addition to ensuring that their program logic is correct, developers must also ensure that the use of the underlying intrinsic APIs is valid in the context of their operations.
For example, developers who use certain hardware intrinsics should be aware of their data alignment requirements. Both aligned and unaligned memory load and store intrinsics are provided, and if aligned loads and stores are desired, developers must ensure that the data are aligned appropriately. The following code snippet shows the different flavors of load and store intrinsics proposed:
IMM Operands
Most of the intrinsics can be directly ported to C# from C/C++, but certain instructions that require immediate parameters (i.e. imm8) as operands deserve additional consideration, such as pshufd, vcmpps, etc. C/C++ compilers treat these intrinsics specially and throw compile-time errors when non-constant values are passed into immediate parameters. Therefore, CoreCLR also requires the immediate-argument guard from the C# compiler. We suggest adding a new "compiler feature" to Roslyn which places a const constraint on function parameters. Roslyn could then ensure that these functions are invoked with "literal" values on the const formal parameters.

Semantics and Usage
The semantics are straightforward if users are already familiar with Intel C/C++ intrinsics. Existing SIMD programs and algorithms that are implemented in C/C++ can be directly ported to C#. Moreover, compared to System.Numerics.Vector<T>, these intrinsics leverage the full power of Intel SIMD instructions and do not depend on other modules (e.g. Unsafe) in high-performance environments.

For example, SoA (structure of arrays) is a more efficient pattern than AoS (array of structures) in SIMD programming. However, it requires dense shuffle sequences to convert the data source (usually stored in AoS format), which is not provided by Vector<T>. Using Vector256<T> with AVX shuffle instructions (including shuffle, insert, extract, etc.) can lead to higher throughput.

Furthermore, conditional code is enabled in vectorized programs. Conditional paths are ubiquitous in scalar programs (if-else), but they require specific SIMD instructions in vectorized programs, such as compare, blend, or andnot.

As previously stated, traditional scalar algorithms can be accelerated as well. For example, CRC32 is natively supported on SSE4.2 CPUs.
Implementation Roadmap
Implementing all the intrinsics in JIT is a large-scale and long-term project, so the current plan is to initially implement a subset of them with unit tests, code quality test, and benchmarks.
The first step in the implementation would involve infrastructure-related items: wiring up the basic components, including but not limited to the internal data representations of Vector128<T> and Vector256<T>, intrinsics recognition, hardware support checking, and external support from Roslyn/CoreFX. Next steps would involve implementing subsets of intrinsics in classes representing different instruction sets.

Complete API Design
Add Intel hardware intrinsic APIs to CoreFX dotnet/corefx#23489
Add Intel hardware intrinsic API implementation to mscorlib dotnet/corefx#13576
Update
08/17/2017
- Renamed System.Runtime.CompilerServices.Intrinsics to System.Runtime.Intrinsics, and System.Runtime.CompilerServices.Intrinsics.X86 to System.Runtime.Intrinsics.X86.
- Avx instead of AVX.
- address instead of mem.
- IsSupport as properties.
- Added Span<T> overloads to the most common memory-access intrinsics (Load, Store, Broadcast), but left other alignment-aware or performance-sensitive intrinsics with the original pointer versions.
- Clarified the Sse2 class design and separated small classes (e.g., Aes, Lzcnt, etc.) into individual source files (e.g., Aes.cs, Lzcnt.cs, etc.).
- Renamed CompareVector* to Compare and got rid of the Compare prefix from FloatComparisonMode.

08/22/2017
- Replaced Span<T> overloads by ref T overloads.

09/01/2017

12/21/2018