
Investigate moving the types in System.Numerics.Vectors to use Hardware Intrinsics #956

Closed
tannergooding opened this issue Jul 27, 2018 · 47 comments
Labels: area-System.Numerics · blocked (Issue/PR is blocked on something - see comments) · tenet-performance (Performance related issue)
Milestone: Future

@tannergooding
Member

Today, hardware acceleration of certain types in the System.Numerics.Vectors project is achieved via [Intrinsic] attributes and corresponding runtime support. There are a few downsides to this approach:

  • Minor tweaks to the backing implementation require shipping a new runtime
  • It is not obvious that the code has a hardware accelerated path (outside reading documentation)
  • Many of the types (such as the Matrix types) are not directly hardware accelerated

In netcoreapp3.0, the new Hardware Intrinsics feature is slated to ship. This feature also enables hardware acceleration, but at a much more fine-grained level (APIs typically have a 1-to-1 mapping with the underlying instructions).

We should investigate moving the types in System.Numerics.Vectors to use hardware intrinsics as this has multiple potential benefits:

  • The hardware acceleration is still tied to the runtime, but minor tweaks can be made without having to ship a new runtime
  • It becomes obvious which code has a hardware accelerated path, as does the code that will be generated for a given platform/CPU
  • It will become much easier to add hardware acceleration support to types currently missing it (such as Matrix4x4)
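
To illustrate the shape of the change, here is a minimal sketch (the helper name and fallback are illustrative assumptions, not the proposed API) of the pattern hardware intrinsics enable: the accelerated path and its software fallback both live in managed code, so minor tweaks no longer require shipping a new runtime.

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SqrtSketch
{
    // Element-wise square root of four floats, with an explicit software fallback.
    public static Vector128<float> Sqrt(Vector128<float> value)
    {
        if (Sse.IsSupported)
        {
            return Sse.Sqrt(value); // maps 1-to-1 to the sqrtps instruction
        }

        // Software fallback for platforms without SSE support.
        Vector128<float> result = default;
        for (int i = 0; i < Vector128<float>.Count; i++)
        {
            result = result.WithElement(i, MathF.Sqrt(value.GetElement(i)));
        }
        return result;
    }
}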
@tannergooding
Member Author

I did some basic/preliminary investigation a while back and the numbers looked promising: https://github.com/dotnet/corefx/issues/25386#issuecomment-361137982

@tannergooding
Member Author

I am marking this up for grabs in case someone from the community wants to work on this. If you want to take a stab, feel free to say so and I can assign the issue out.

The "Microsoft DirectX Math" library is open-source, MIT-licensed, and has existing HWIntrinsic accelerated code for many of these APIs (but for Native code): https://github.com/microsoft/directxmath. This would likely be a good starting point as the algorithms have been around for a long time, have been in active use in production code, and are known to generally be correct and performant.

  • NOTE: In some cases the C# algorithm and the DirectX algorithm differ slightly; we should make note of these cases and determine the best course of action.

@tannergooding
Member Author

FYI. @eerhardt, @danmosemsft, @CarolEidt, @fiigii

@tannergooding
Member Author

Also CC. @benaadams, who I know has looked at vectorizing/improving some of these types previously

@danmoseley
Member

@hughbe in case he is interested in a meaty task.

@fiigii
Contributor

fiigii commented Jul 27, 2018

@tannergooding Thanks for opening this issue. I agree that porting S.N.Vectors to managed code using HW intrinsics would be much easier to maintain. But before that, we may need to address some differences between Vector128/256<T> and Vector2/3/4. For example,

  1. We should have an internal API to convert Vector2/3/4 from/to Vector128/256<T>
  2. Vector128/256<T> are immutable types, but Vector3/4 are mutable (their fields can be assigned). How can we resolve this difference? Or should the implementation of some operations remain in the runtime?
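
For point 1, a hypothetical internal conversion helper (the names here are illustrative; nothing like this exists yet) could be as thin as a reinterpreting cast for the 16-byte case, while Vector2/Vector3 would need partial loads instead:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

internal static class VectorConvertSketch
{
    // Vector4 and Vector128<float> share the same 16-byte layout, so a
    // reinterpreting cast is sufficient here.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> AsVector128(Vector4 value) =>
        Unsafe.As<Vector4, Vector128<float>>(ref value);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4 AsVector4(Vector128<float> value) =>
        Unsafe.As<Vector128<float>, Vector4>(ref value);
}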

@fiigii
Contributor

fiigii commented Jul 27, 2018

The "Microsoft DirectX Math" library is open-source,

@tannergooding Would you like to propose a "VectorMath" library for Vector128/256<T>?

@4creators
Contributor

IMHO we should take the opportunity when rewriting the Vector<T> implementation to:

  1. Use the nearly complete implementation of Intel HW intrinsics (up to AVX2) to expand the Vector API surface with several missing operations that were requested by the community or that become possible thanks to the HW intrinsics implementation (this would require an API proposal, discussion, and approval)

  2. Provide Arm64 HW intrinsics support where possible

cc @CarolEidt @danmosemsft @eerhardt @fiigii @tannergooding

@tannergooding
Member Author

We should have an internal API to convert Vector2/3/4/ from/to Vector128/256

I don't believe this is needed. You should be able to load a Vector2/3/4 into a Vector128<T>, perform all operations, and then convert back to the return type trivially and without much (if any) overhead.

Users will also have their own vector/numeric types, so we should focus on making sure that getting a Vector128<T> from a user defined struct is fast/efficient in general.

Vector128/256 are immutable types, but Vector3/4 are mutable (fields of Vector3/4 can be assigned), how can we solve this difference? Or remain the implementation of some operations in the runtime?

I'm not sure why you think this is an issue. Could you elaborate?

@tannergooding
Member Author

@tannergooding Would you like to propose a "VectorMath" library for Vector128/256?

There will likely need to be some minimal helper library that operates solely on Vector128<T> that the Vector2/3/4 code would call into. This would also be useful for other places in the framework that could take advantage of this.

It would need to be internal at first, and discussions on making it public could happen later.
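
As a rough sketch of what that internal helper layer might look like (the class and method names are assumptions, and callers are expected to have already checked IsSupported):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class Vector128MathSketch
{
    // Dot product of all four lanes, broadcast to every lane (dpps with an
    // all-ones control mask); Length, LengthSquared, and Normalize can all reuse it.
    public static Vector128<float> DotBroadcast(Vector128<float> left, Vector128<float> right)
        => Sse41.DotProduct(left, right, 0xFF);

    public static Vector128<float> Normalize(Vector128<float> value)
        => Sse.Divide(value, Sse.Sqrt(DotBroadcast(value, value)));
}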

@tannergooding
Member Author

IMHO we should take the opportunity when rewriting Vector implementation to:

Any port should be a two/three step process:

  1. Do a clean port of the existing code and make sure it is equally performant
  2. Do a rewrite of the existing APIs to make improvements, where possible
  3. Add new APIs, as required, and implement accordingly

Doing them together makes it harder to review the code and track the changes.

ARM/ARM64 support should definitely happen, but only after the APIs have been reviewed/approved/implemented.

@fiigii
Contributor

fiigii commented Jul 27, 2018

should be able to load a Vector2/3/4 into a Vector128

That would be great if the runtime can eliminate the round trip. If not, the memory load/store may have considerable overhead when we have __vectorcall in the future.

Users will also have their own vector/numeric types, so we should focus on making sure that getting a Vector128 from a user defined struct is fast/efficient in general.

Agree! Related to https://github.com/dotnet/coreclr/issues/19116

@tannergooding
Member Author

That would be great if the runtime can eliminate the round trip.

I would certainly hope that we can.

From my view, anything that would prevent us from moving the System.Numerics.Vectors project to use exclusively HWIntrinsics (outside the dynamic Vector<T> sizing feature) is a potential perf problem for users wanting to use HWIntrinsics themselves. So we want to get those types of things flagged, investigated, fixed, etc.

@saucecontrol
Member

anything that would prevent us from moving the System.Numerics.Vectors project to use exclusively HWIntrinsics (outside the dynamic Vector sizing feature) is a potential perf problem for users wanting to use HWIntrinsics themselves

This seems like a good enough excuse for me to take this on if nobody else is interested 😉

I'm at a bit of a loss how to start, though. Even something as trivial as operator + on Vector2 is twisting my brain. As an intrinsic, the JIT knows whether the struct is currently in an xmm register and can emit just an addps. If it's not in a register, it loads it with movsd before emitting the addps. Assuming I pin this and take its address so I can explicitly load it with Sse2.LoadScalarVector128(double*), is the JIT containment logic smart enough to know it's already in an xmm register and skip that part? And would it do the same with Vector3, where the load process is movsd + movss + insertps?

The constructors are even more confusing to me since they may be a setup for further SIMD operations or the fields may be immediately accessed. The JIT seems to know what to do about that now, but I can't imagine how managed code could get that right.

As far as the porting process goes, would the goal be to replace all [Intrinsic] methods first and then SIMD-ize the things that aren't using intrinsics today?

@tannergooding
Member Author

tannergooding commented Aug 2, 2018

@saucecontrol, my initial experimentation can be found here: https://github.com/tannergooding/corefx/commit/f8625409692591fc7bc2ad5ac372f7bdbd724576

is the JIT containment logic smart enough to know it's already in an xmm register and skip that part

This is a case where, if the JIT isn't smart enough to do this, it probably should be. Most ABIs have some concept of homogeneous vector aggregate values (structures that can be packed into one or more vector registers), and they should appropriately handle the transition between memory and register where possible.

As far as the porting process goes, would the goal be to replace all [Intrinsic] methods first and then SIMD-ize the things that aren't using intrinsics today?

Yes, I think the initial goal should be to replace the [Intrinsic] methods (since they are actually intrinsic today) and then expand that to also include the other methods.

I would expect that you will want some kind of mini "helper" library that only operates on Vector128<float>. You would then implement the various Vector4 (or Vector2/Vector3) operations by:

  • Loading the Vector2/3/4 arguments into Vector128
  • Calling into the appropriate helper functions to perform the operation
  • Converting the Vector128 back into a Vector2/3/4 and returning

I think this is probably desirable as:

  • Algorithms that operate on Vector128 are generally reusable (including outside System.Numerics)
  • You don't want to have multiple transitions from Vector2/3 to Vector128 and back (as they are not always a cheap conversion, unless already in register)
  • You don't want to duplicate the same logic in multiple places (DotProduct, Length, and LengthSquared are all basically the same)
  • etc
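
Putting those steps together for a single operation, here is a self-contained sketch; the use of Unsafe to reinterpret a Vector4 as a Vector128<float> is an assumption about how the load/convert-back steps could be done, not a settled design:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector4OpsSketch
{
    public static float Dot(Vector4 left, Vector4 right)
    {
        if (Sse41.IsSupported)
        {
            // 1. Load the Vector4 arguments into Vector128<float>.
            Vector128<float> vl = Unsafe.As<Vector4, Vector128<float>>(ref left);
            Vector128<float> vr = Unsafe.As<Vector4, Vector128<float>>(ref right);

            // 2. Perform the operation on Vector128<float> (ideally via a shared helper).
            Vector128<float> dot = Sse41.DotProduct(vl, vr, 0xFF);

            // 3. Convert back to the return type.
            return dot.ToScalar();
        }

        // Scalar fallback matching the existing managed implementation.
        return left.X * right.X + left.Y * right.Y + left.Z * right.Z + left.W * right.W;
    }
}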

@CarolEidt
Contributor

Users will also have their own vector/numeric types, so we should focus on making sure that getting a Vector128 from a user defined struct is fast/efficient in general.

Agree! Related to dotnet/coreclr#19116

dotnet/coreclr#19116 deals with vectors wrapped in a class, which is different from wrapping them in a struct. The former involves escape analysis or stack allocation, while the latter involves appropriate struct promotion and copy elimination. The JIT has room for improvement in all of those areas.

@fiigii
Contributor

fiigii commented Aug 2, 2018

deals with vectors wrapped in a class, which is different from wrapping them in a struct.

Yes, in that issue, I also said

the current struct promotion also does not work with VectorPacket, so changing VectorPacket from class to struct would generate many memory copies and hurt performance.

@CarolEidt Do you think this is a good time to improve the struct promotion for struct-wrapped Vector128/256?

@saucecontrol
Member

Thanks, @tannergooding

I found my way over to your Vector4 prototype, and I'm relieved to find it matches up pretty closely with my mental model. I'll look into this a bit more in the next few days and see if I can commit to getting it done. Depends how much is already in place as far as tests and benchmarks.

@benaadams
Member

Yes, I think the initial goal should be to replace the [Intrinsic] methods (since they are actually intrinsic today) and then expand that to also include the other methods.

Wouldn't it be better to replace the non-intrinsic methods first as that's actually a gain? Or is it to test that the current intrinsic methods match up when using the new intrinsics?

@tannergooding
Member Author

Or is it to test that the current intrinsic methods match up when using the new intrinsics?

@benaadams, right. I think we want to first make sure there is no loss in existing performance and then work on improving general performance on the rest of the functions after that.

@CarolEidt
Contributor

Do you think this is a good time to improve the struct promotion for struct-wrapped Vector128/256?

I think it would be great to do that; I don't think it rises to the top of my priority list, but I would be supportive of others working on it.

@fiigii
Contributor

fiigii commented Aug 2, 2018

@CarolEidt Thanks! Let me give it a try next week (after I finish the Avx2.Gather* intrinsics).

@tannergooding
Member Author

@saucecontrol, thanks! Let me know if you do want to pick this up "more officially" and we can add you as a contributor and assign the issue out 😄.

Note: Being assigned the issue wouldn't mean that you have any obligation to actually complete the work or any kind of deadline for it. It is mostly to communicate that someone is "actively" working on it. If you decide to drop it at a later time, that is completely fine, and we will unassign the issue so that someone else can pick it up.

I would be on point for assisting if you have any questions/concerns or if you just need help (the same goes for anyone else if they decide they want to pick this up instead).

@saucecontrol
Member

@tannergooding, sounds good to me! If you don't mind answering the occasional stupid question on gitter, I should be able to get this done. Let's go ahead and make it official.

@tannergooding
Member Author

Feel free to ask as many questions as needed 😄

You should have gotten an invitation and the issue has been assigned to you.

@saucecontrol
Member

I started coding on this tonight, and I ran into problems pretty quickly. Take the following test for example

public Vector2 AddemUp()
{
    var sum = Vector2.Zero;
    var vecs = ArrayOfVectors;

    for (int i = 0; i < vecs.Length; i++)
        sum += vecs[i];

    return sum;
}

With the SIMD intrinsics enabled, that for loop compiles to the following:

G_M32824_IG03:
       4C63C2               movsxd   r8, edx
       C4A17B1044C010       vmovsd   xmm0, qword ptr [rax+8*r8+16]
       C4E17B104C2428       vmovsd   xmm1, qword ptr [rsp+28H]
       C4E17058C8           vaddps   xmm1, xmm0
       C4E17B114C2428       vmovsd   qword ptr [rsp+28H], xmm1
       FFC2                 inc      edx
       3BCA                 cmp      ecx, edx
       7FDD                 jg       SHORT G_M32824_IG03

Which is not too bad. Would be better if it didn't spill vsum each time around, but it could be worse 😄

Then I started my prototype simply, with only op_Addition, which I defined thusly:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector2 operator +(Vector2 left, Vector2 right)
{
#if HAS_INTRINSICS
    if (Sse2.IsSupported)
    {
        Vector2 vres = default;

        var vleft = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left.X));
        var vright = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right.X));
        Sse2.StoreScalar((double*)&vres.X, Sse.StaticCast<float, double>(Sse.Add(vleft, vright)));

        return vres;
    }
#endif

    return new Vector2(left.X + right.X, left.Y + right.Y);
}

And that produces the following for the same loop:

G_M32825_IG03:
       C4E17A10442438       vmovss   xmm0, dword ptr [rsp+38H]
       C4E17A11442428       vmovss   dword ptr [rsp+28H], xmm0
       C4E17A1044243C       vmovss   xmm0, dword ptr [rsp+3CH]
       C4E17A1144242C       vmovss   dword ptr [rsp+2CH], xmm0
       4C63C2               movsxd   r8, edx
       4E8D44C010           lea      r8, bword ptr [rax+8*r8+16]
       C4C17A1000           vmovss   xmm0, dword ptr [r8]
       C4E17A11442420       vmovss   dword ptr [rsp+20H], xmm0
       C4C17A104004         vmovss   xmm0, dword ptr [r8+4]
       C4E17A11442424       vmovss   dword ptr [rsp+24H], xmm0
       C4E17857C0           vxorps   xmm0, xmm0
       C4E17A11442430       vmovss   dword ptr [rsp+30H], xmm0
       C4E17A11442434       vmovss   dword ptr [rsp+34H], xmm0
       C4E17A11442430       vmovss   dword ptr [rsp+30H], xmm0
       C4E17A11442434       vmovss   dword ptr [rsp+34H], xmm0
       4C8D442428           lea      r8, bword ptr [rsp+28H]
       C4C17B1000           vmovsd   xmm0, xmmword ptr [r8]
       4C8D442420           lea      r8, bword ptr [rsp+20H]
       C4C17B1008           vmovsd   xmm1, xmmword ptr [r8]
       4C8D442430           lea      r8, bword ptr [rsp+30H]
       C4E17858C1           vaddps   xmm0, xmm0, xmm1
       C4C17B1100           vmovsd   xmmword ptr [r8], xmm0
       4C8B442430           mov      r8, qword ptr [rsp+30H]
       4C89442438           mov      qword ptr [rsp+38H], r8
       FFC2                 inc      edx
       3BCA                 cmp      ecx, edx
       0F8F6BFFFFFF         jg       G_M32825_IG03

My movsds and addps came through nicely, but the stack thrashing is beyond my worst expectations.

I'm wondering whether I've done something wrong. In order to disable the Vector2 Intrinsics, I commented out the following in my copy of the JIT

https://github.com/dotnet/coreclr/blob/59ea9c4e1937ea44faddb31dc3f17f8f1001f1d3/src/jit/simd.cpp#L318-L326

Would that have side effects beyond disabling the specific [Intrinsic] methods? Like would that cause the struct to not be recognized as fitting in a SIMD8 when it otherwise would be enregistered? And if so, is there a better way to disable the Intrinsics while I'm testing?

Also, I'm wondering what will have to be done to get the JIT to recognize the equivalent of __vectorcall for those operator arguments. In my previous HWIntrinsics code, I've resorted to passing my vectors by ref, but we have to keep the existing contracts on the Vector types, so that's not an option here.

@CarolEidt
Contributor

@saucecontrol - by commenting out the code that recognizes the Vector2 type, the JIT will treat them as regular structs. A better way would be to remove the intrinsics that you are implementing from hwintrinsiclistxarch.h.

You will run into additional issues with Vector2. Because 8 byte structs are passed in a general purpose register on x64/Windows, the JIT doesn't enregister them. This is excessively conservative (it should just move them as/when needed), and is a "TODO" item to fix.

@tannergooding
Member Author

@CarolEidt, in this case I think it would need to be cut from simdintrinsiclist.h; hwintrinsiclistxarch.h is for the HWIntrinsics, not for System.Numerics.Vector2.

@saucecontrol, you will also get slightly better codegen if you don't initialize vres (which cuts out 22 bytes).

Vector2 vres;

var vleft = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left.X));
var vright = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right.X));
Sse2.StoreScalar((double*)&vres, Sse.StaticCast<float, double>(Sse.Add(vleft, vright)));

return vres;

which produces:

G_M6507_IG03:
       vmovsd   xmm0, qword ptr [rsp+20H]
       vmovsd   qword ptr [rsp+18H], xmm0
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+8*r8+16]
       vmovss   xmm0, dword ptr [r8]
       vmovss   dword ptr [rsp+10H], xmm0
       vmovss   xmm0, dword ptr [r8+4]
       vmovss   dword ptr [rsp+14H], xmm0
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+08H], xmm0
       vmovss   dword ptr [rsp+0CH], xmm0
       lea      r8, bword ptr [rsp+18H]
       vmovsd   xmm0, xmmword ptr [r8]
       lea      r8, bword ptr [rsp+10H]
       vmovsd   xmm1, xmmword ptr [r8]
       vaddps   xmm0, xmm0, xmm1
       lea      r8, bword ptr [rsp+08H]
       vmovsd   xmmword ptr [r8], xmm0
       vmovsd   xmm0, qword ptr [rsp+08H]
       vmovsd   qword ptr [rsp+20H], xmm0
       inc      eax
       cmp      ecx, eax
       jg       SHORT G_M6507_IG03

@fiigii
Contributor

fiigii commented Aug 27, 2018

I still would like to say that the current approach is very inefficient (dotnet/corefx#31779 (comment)).

Each S.N.Vector intrinsic in https://github.com/dotnet/corefx/issues/31425#issuecomment-415863096 loads data from its source, does some simple SIMD computation, and stores the result back to memory. That would be fine for micro-benchmarks and Matrix4x4 (which was not vectorized before), but I believe it would cause an obvious regression in scenarios that heavily use Vector3/Vector4/Vector<T>.

For example, the code below is generated from the PacketTracer benchmark, which uses hardware intrinsics directly. You can imagine that memory accesses would be inserted around (almost) every SIMD instruction if we "rewrote" it in Vector<T> (the rewriting is impossible; this is just an example). Furthermore, RyuJIT currently has no memory dependency analysis (like memory SSA), so it is impossible for the JIT to eliminate these loads/stores.

vaddps ymm2, ymm2, ymmword ptr [r8+0x8]
vmulps ymm2, ymm2, ymm0
vaddps ymm2, ymm2, ymmword ptr [rax]
vmulps ymm2, ymm2, ymm3
vaddps ymm0, ymm2, ymm0
vaddps ymm0, ymm0, ymmword ptr [rcx]
vpaddd ymm1, ymm1, ymmword ptr [rdx]
vpslld ymm1, ymm1, 0x17
vmulps ymm0, ymm0, ymm1
vmulps ymm1, ymm0, ymmword ptr   [rsp+0x600]
vmulps ymm2, ymm0, ymmword ptr   [rsp+0x5e0]
vmulps ymm0, ymm0, ymmword ptr   [rsp+0x5c0]
vmovupd ymmword ptr [rsp+0x260], ymm1 ;;; write back the results after finish as many as possible SIMD computations
vmovupd ymmword ptr [rsp+0x240], ymm2
vmovupd ymmword ptr [rsp+0x220], ymm0

@tannergooding
Member Author

@fiigii, these are likely things that the JIT will have to fix. Without it, only code in System.Private.CoreLib will be able to remain efficient and any user defined structs will be hit with these inefficiencies. CC. @CarolEidt

@fiigii
Contributor

fiigii commented Aug 27, 2018

these are likely things that the JIT will have to fix.

AFAIK, that would be really difficult.
Why not just add an "internal" intrinsic that converts between Vector128<T>/Vector256<T> and S.N.Vectors (like StaticCast<T, U>)?

@tannergooding
Member Author

Because not everyone uses S.N.Vectors; some people use their own vector types (like Unity or Math.NET, etc.).

@benaadams
Member

benaadams commented Aug 27, 2018

any user defined structs will be hit with these inefficiencies.

There's an issue for it: https://github.com/dotnet/coreclr/issues/18542

@CarolEidt
Contributor

AFAIK, that would be really difficult.
Why not just add an "internal" intrinsic that converts between Vector128/Vector256 and S.N.Vectors (like StaticCast<T, U>)?

First, without a full analysis of where the inefficiency is coming from in this case, it is hard to say how difficult this would be.

On the other hand, the messiness of a conversion intrinsic would not only be a non-trivial amount of work, but would be an unfortunate way to gloss over a fundamental inefficiency.

I think we need to analyze what's the real source of these inefficiencies (there are a number of candidates) and assess the cost of fixing them.

@fiigii
Contributor

fiigii commented Dec 18, 2018

Importing @tannergooding's comments from dotnet/coreclr#21518 (comment):

I actually hit this the other day when doing some investigation on dotnet/corefx#31425. The fix to get the SIMD intrinsics to start respecting the [Intrinsic] flag is fairly trivial (and I have it done locally for both the SIMD assembly and CoreLib), but this causes some assembly diffs due to certain methods being treated as intrinsic implicitly.

I will put up a PR after I ensure no codegen diffs (which just requires marking some methods explicitly as intrinsic)
There are also some methods marked as Intrinsic which will never be handled, since we don't have entries in the intrinsic list. This should probably be looked at as well

@tannergooding Which System.Numerics.Vectors methods have to stay [Intrinsic] (i.e. cannot be implemented with HW intrinsics)?

@tannergooding
Member Author

Which System.Numerics.Vectors methods have to stay [Intrinsic] (i.e. cannot be implemented with HW intrinsics)?

@fiigii, not sure yet. I hit the above as part of trying to remove the [Intrinsic] attribute and add an if (Sse.IsSupported) code path, and finding that the JIT was still treating the methods as intrinsic.

@fiigii
Contributor

fiigii commented Dec 18, 2018

JIT was still treating the methods as intrinsic.

Ah, could you try removing those from simdintrinsiclist.h?

@tannergooding
Member Author

tannergooding commented Dec 18, 2018

Yes, that works. However, it requires modifying CoreCLR every iteration (longer inner loop) and impacts more than just the singular method a person may be working on (for example, the single Equals entry in simdintrinsiclist.h impacts the Equals method on Vector2, Vector3, Vector4, Vector, and Vector<T>). It also leaves the System.Numerics.Vectors assembly in an odd state where you can't determine what can or cannot be an intrinsic by looking at the source.

I'd rather get the types actually respecting the Intrinsic attribute first and then continue looking at swapping things out a method and type at a time afterwards.

@fiigii
Contributor

fiigii commented Dec 18, 2018

I'd rather get the types actually respecting the Intrinsic attribute first and then continue looking at swapping things out a method and type at a time afterwards.

Agree.

@maryamariyan transferred this issue from dotnet/corefx Dec 16, 2019
@Dotnet-GitSync-Bot added the area-System.Numerics and untriaged (New issue has not been triaged by the area owner) labels Dec 16, 2019
@maryamariyan added the tenet-performance (Performance related issue) label Dec 16, 2019
@maryamariyan added this to the Future milestone Dec 16, 2019
@tannergooding
Member Author

I had done an initial investigation of this in dotnet/coreclr#27483, but it was closed due to generating more complex trees and therefore missing some optimizations. This is not something that could be readily handled today.

@CarolEidt, @jkotas, @AndyAyersMS.

Unless you have any objections, I plan on closing this as unactionable until the blocking issues are addressed. Do we have issues currently tracking the needed improvements?

As an aside, with us looking at introducing more AOT/R2R options, the ISA-specific paths codified in the JIT are becoming more problematic and we are needing to block more methods because of this (e.g. #33090). This also means consumers can't see the benefits of the ISA-specific perf improvements when they are available. On the other hand, ISA-specific paths codified in managed code don't have this problem, and we are able to support both paths relatively trivially. We also have better codegen support in general (especially around containment), and so addressing some of the currently blocking issues may become more necessary as we move forward.

@jkotas
Member

jkotas commented Mar 4, 2020

I plan on closing this as unactionable until the blocking issues are addressed

I think that this can be kept open to keep track of the progress. It is something we want to do, it just cannot be done today.

Agree with the rest.

@tannergooding
Member Author

Going to close this, as most of the SIMD intrinsics have already been ported to use HWIntrinsics internally during importation.

@ghost locked as resolved and limited conversation to collaborators Feb 14, 2021