Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ref T overloads for hardware intrinsic functions which takes T* #36182

Closed
RamType0 opened this issue May 10, 2020 · 20 comments
Closed

Add ref T overloads for hardware intrinsic functions which takes T* #36182

RamType0 opened this issue May 10, 2020 · 20 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime.Intrinsics
Milestone

Comments

@RamType0
Copy link

Currently,these are some hardware intrinsic functions which takes T*.
But requiring T* means requires pinning.
Obliviously,hardware intrinsic functions are used for performance,
but pinning will make lower performance.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 10, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

@john-h-k
Copy link
Contributor

john-h-k commented May 10, 2020

cc @tannergooding

These were, iirc, explicitly disregarded because they want intrinsics to have to be explicit in nature.

You can workaround it with reference casting, which generates pretty great asm

public Vector128<T> ReadVector128<T>(ref T start) where T : struct
    => Unsafe.As<T, Vector128<T>>(ref start);

You can see this approach being used in System.Buffers.Text.Base64

related convo:

Me:
Is there a way to load vectors through references rather than pointers?

Tanner:
No
by design
Primarily because reading more than a T from a ref is inherently unsafe

@tannergooding tannergooding added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime.Intrinsics and removed untriaged New issue has not been triaged by the area owner labels May 10, 2020
@ghost
Copy link

ghost commented May 10, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@tannergooding
Copy link
Member

I thought we had an existing issue for this, but I can't seem to find it right now...

but pinning will make lower performance.

Pinning is relatively low overhead, it effectively just involves storing a value to the stack and zeroing that location when done. While it can have overhead if you are pinning in a tight loop, it is no more expensive than stack spilling in general. Given the most likely scenario with HWIntrinsics is you are working with large amounts of data, pinning your data once upfront should be negligible in cost. Pinning when you have smaller amounts of data may have a larger relative impact, but it would ideally still not be noticeable due to being on small amounts of data. Instead, it would likely only be an issue if you are working with a large number of small inputs.

The downside to pinning is that it can inhibit the GC and indirectly impact performance if a compaction happens. However, ref arithmetic requires more verbose patterns that can impact readability/maintainability in already complex code. Exposing Sse.LoadVector128(ref float value) wouldn't help with that as you would still need to understand that:

  1. ref float value loads 4 floats, not 1
  2. Your inner loop needs to change to use Unsafe.Add<T>(ref T source, int elementOffset) to move to the next entry

This could somewhat be mitigated by taking a ref Vector128<float>, but it is still not ideal and you are more likely to have a float[] as input (otherwise you don't need to use LoadVector128 in the first place, you can just dereference the Vector128) and would require additional Unsafe.As<T, Vector128<T>>(ref T value) casts.

You could mitigate the ref arithmetic issue by exposing Sse.LoadVector128(ref T value, nint elementOffset), but that likely has its own drawbacks as well and doesn't necessarily address that the ref float really must be 4x float (or that ref byte must be 16x byte).

Last time I discussed this with @jkotas and @CarolEidt, it was determined that exposing overloads that took ref was not likely worthwhile at the time. I'm not aware of that stance changing.

@RamType0
Copy link
Author

I also meant System.Runtime.Intrinsics.X86.Avx2.GatherXXX.

Me:
Is there a way to load vectors through references rather than pointers?

Tanner:
No
by design
Primarily because reading more than a T from a ref is inherently unsafe

I think hardware intrinsic is already inherently unsafe.

BTW,Unsafe.As<T,Vector> approach could cause DataMisalignedException on ARM platform.
You should use Unsafe.ReadUnaligned instead.
This approach is used for Span.IndexOf<byte>.
This code is trying to align vector manually.But we don't know when GC happens,and GC breaks alignment,we still need to use Unsafe.ReadUnaligned to load vector.

@GrabYourPitchforks
Copy link
Member

I think hardware intrinsic is already inherently unsafe.

To the best of my knowledge, the only APIs in System.Runtime.Intrinsics that are unsafe are those that explicitly take pointers (and so you need to use an unsafe block to call them) or those that explicitly have Unsafe in the method name (see example). All other APIs should be safe unless I'm missing something.

This is the same reason we don't use ref byte to refer to a buffer in any of our public-facing APIs, with specific exceptions like Unsafe.ReadUnaligned<T>(ref byte).

@RamType0
Copy link
Author

I think hardware intrinsic is already inherently unsafe.

To the best of my knowledge, the only APIs in System.Runtime.Intrinsics that are unsafe are those that explicitly take pointers (and so you need to use an unsafe block to call them) or those that explicitly have Unsafe in the method name (see example). All other APIs should be safe unless I'm missing something.

Yeah,ref T overload could be named Unsafe.

This is the same reason we don't use ref byte to refer to a buffer in any of our public-facing APIs, with specific exceptions like Unsafe.ReadUnaligned<T>(ref byte).

So,why not intrinsic functions become exception?
Most scenarios like using Unsafe class are starving for performance, even if it is unsafe.
These scenarios are very close to scenario where intrinsics are used.

Pinning is relatively low overhead, it effectively just involves storing a value to the stack and zeroing that location when done. While it can have overhead if you are pinning in a tight loop,

If we could avoid pinning in a tight loop,we could use SIMD operations effectively with extern methods.
This restriction will make lower advantages of intrinsics.

@GrabYourPitchforks
Copy link
Member

Can you provide concrete examples (with pseudo-code) of scenarios that would benefit from a ref byte API being added to the intrinsics surface area? We're using the existing intrinsics APIs very effectively today and have not yet seen a need to move away from using pointers.

@tannergooding
Copy link
Member

@GrabYourPitchforks, we use Unsafe.As and ref in a couple places in CoreCLR itself: https://source.dot.net/#System.Memory/System/Buffers/Text/Base64.cs,e2c2e333f97cbd35,references
IIRC, it was done to avoid GC overhead from long term or frequent pinning.

The use-case definitely exists, I think it's just a question of whether the existing support via Unsafe is fine or if exposing explicit overloads that take ref T, int elementIndex is worthwhile.
It's definitely not a lot of work to expose after all as it's basically just approving the APIs and adding them to a table.

@GrabYourPitchforks
Copy link
Member

Right, I'm aware of our internal use cases. My main point was that the existing API surface (even if you have to mix + match API calls) appears to be sufficient for our needs. I can't think offhand of a scenario which is blocked because of missing APIs.

@john-h-k
Copy link
Contributor

@GrabYourPitchforks the one thing i would say is that Unsafe methods don't have the same guarantees on other runtimes or whatever. Whereas you know you are getting the desired instruction with an actual intrinsic

@tannergooding
Copy link
Member

I can't think offhand of a scenario which is blocked because of missing APIs.

Right, there should be nothing blocked by these APIs not existing. It's mostly a question of convenience.

@RamType0
Copy link
Author

I can't think offhand of a scenario which is blocked because of missing APIs.

Right, there should be nothing blocked by these APIs not existing. It's mostly a question of convenience.

I said...

I also meant System.Runtime.Intrinsics.X86.Avx2.GatherXXX.

GatherXXX 's missing ref T overloads are obviously blocking scenario.

This is not replacable with any method,such as Unsafe.ReadUnaligned.

@tannergooding
Copy link
Member

No, but it is replaceable with fixed (T* pData = data) where data is a ref T.

Nothing is blocked, it just isn't as "ideal".

@RamType0
Copy link
Author

No, but it is replaceable with fixed (T* pData = data) where data is a ref T.

Nothing is blocked, it just isn't as "ideal".

So then,I didn't mean question of convenience.

@tannergooding
Copy link
Member

#36323 (comment) is another case where ref overloads may have been beneficial. Essentially, we have an out Matrix4x4 and we need to treat it as 4x Vector128<float>, thus needing to resort to some kind of ref arithmetic.

If we had the ref functions it could have been:

Unsafe.SkipInit(out result);

Sse.StoreVector128(ref result.M11, row1);
Sse.StoreVector128(ref result.M21, row2);
Sse.StoreVector128(ref result.M31, row3);
Sse.StoreVector128(ref result.M41, row4);

@RamType0
Copy link
Author

#36323 (comment) is another case where ref overloads may have been beneficial. Essentially, we have an out Matrix4x4 and we need to treat it as 4x Vector128<float>, thus needing to resort to some kind of ref arithmetic.

If we had the ref functions it could have been:

Unsafe.SkipInit(out result);

Sse.StoreVector128(ref result.M11, row1);
Sse.StoreVector128(ref result.M21, row2);
Sse.StoreVector128(ref result.M31, row3);
Sse.StoreVector128(ref result.M41, row4);

I didnt previously meant convenience,but it is good.

@BruceForstall BruceForstall added this to the Future milestone May 18, 2020
@BruceForstall
Copy link
Member

@tannergooding I marked this as "Future". But should we keep it at all, or should we close it because we're likely to ever implement it?

@tannergooding
Copy link
Member

But should we keep it at all, or should we close it because we're likely to ever implement it?

We've had a few asks on this now so I don't think we should close it outright.

There are scenarios where this would be beneficial and where it might simplify some code, such as the various places we are using Unsafe.As today.
There are also some places where you can't use Unsafe.As and where you must pin, such as the Gather intrinsics. However, I'm also not knowledgeable enough about the GC to know if that is safe given that it reads up to 32 non-sequential addresses using a single input address.

We should probably do some more analysis to determine how frequently this is needed and what scenarios would benefit.

@tannergooding
Copy link
Member

This has been minimally done via the new Vector64/128/256.LoadUnsafe (and StoreUnsafe) APIs.

We could potentially expose other ref overloads for platform specific HWIntrinsics but that would need to be done via a new API proposal following the required template.

@ghost ghost locked as resolved and limited conversation to collaborators Feb 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime.Intrinsics
Projects
None yet
Development

No branches or pull requests

6 participants