-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid large table as part of TensorPrimitives remainder mask #92765
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsAs discussed in #92672 (review)
|
...ibraries/System.Numerics.Tensors/src/System/Numerics/Tensors/TensorPrimitives.netstandard.cs
Show resolved
Hide resolved
Vector512.LoadUnsafe( | ||
ref Unsafe.As<uint, float>(ref MemoryMarshal.GetReference(RemainderUInt32Mask_16x16)), | ||
(uint)(validItems * 16)); // all sixteen floats in the row | ||
Vector512.ConditionalSelect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for AVX512 it'd looked like
var mmask16 mask = (1 << n) - 1;
var maskedAnd = Avx512.And(vec1, vec2, mask);
but we decided not to expose mask registers so if we expose these as a public API, we can intrinsify at leas the AVX512 version to be cheap 🙂 if it matters, presumably, performance of handling of trailing elements is not that much important esp for large data.
Vector128.ConditionalSelect( | ||
Vector128.LessThan(Vector128.Create(3, 2, 1, 0), Vector128.Create(validItems)).AsSingle(), | ||
Vector128<float>.AllBitsSet, | ||
Vector128<float>.Zero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be slower and result in a 128-bit constant (or more for other sizes) emitted at runtime per method this gets inlined into.
I think the table is ultimately better and we can get rid of it longer term using a proper JIT intrinsic.
If we really don't want the table, then achieving this with just a broadcast + comparison should be sufficient, since LessThan
already produces a per-element mask of AllBitsSet
(true) and Zero
(false). You can use GreaterThanOrEqual
if you need the mask inverted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll stick with the table. I was on the fence anyway. Just don't like its bulkiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is, since TrailingMask
wants to handle only trailing elements, it will skip processing the first n
items that have already been processed.
So if 2
items remain, you have 3 < 2, 2 < 2, 1 < 2, 0 < 2
which produces Zero, Zero, AllBitsSet, AllBitsSet
already
As discussed in #92672 (review)