Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid large table as part of TensorPrimitives remainder mask #92765

Closed
wants to merge 1 commit into from

Conversation

stephentoub
Copy link
Member

As discussed in #92672 (review)

@ghost
Copy link

ghost commented Sep 28, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

As discussed in #92672 (review)

Author: stephentoub
Assignees: stephentoub
Labels:

area-System.Numerics

Milestone: -

Vector512.LoadUnsafe(
ref Unsafe.As<uint, float>(ref MemoryMarshal.GetReference(RemainderUInt32Mask_16x16)),
(uint)(validItems * 16)); // all sixteen floats in the row
Vector512.ConditionalSelect(
Copy link
Member

@EgorBo EgorBo Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for AVX512 it'd looked like

var mmask16 mask = (1 << n) - 1;
var maskedAnd = Avx512.And(vec1, vec2, mask);

but we decided not to expose mask registers so if we expose these as a public API, we can intrinsify at leas the AVX512 version to be cheap 🙂 if it matters, presumably, performance of handling of trailing elements is not that much important esp for large data.

Comment on lines +1407 to +1410
Vector128.ConditionalSelect(
Vector128.LessThan(Vector128.Create(3, 2, 1, 0), Vector128.Create(validItems)).AsSingle(),
Vector128<float>.AllBitsSet,
Vector128<float>.Zero);
Copy link
Member

@tannergooding tannergooding Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be slower and result in a 128-bit constant (or more for other sizes) emitted at runtime per method this gets inlined into.

I think the table is ultimately better and we can get rid of it longer term using a proper JIT intrinsic.


If we really don't want the table, then achieving this with just a broadcast + comparison should be sufficient, since LessThan already produces a per-element mask of AllBitsSet (true) and Zero (false). You can use GreaterThanOrEqual if you need the mask inverted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll stick with the table. I was on the fence anyway. Just don't like its bulkiness.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, since TrailingMask wants to handle only trailing elements, it will skip processing the first n items that have already been processed.

So if 2 items remain, you have 3 < 2, 2 < 2, 1 < 2, 0 < 2 which produces Zero, Zero, AllBitsSet, AllBitsSet already

@ghost ghost locked as resolved and limited conversation to collaborators Oct 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants