Vectorize TensorPrimitives.PopCount #98281
Tagging subscribers to this area: @dotnet/area-system-numerics

Use popcount hardware intrinsics to vectorize TensorPrimitives.PopCount for sizeof(T) == 1 on platforms that have such intrinsics.

I tried doing this more generally on other platforms with:

Vector128<byte> lookup = Vector128.Create((byte)0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
Vector128<byte> cnt1 = Vector128.Shuffle(lookup, x.AsByte() & Vector128.Create((byte)0xF));
Vector128<byte> cnt2 = Vector128.Shuffle(lookup, (x.AsByte() >> 4) & Vector128.Create((byte)0xF));
return (cnt1 + cnt2).As<byte, T>();

and the equivalent for Vector256/512, but it ended up being ~5x slower than just using the scalar byte.PopCount for each element. I also tried using this with SumAbsoluteDifferences in support of sizeof(T) == 8, and it ended up being an order of magnitude worse.
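For readers who want to try the nibble-lookup fallback described above, here is a minimal standalone sketch. It is not the code from this PR: the `NibblePopCount` class and `PopCountBytes` helper are hypothetical names, it works on `Vector128<byte>` only, and it assumes .NET 7 or later for `Vector128.Shuffle` and `Vector128.ShiftRightLogical`. It only checks correctness against the scalar `BitOperations.PopCount`; it makes no performance claim.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class NibblePopCount
{
    // Hypothetical helper (not from the PR): per-byte population count
    // using a 16-entry table indexed by each 4-bit nibble.
    public static Vector128<byte> PopCountBytes(Vector128<byte> x)
    {
        // lookup[i] holds the number of set bits in the nibble value i.
        Vector128<byte> lookup = Vector128.Create((byte)0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        Vector128<byte> mask = Vector128.Create((byte)0xF);

        // Count the bits of the low and high nibbles separately, then add.
        Vector128<byte> low = Vector128.Shuffle(lookup, x & mask);
        Vector128<byte> high = Vector128.Shuffle(lookup, Vector128.ShiftRightLogical(x, 4) & mask);
        return low + high;
    }

    public static void Main()
    {
        // Arbitrary test bytes; the cast wraps values above 255.
        byte[] input = new byte[16];
        for (int i = 0; i < input.Length; i++)
        {
            input[i] = (byte)(i * 17 + 3);
        }

        Vector128<byte> counts = PopCountBytes(Vector128.Create(input));

        // Compare every lane against the scalar reference.
        for (int i = 0; i < input.Length; i++)
        {
            int expected = BitOperations.PopCount((uint)input[i]);
            if (counts.GetElement(i) != expected)
            {
                throw new InvalidOperationException($"Mismatch at lane {i}");
            }
        }

        Console.WriteLine("All lanes match the scalar popcount.");
    }
}
```

Wrapping a helper like this in a benchmark against a scalar per-element loop is one way to reproduce the slowdown reported above on a given machine.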
...ries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.netcore.cs
46fdc2f to e0c3aca
Contributes to #97193