Accelerate Vector128<long>::op_Multiply on x64 #103555
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics

Note: results should be better if we do it in the JIT; it will enable loop hoisting, CSE, etc. for MUL.

Note: #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.
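For context, the non-AVX512 lowering discussed here decomposes each 64-bit lane multiply into 32-bit pmuludq products. Below is a minimal managed sketch of that classic decomposition; the helper name is mine, and this is illustrative rather than the exact sequence the PR emits:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, not from the PR: per-lane 64-bit multiply without AVX512,
// using a*b = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)  (mod 2^64).
static Vector128<ulong> MultiplyNoAvx512(Vector128<ulong> a, Vector128<ulong> b)
{
    Vector128<ulong> aHi = Sse2.ShiftRightLogical(a, 32);               // high 32 bits of each lane
    Vector128<ulong> bHi = Sse2.ShiftRightLogical(b, 32);
    Vector128<ulong> lo  = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());   // pmuludq: lo(a)*lo(b)
    Vector128<ulong> m1  = Sse2.Multiply(aHi.AsUInt32(), b.AsUInt32()); // hi(a)*lo(b)
    Vector128<ulong> m2  = Sse2.Multiply(a.AsUInt32(), bHi.AsUInt32()); // lo(a)*hi(b)
    return Sse2.Add(lo, Sse2.ShiftLeftLogical(Sse2.Add(m1, m2), 32));
}
```

Doing this in the JIT rather than in library code is what allows the intermediate shifts and products to participate in hoisting and CSE.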
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs
…sics/Vector128_1.cs Co-authored-by: Tanner Gooding <tagoo@outlook.com>
@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

```csharp
using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}
```
Benchmark results on Intel. Flame graphs: Main vs PR 🔥

Benchmark results on AMD. Flame graphs: Main vs PR 🔥

Benchmark results on Arm64. Flame graphs: Main vs PR 🔥
/azp list
/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding PTAL. I'll add arm64 separately; I need to test different implementations for it. Benchmark improvement: #103555 (comment)
```cpp
// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
GenTreeHWIntrinsic* tmp3 =
    gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                             is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                             CORINFO_TYPE_UINT, simdSize);
```
I know in other places we've started avoiding hadd in favor of shuffle+add; might be worth seeing if that's appropriate here too (low priority, non-blocking).
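To make the suggestion concrete, here is a hedged sketch of a phaddd-equivalent built from shuffle+add; the helper name is mine, and this is not code from the PR:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical stand-in for Ssse3.HorizontalAdd(a, b): gather even and odd
// lanes with shufps, then add, yielding [a0+a1, a2+a3, b0+b1, b2+b3].
static Vector128<int> PairwiseAdd(Vector128<int> a, Vector128<int> b)
{
    Vector128<int> even = Sse.Shuffle(a.AsSingle(), b.AsSingle(), 0b10_00_10_00).AsInt32(); // a0,a2,b0,b2
    Vector128<int> odd  = Sse.Shuffle(a.AsSingle(), b.AsSingle(), 0b11_01_11_01).AsInt32(); // a1,a3,b1,b3
    return Sse2.Add(even, odd);
}
```

On typical x64 cores hadd decodes into shuffle-and-add micro-ops internally anyway, which is why the explicit form is usually no slower and can schedule better.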
I tried benchmarking different implementations for it and they were all equally fast, e.g. #99871 (comment).
```cpp
{
    // TODO-XARCH-CQ: We should support long/ulong multiplication
    // TODO-XARCH-CQ: 32bit support
```
What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?
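For readers unfamiliar with the naming (my example, not from the PR): the `X64` nested classes expose intrinsics that are only callable from 64-bit processes, e.g.:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sse2.X64.ConvertToInt64 (movq to a 64-bit GPR) exists only on 64-bit hosts;
// the question above notes the fallback doesn't appear to rely on any such intrinsic,
// so in principle it could work on 32-bit too.
long low = Sse2.X64.ConvertToInt64(Vector128.Create(42L, 0L));
```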
Not sure, to be honest; that check was pre-existing, and I only changed the comment.
This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not present on the system. It makes XxHash128 faster; see #103555 (comment).

Current codegen on an x64 CPU without AVX512:
New codegen:
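For contrast with the fallback path, on hardware with AVX512DQ+VL the whole operation lowers to a single vpmullq. A minimal managed sketch (the helper name is mine; this is not code from the PR):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: single-instruction path when AVX512DQ+VL is available.
static Vector128<long> MultiplyAvx512(Vector128<long> a, Vector128<long> b)
{
    // Avx512DQ.VL.MultiplyLow maps to vpmullq, the element-wise 64-bit multiply.
    return Avx512DQ.VL.MultiplyLow(a, b);
}
```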