-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When calculating with xxhash128, it is slower. #103539
Comments
Please share a standalone benchmark using benchmarkdotnet. My guess is you're measuring the native code behind SHA256 against unoptimized C# due to not having tiered up yet to optimized code. Measuring a single run with DateTime is not going to give you a representative answer for sustained throughput. |
Using using System.IO;
using System.IO.Hashing;
using System.Security.Cryptography;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Bench>();
[MemoryDiagnoser]
public class Bench {
[Benchmark]
public byte[] BenchSHA256()
{
using var fs = File.OpenRead("/Users/vcsjones/Projects/scratch/input.bin");
return SHA256.HashData(fs);
}
[Benchmark]
public byte[] BenchXxHash128()
{
using var fs = File.OpenRead("/Users/vcsjones/Projects/scratch/input.bin");
XxHash128 hash = new();
hash.Append(fs);
return hash.GetHashAndReset();
}
}
Where I will also note that in your example, you are reading from the file in chunks and then appending it to either |
I ran this bench on Azure arm64 vs x64 and arm64 seems to be 3x slower (x64 has avx2 though), see #99871 (comment) after inspecting asm, it seems that arm64 impl suffers a lot from |
Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in Vector128<ulong> Mul(Vector128<ulong> a, Vector128<ulong> b) => a * b; Avx512: ; Method Bench:Mul
vmovups xmm0, xmmword ptr [r8]
vpmullq xmm0, xmm0, xmmword ptr [r9]
vmovups xmmword ptr [rdx], xmm0
mov rax, rdx
ret
; Total bytes of code: 19 Avx2 only: ; Method Bench:Mul
push rsi
push rbx
sub rsp, 104
mov rbx, rdx
mov rdx, qword ptr [r8]
mov qword ptr [rsp+0x58], rdx
mov rdx, qword ptr [r9]
mov qword ptr [rsp+0x50], rdx
mov rdx, qword ptr [rsp+0x58]
imul rdx, qword ptr [rsp+0x50]
mov qword ptr [rsp+0x60], rdx
mov rsi, qword ptr [rsp+0x60]
mov rdx, qword ptr [r8+0x08]
mov qword ptr [rsp+0x40], rdx
mov rdx, qword ptr [r9+0x08]
mov qword ptr [rsp+0x38], rdx
mov rcx, qword ptr [rsp+0x40]
mov rdx, qword ptr [rsp+0x38]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+0x48], rax
mov rax, qword ptr [rsp+0x48]
mov qword ptr [rsp+0x20], rsi
mov qword ptr [rsp+0x28], rax
vmovaps xmm0, xmmword ptr [rsp+0x20]
vmovups xmmword ptr [rbx], xmm0
mov rax, rbx
add rsp, 104
pop rbx
pop rsi
ret
; Total bytes of code: 120 Arm64: ; Method Bench:Mul
G_M34924_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x60]!
str d8, [sp, #0x58]
mov fp, sp
mov v16.16b, v0.16b
str d16, [fp, #0x48] // [V08 tmp5]
mov v16.16b, v1.16b
str d16, [fp, #0x40] // [V09 tmp6]
ldp x0, x1, [fp, #0x40] // [V09 tmp6], [V08 tmp5]
mul x1, x0, x1
str x1, [fp, #0x50] // [V06 tmp3]
ldr d8, [fp, #0x50] // [V06 tmp3]
ext v16.16b, v0.16b, v0.16b, #8
str d16, [fp, #0x30] // [V25 tmp22]
ext v16.16b, v1.16b, v1.16b, #8
str d16, [fp, #0x28] // [V26 tmp23]
ldp x1, x0, [fp, #0x28] // [V26 tmp23], [V25 tmp22]
movz x2, #0xD740 // code for System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong
movk x2, #0xB2A5 LSL #16
movk x2, #0x7FFA LSL #32
ldr x2, [x2]
blr x2 // calling Scalar`1[ulong]:Multiply(ulong,ulong):ulong helper
str x0, [fp, #0x38] // [V23 tmp20]
ldr d0, [fp, #0x38] // [V23 tmp20]
ldr q16, [fp, #0x10] // [V38 tmp35]
ins v16.d[0], v8.d[0]
ins v16.d[1], v0.d[0]
mov v0.16b, v16.16b
ldr d8, [sp, #0x58]
ldp fp, lr, [sp], #0x60
ret lr
; Total bytes of code: 120 TL;DR: XxHash128 hits some terrible penalty on Arm64 and x64 without Avx512 |
The only hash function required is xxhash. I added this tentatively to see what happens if the hash function is changed to SHA256.
|
[Benchmark]
public byte[] BenchSHA256()
{
using var fs = File.OpenRead("test.mp4");
return SHA256.HashData(fs);
}
[Benchmark]
public string BenchXxHash128()
{
using var fs = File.OpenRead("test.mp4");
XxHash128 xxhash = new();
xxhash.Append(fs);
var hash = Convert.ToHexString(xxhash.GetHashAndReset()).ToLower();
Console.WriteLine(hash);
return hash;
}
[Benchmark]
public async Task<string> GetHashAsync()
{
var filename = "test.mp4";
int _bufferSize = 1024 * 1024 * 10;
var Buffer = new byte[_bufferSize];
var Status = string.Empty;
using (var entryStream = File.OpenRead(filename))
{
var xxhash = new XxHash128();
long totalBytesRead = 0;
var bytesRead = await entryStream.ReadAsync(Buffer, 0, Buffer.Length)
.ConfigureAwait(false);
var startTime = DateTime.Now;
var speed = new ByteSize();
while (bytesRead > 0)
{
xxhash.Append(Buffer.AsSpan(0, bytesRead));
totalBytesRead += bytesRead;
speed = ByteSize.FromBytes(totalBytesRead / (DateTime.Now - startTime).TotalSeconds);
Status = $"{speed.LargestWholeNumberValue:#.##}{speed.LargestWholeNumberSymbol}/s";
Console.WriteLine(Status);
bytesRead = await entryStream.ReadAsync(Buffer, 0, Buffer.Length)
.ConfigureAwait(false);
}
Array.Clear(Buffer, 0, Buffer.Length);
return Convert.ToHexString(xxhash.GetHashAndReset()).ToLower();
}
} // * Summary * BenchmarkDotNet v0.13.12, macOS Sonoma 14.5 (23F79) [Darwin 23.5.0]
The speed indicated by |
@DogFortune Your benchmark code contains too many mixture of unrelated stuffs. However, it seems that the default buffer size for [Benchmark]
[Arguments(4096)]
[Arguments(4096 * 1024)]
public UInt128 GetHash(int bufferSize)
{
using var fs = File.OpenRead(@"D:\data.zip"); // 162 MB
XxHash128 hash = new XxHash128();
byte[] buffer = ArrayPool<byte>.Shared.Rent(bufferSize);
int bytesRead;
while ((bytesRead = fs.Read(buffer)) != 0)
{
hash.Append(buffer.AsSpan(0, bytesRead));
}
ArrayPool<byte>.Shared.Return(buffer);
return hash.GetCurrentHashAsUInt128();
}
4KB is the default buffer size used for |
@huoyaoyuan I set the buffer to 4MB but no improvement. When I run the benchmark in my environment, I get the desired performance, but when I run the same code in MAUI, the performance is extremely poor. |
To be consistent and comparable with |
Are you targeting iOS or macOS? iOS will hit optimization of Mono runtime, which is yet another topic. |
@huoyaoyuan The target is macOS and Windows, IOS is not included. |
@EgorBo, could it be just https://godbolt.org/z/eqsrf341M? |
it looks like the x64 impl is not correct? we need NOTE: we need to accelerate multiplication for both ulong and signed long |
Ah, here is the special casing for avx512dq, avx2 and sse2: https://godbolt.org/z/aWr6rhbxh |
MAUI uses the Mono runtime in macOS.
So it seems the that target runtime in question here is Mono, not CoreCLR. |
tried these, still don't match what avx512 instruction produces when I run them through tests |
Tagging subscribers to this area: @lambdageek, @steveisok |
It seems to be a Mono codegen issue, so I am moving it to area-Codegen-JIT-mono area. |
I took @stephentoub benchmark from #103669 and run it for .NET 9 CLR vs Mono: dotnet run -c Release --filter * --runtimes net9.0 mono9.0 BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-preview.6.24277.6
[Host] : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
Job-CFIFYM : .NET 9.0.0 (9.0.24.27202), X64 RyuJIT AVX2
Job-MISQRF : .NET 9.0.0 (9.0.24.27202) using MonoVM, X64 X86Base
|
BDN numbers using the same benchmark as #103539 (comment) on osx-arm64 with .NET 9 Preview 6:
Still not great. Whatever it is that we're not doing, we're not doing it on both x64 and arm64 |
Hi
I am building an application that calculates the hash value of a file using xxhash.
https://github.com/DogFortune/F2Checker/blob/63f6402ab041db50363a2e37bb2ccd180d04f9e4/F2Checker/Models/AppContext.cs#L29
I know that xxhash is very fast. However, we are currently encountering a problem where the speed is not as fast as we would like.
hanging the hash algorithm to SHA256 improves speed.
These are the results measured in the release build. Measurements were taken on 14 GiB mp4 files.
xxhash128: 350MB/s
SHA256: 1.63GB/s
Please tell me why performance is not improving.
I am Japanese and not good at English, so I am sorry if I am wrong.
Configuration
The text was updated successfully, but these errors were encountered: