Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When calculating with xxhash128, it is slower. #103539

Open
DogFortune opened this issue Jun 16, 2024 · 20 comments
Open

When calculating with xxhash128, it is slower. #103539

DogFortune opened this issue Jun 16, 2024 · 20 comments
Labels
Milestone

Comments

@DogFortune
Copy link

DogFortune commented Jun 16, 2024

Hi

I am building an application that calculates the hash value of a file using xxhash.
https://github.com/DogFortune/F2Checker/blob/63f6402ab041db50363a2e37bb2ccd180d04f9e4/F2Checker/Models/AppContext.cs#L29

I know that xxhash is very fast. However, we are currently encountering a problem where the speed is not as fast as we would like.
hanging the hash algorithm to SHA256 improves speed.

These are the results measured in the release build. Measurements were taken on 14 GiB mp4 files.

xxhash128: 350MB/s
SHA256: 1.63GB/s

Please tell me why performance is not improving.

I am Japanese and not good at English, so I am sorry if I am wrong.

Configuration

  • .NET 8
  • .NET MAUI
  • macOS Sonoma 14.5
  • Apple M1 Pro
@DogFortune DogFortune added the tenet-performance Performance related issue label Jun 16, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jun 16, 2024
@stephentoub
Copy link
Member

Please share a standalone benchmark using benchmarkdotnet. My guess is you're measuring the native code behind SHA256 against unoptimized C# due to not having tiered up yet to optimized code. Measuring a single run with DateTime is not going to give you a representative answer for sustained throughput.

@vcsjones vcsjones added area-System.IO.Hashing and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 16, 2024
@vcsjones
Copy link
Member

Using XxHash128 on an Apple M1 Ultra, I get better throughput.

using System.IO;
using System.IO.Hashing;
using System.Security.Cryptography;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>();

[MemoryDiagnoser]
public class Bench {
    [Benchmark]
    public byte[] BenchSHA256()
    {
        using var fs = File.OpenRead("/Users/vcsjones/Projects/scratch/input.bin");
        return SHA256.HashData(fs);
    }

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        using var fs = File.OpenRead("/Users/vcsjones/Projects/scratch/input.bin");
        XxHash128 hash = new();
        hash.Append(fs);
        return hash.GetHashAndReset();
    }
}
Method Mean Error StdDev Allocated
BenchSHA256 8.118 s 0.0091 s 0.0081 s 5.46 KB
BenchXxHash128 3.032 s 0.0139 s 0.0123 s 5.77 KB

Where input.bin is a 14 GB random binary file.

I will also note that in your example, you are reading from the file in chunks and then appending it to either SHA256 or XxHash128. If all you need is one of them, then both SHA256 and XxHash128 have the ability to read directly from a stream, as demonstrated above. In the case of SHA256, using the static SHA256.HashData will be significantly more efficient. There are also Async versions that accept a cancellation token to use if you need them.

@EgorBo
Copy link
Member

EgorBo commented Jun 16, 2024

I ran this bench on Azure arm64 vs x64 and arm64 seems to be 3x slower (x64 has avx2 though), see #99871 (comment)

after inspecting asm, it seems that arm64 impl suffers a lot from Vector128<ulong> * Vector128<ulong> operation which is not accelerated

@EgorBo
Copy link
Member

EgorBo commented Jun 16, 2024

Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in ScrambleAccumulator*):

Vector128<ulong> Mul(Vector128<ulong> a, Vector128<ulong> b) => a * b;

Avx512:

; Method Bench:Mul
       vmovups  xmm0, xmmword ptr [r8]
       vpmullq  xmm0, xmm0, xmmword ptr [r9]
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret      
; Total bytes of code: 19

Avx2 only:

; Method Bench:Mul
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 120

Arm64:

; Method Bench:Mul
G_M34924_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x60]!
            str     d8, [sp, #0x58]
            mov     fp, sp
            mov     v16.16b, v0.16b
            str     d16, [fp, #0x48]	// [V08 tmp5]
            mov     v16.16b, v1.16b
            str     d16, [fp, #0x40]	// [V09 tmp6]
            ldp     x0, x1, [fp, #0x40]	// [V09 tmp6], [V08 tmp5]
            mul     x1, x0, x1
            str     x1, [fp, #0x50]	// [V06 tmp3]
            ldr     d8, [fp, #0x50]	// [V06 tmp3]
            ext     v16.16b, v0.16b, v0.16b, #8
            str     d16, [fp, #0x30]	// [V25 tmp22]
            ext     v16.16b, v1.16b, v1.16b, #8
            str     d16, [fp, #0x28]	// [V26 tmp23]
            ldp     x1, x0, [fp, #0x28]	// [V26 tmp23], [V25 tmp22]
            movz    x2, #0xD740      // code for System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong
            movk    x2, #0xB2A5 LSL #16
            movk    x2, #0x7FFA LSL #32
            ldr     x2, [x2]
            blr     x2 // calling Scalar`1[ulong]:Multiply(ulong,ulong):ulong helper
            str     x0, [fp, #0x38]	// [V23 tmp20]
            ldr     d0, [fp, #0x38]	// [V23 tmp20]
            ldr     q16, [fp, #0x10]	// [V38 tmp35]
            ins     v16.d[0], v8.d[0]
            ins     v16.d[1], v0.d[0]
            mov     v0.16b, v16.16b
            ldr     d8, [sp, #0x58]
            ldp     fp, lr, [sp], #0x60
            ret     lr
; Total bytes of code: 120

TL;DR: XxHash128 hits some terrible penalty on Arm64 and x64 without Avx512

@DogFortune
Copy link
Author

The only hash function required is xxhash. I added this tentatively to see what happens if the hash function is changed to SHA256.
I plan to implement a progress bar in the future, so I load it in chunks.
I intend to measure benchmarks in my environment as well. I will share the results when they are available.

Where input.bin is a 14 GB random binary file.

I will also note that in your example, you are reading from the file in chunks and then appending it to either SHA256 or XxHash128. If all you need is one of them, then both SHA256 and XxHash128 have the ability to read directly from a stream, as demonstrated above. In the case of SHA256, using the static SHA256.HashData will be significantly more efficient. There are also Async versions that accept a cancellation token to use if you need them.

@DogFortune
Copy link
Author

DogFortune commented Jun 17, 2024

    [Benchmark]
    public byte[] BenchSHA256()
    {
        using var fs = File.OpenRead("test.mp4");
        return SHA256.HashData(fs);
    }
    
    [Benchmark]
    public string BenchXxHash128()
    {
        using var fs = File.OpenRead("test.mp4");
        XxHash128 xxhash = new();
        xxhash.Append(fs);
        var hash = Convert.ToHexString(xxhash.GetHashAndReset()).ToLower();
        Console.WriteLine(hash);
        return hash;
    }

	[Benchmark]
    public async Task<string> GetHashAsync()
    {
        var filename = "test.mp4";
        int _bufferSize = 1024 * 1024 * 10;
        var Buffer = new byte[_bufferSize];
        var Status = string.Empty;
        using (var entryStream = File.OpenRead(filename))
        {
            var xxhash = new XxHash128();
            long totalBytesRead = 0;
            var bytesRead = await entryStream.ReadAsync(Buffer, 0, Buffer.Length)
                .ConfigureAwait(false);
            var startTime = DateTime.Now;
            var speed = new ByteSize();
            while (bytesRead > 0)
            {
                xxhash.Append(Buffer.AsSpan(0, bytesRead));
                totalBytesRead += bytesRead;
                speed = ByteSize.FromBytes(totalBytesRead / (DateTime.Now - startTime).TotalSeconds);
                Status = $"{speed.LargestWholeNumberValue:#.##}{speed.LargestWholeNumberSymbol}/s";
                Console.WriteLine(Status);
                bytesRead = await entryStream.ReadAsync(Buffer, 0, Buffer.Length)
                    .ConfigureAwait(false);
            }
            Array.Clear(Buffer, 0, Buffer.Length);
            return Convert.ToHexString(xxhash.GetHashAndReset()).ToLower();
        }
    }

// * Summary *

BenchmarkDotNet v0.13.12, macOS Sonoma 14.5 (23F79) [Darwin 23.5.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD

Method Mean Error StdDev Allocated
BenchSHA256 9.978 s 0.1320 s 0.1235 s 5.46 KB
BenchXxHash128 4.203 s 0.0452 s 0.0401 s 5.94 KB
GetHashAsync 2.623 s 0.0396 s 0.0309 s 10.27 MB

The speed indicated by GetHashAsync was 5.5GB/s. Benchmark measured results were very high performance, so we can conclude that there is no problem with the code.
Is the fact that it is a GUI application using net8.0-maccatalyst the cause of the performance degradation?

@huoyaoyuan
Copy link
Member

@DogFortune Your benchmark code contains too many mixture of unrelated stuffs.

However, it seems that the default buffer size for Append(Stream) is too small:

    [Benchmark]
    [Arguments(4096)]
    [Arguments(4096 * 1024)]
    public UInt128 GetHash(int bufferSize)
    {
        using var fs = File.OpenRead(@"D:\data.zip"); // 162 MB
        XxHash128 hash = new XxHash128();
        byte[] buffer = ArrayPool<byte>.Shared.Rent(bufferSize);
        int bytesRead;
        while ((bytesRead = fs.Read(buffer)) != 0)
        {
            hash.Append(buffer.AsSpan(0, bytesRead));
        }
        ArrayPool<byte>.Shared.Return(buffer);

        return hash.GetCurrentHashAsUInt128();
    }
Method bufferSize Mean Error StdDev
GetHash 4096 131.24 ms 2.343 ms 2.077 ms
GetHash 4194304 45.92 ms 0.911 ms 2.130 ms

4KB is the default buffer size used for Append(Stream). Increasing it to 4MB significantly improves the performance.
Your code is using a large buffer size (10MB) and it's probably the reason of performance difference.

@DogFortune
Copy link
Author

@huoyaoyuan I set the buffer to 4MB but no improvement.

When I run the benchmark in my environment, I get the desired performance, but when I run the same code in MAUI, the performance is extremely poor.
The hash calculation process was separated as a .NET Standard 2.0 project and referenced from the console application, which resulted in high performance.

@huoyaoyuan
Copy link
Member

I set the buffer to 4MB but no improvement.

To be consistent and comparable with Append(Stream), you should set the buffer size to 4KB.
Append(Stream) doesn't allow setting buffer size currently, but it would perform better than your code under the same buffer size.

@huoyaoyuan
Copy link
Member

but when I run the same code in MAUI, the performance is extremely poor.

Are you targeting iOS or macOS? iOS will hit optimization of Mono runtime, which is yet another topic.

@DogFortune
Copy link
Author

DogFortune commented Jun 17, 2024

@huoyaoyuan The target is macOS and Windows, IOS is not included.
Right now, I would like to build a satisfactory app on macOS and then start on Windows.

@am11
Copy link
Member

am11 commented Jun 17, 2024

Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in ScrambleAccumulator*):

@EgorBo, could it be just https://godbolt.org/z/eqsrf341M?

@EgorBo
Copy link
Member

EgorBo commented Jun 17, 2024

Ah, same on x64 when Avx512 is not presented. Basically, XxHash128 hits this issue (in ScrambleAccumulator*):

@EgorBo, could it be just https://godbolt.org/z/eqsrf341M?

it looks like the x64 impl is not correct? we need _mm_mullo_epi64 which is only available on avx512 (-mavx512vl -mavx512dq)

NOTE: we need to accelerate multiplication for both ulong and signed long

@am11
Copy link
Member

am11 commented Jun 17, 2024

it looks like the x64 impl is not correct? we need _mm_mullo_epi64 which is only available on avx512 (-mavx512vl -mavx512dq)

Ah, here is the special casing for avx512dq, avx2 and sse2: https://godbolt.org/z/aWr6rhbxh

@vcsjones
Copy link
Member

vcsjones commented Jun 17, 2024

MAUI

MAUI uses the Mono runtime in macOS.

For Android, iOS, and macOS, the environment is implemented by Mono

So it seems the that target runtime in question here is Mono, not CoreCLR.

@EgorBo
Copy link
Member

EgorBo commented Jun 17, 2024

it looks like the x64 impl is not correct? we need _mm_mullo_epi64 which is only available on avx512 (-mavx512vl -mavx512dq)

Ah, here is the special casing for avx512dq, avx2 and sse2: https://godbolt.org/z/aWr6rhbxh

tried these, still don't match what avx512 instruction produces when I run them through tests

Copy link
Contributor

Tagging subscribers to this area: @lambdageek, @steveisok
See info in area-owners.md if you want to be subscribed.

@adamsitnik
Copy link
Member

It seems to be a Mono codegen issue, so I am moving it to area-Codegen-JIT-mono area.

@adamsitnik
Copy link
Member

I took @stephentoub benchmark from #103669 and run it for .NET 9 CLR vs Mono:

dotnet run -c Release --filter * --runtimes net9.0 mono9.0
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-preview.6.24277.6
  [Host]     : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-CFIFYM : .NET 9.0.0 (9.0.24.27202), X64 RyuJIT AVX2
  Job-MISQRF : .NET 9.0.0 (9.0.24.27202) using MonoVM, X64 X86Base
Method Runtime Kind Mean Ratio Allocated Alloc Ratio
Hash .NET 9.0 FileStream 111.71 us 1.00 600 B 1.00
Hash Mono with .NET 9.0 FileStream 3,160.01 us 28.29 602 B 1.00
Hash .NET 9.0 MemoryStream 63.70 us 1.00 600 B 1.00
Hash Mono with .NET 9.0 MemoryStream 3,195.97 us 50.17 602 B 1.00

@lambdageek lambdageek removed the untriaged New issue has not been triaged by the area owner label Jul 17, 2024
@lambdageek lambdageek added this to the 9.0.0 milestone Jul 17, 2024
@lambdageek
Copy link
Member

BDN numbers using the same benchmark as #103539 (comment) on osx-arm64 with .NET 9 Preview 6:

Method Runtime Kind Mean Ratio Gen0 Allocated Alloc Ratio
Hash .NET 9.0 FileStream 85.75 us 1.00 - 600 B 1.00
Hash Mono with .NET 9.0 FileStream 1,559.28 us 18.17 - 857 B 1.43
Hash .NET 9.0 MemoryStream 45.72 us 1.00 0.0610 600 B 1.00
Hash Mono with .NET 9.0 MemoryStream 1,512.97 us 33.10 - 601 B 1.00

Still not great. Whatever it is that we're not doing, we're not doing it on both x64 and arm64

@steveisok steveisok modified the milestones: 9.0.0, 10.0.0 Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants