Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MemoryStream.Write() slow on Ryzen CPUs #13009

Closed
LordBenjamin opened this issue Jun 29, 2019 · 18 comments · Fixed by dotnet/coreclr#25750
Closed

MemoryStream.Write() slow on Ryzen CPUs #13009

LordBenjamin opened this issue Jun 29, 2019 · 18 comments · Fixed by dotnet/coreclr#25750
Labels
Milestone

Comments

@LordBenjamin
Copy link

I've noticed while doing some benchmarking that .NET Core is noticeably (21%) slower than .NET Framework on my AMD Ryzen 1200 based PC for a certain piece of code.

Running the same benchmark on an Intel i7 6700 based PC shows .NET Core running significantly (41%) faster than Framework.

Code to reproduce is here:
https://github.com/LordBenjamin/DotNetCore-Ryzen-Performance-Repro/

Benchmarks

  • OriginalBenchmark.cs

Quite close to my actual code. Slower on Ryzen and faster on Intel.

  • StreamWriteBenchmark.cs

Distilled down after I noticed that commenting out the call to MemoryStream.Write(byte[] buffer, int offset, int count); in OriginalBenchmark.cs leaves .NET Core consistently faster on both CPUs. Core performance is significantly worse on Ryzen, but comparable to Framework on Intel.

Affected Frameworks

I can reproduce using both .NET Core 2.1 and 3.0. Framework version is 4.7.2 in both cases.

Results

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  68.21 us | 0.4611 us | 0.4088 us |
| StreamWrite | Core |    Core | 118.42 us | 0.5408 us | 0.5058 us |
| Original    |  Clr |     Clr |  502.4 us | 6.2900 us | 5.8840 us |
| Original    | Core |    Core |  630.2 us | 3.8080 us | 3.5620 us |
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.700
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3801.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |     Mean |     Error |    StdDev |
|------------ |----- |-------- |---------:|----------:|----------:|
| StreamWrite |  Clr |     Clr | 75.37 us | 0.9133 us |  0.7626 us|
| StreamWrite | Core |    Core | 75.39 us | 0.8624 us |  0.8067 us|
| Original    |  Clr |     Clr | 598.0 us |11.6530 us | 14.74 us  |
| Original    | Core |    Core | 341.6 us | 6.8020 us | 10.79 us  |

Happy to supply further information or change and re-run benchmarks as required.

@stephentoub stephentoub transferred this issue from dotnet/corefx Jun 30, 2019
@LordBenjamin
Copy link
Author

Additional benchmark data from a machine with a Ryzen 1600X. Same characteristics:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 5 1600X, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.2.100
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
 
 
|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  59.30 us | 0.4784 us | 0.4241 us |
| StreamWrite | Core |    Core | 124.52 us | 1.5506 us | 1.3746 us |

@LordBenjamin LordBenjamin changed the title Potential Ryzen performance edge-case MemoryStream.Write() slow on Ryzen CPUs Jul 1, 2019
@LordBenjamin
Copy link
Author

May not be strictly related to Ryzen - I've managed to reproduce on a Xeon (albeit running inside a VM):

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.504 (1809/October2018Update/Redstone5), VM=Hyper-V
Intel Xeon Platinum 8168 CPU 2.70GHz, 1 CPU, 36 logical and 18 physical cores
.NET Core SDK=2.1.700
  [Host] : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 2.1.11 (CoreCLR 4.6.27617.04, CoreFX 4.6.27617.02), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  89.87 us | 1.7903 us | 2.1312 us |
| StreamWrite | Core |    Core | 110.88 us | 0.8309 us | 0.7772 us |

@danmoseley
Copy link
Member

Is it possible to show a comparison of 2.1 and 3.0?

@LordBenjamin
Copy link
Author

Re-ran on Ryzen 1200 using 3.0 - I can run on the Intel machines later if required.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.557 (1809/October2018Update/Redstone5)
AMD Ryzen 3 1200, 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |    Error |   StdDev |
|------------ |----- |-------- |----------:|---------:|---------:|
| StreamWrite |  Clr |     Clr |  68.10 us | 1.341 us | 1.696 us |
| StreamWrite | Core |    Core | 111.50 us | 1.235 us | 1.095 us |

@LordBenjamin
Copy link
Author

Got a new CPU (Ryzen 3600) recently and .NET 4.7.2 runs considerably faster, but Core 3.0 runs even slower:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.615 (1809/October2018Update/Redstone5)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview6-012264
  [Host] : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3416.0
  Core   : .NET Core 3.0.0-preview6-27804-01 (CoreCLR 4.700.19.30373, CoreFX 4.700.19.30308), 64bit RyuJIT


|      Method |  Job | Runtime |      Mean |     Error |    StdDev |
|------------ |----- |-------- |----------:|----------:|----------:|
| StreamWrite |  Clr |     Clr |  44.45 us | 0.0486 us | 0.0430 us |
| StreamWrite | Core |    Core | 138.92 us | 0.4383 us | 0.4100 us |

@danmoseley
Copy link
Member

Perhaps @tannergooding has thoughts.

@janvorli
Copy link
Member

I am just about to refresh my private PC with Ryzen 3600, so unless someone beats me on that, I can look into it once I have the new machine set up. I expects that to happen in about two weeks.

@tannergooding
Copy link
Member

I'll profile on my 1800X when I get into the office. I'll be getting a 3900X in a few days and can test on that then as well.

@tannergooding
Copy link
Member

tannergooding commented Jul 15, 2019

After investigating, almost the entire time for both .NET Core and full framework is being spent in a memcpy routine. In full framework, it is calling memcpy directly; However, in .NET Core it is calling JIT_MemCpy in src/vm/amd64/CrtHelpers.asm and that looks to be causing the ~2x slowdown.

@tannergooding
Copy link
Member

Looks like this was changed in dotnet/coreclr#7198

@janvorli
Copy link
Member

Interesting, I was almost sure it will be that :-).

@janvorli
Copy link
Member

It would be interesting to try if changing the rep movsb cutoff that's set to 512 bytes would help Ryzen to perform better.

@tannergooding
Copy link
Member

I'll give that a try. Also going to grab perf metrics using the CRT implementation method.

@tannergooding
Copy link
Member

For reference, the AMD Optimization Manuals currently give the following guidelines for rep stos
image

From the guidance given and the current source code, the following things are likely hurting perf:

  • Not using the largest possible operand size
  • Not attempting to align the source/destination to 16-bytes (or at least the destination if both can't be)
  • Not using other techniques if the repeat count is larger than the data cache size

@tannergooding
Copy link
Member

Reverting to use the CRT implementation makes it the same speed as .NET Framework (it actually looks to be consistently about 1% faster, but that is likely within error).

Changing the JIT_MemCpy algorithm to remove the rep stosb path makes the code ~2% slower than the .NET Framework code.

As @jkotas called out in the original PR (dotnet/coreclr#7198 (comment)), having a private copy of memset/memcpy has a number of drawbacks. It also certainly hasn't undergone nearly the same amount of testing/profiling as the CRT implementation.

I would be very much in favor of removing this path, especially since it causes a nearly 2x slowdown on Ryzen machines. It is also missing a bunch of logic around prefetching, ensuring data alignment, etc that the CRT implementation does currently have.

@tannergooding
Copy link
Member

@jkotas, @janvorli. I this something we should consider fixing for 3.0, considering it is a nearly 2x penalty for copying memory on Ryzen CPUs?

The fix is relatively trivial, removes us having a custom memcpy routine (which is already 64-bit Windows only), and won't significantly regress Intel CPUs (based on the numbers in the original PR, it was 33% average for inputs under 512 bytes and no real difference for inputs over 512 bytes).

@jkotas
Copy link
Member

jkotas commented Jul 17, 2019

In any case, you should submit a fix to master for .NET 5. We can then decide whether it will get backported to release/3.0.

@tannergooding
Copy link
Member

I put up dotnet/coreclr#25750 to resolve the issue for .NET 5

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
7 participants