.NET Core 3.1 integer performance 12% slower than .NET 4.8 #34414
I'd recommend rewriting the benchmark into smaller sub-benchmarks with BenchmarkDotNet and letting it take care of iterations and input; your code also measures Random.NextDouble speed, which I guess is not desired.
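A minimal sketch of that restructuring, assuming BenchmarkDotNet and illustrative names (FixedMulBenchmarks and MultiplyAll are not from the attached solution), with the random inputs generated once in [GlobalSetup] so Random.NextDouble stays out of the measurement:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FixedMulBenchmarks
{
    private double[] _testValues;

    [GlobalSetup]
    public void Setup()
    {
        // Generate inputs once, outside the measured code path.
        var rand = new Random(3);
        _testValues = new double[10_000];
        for (int i = 0; i < _testValues.Length; i++)
            _testValues[i] = rand.NextDouble();
    }

    [Benchmark]
    public long MultiplyAll()
    {
        // Stand-in body for the fixed-point multiply under test.
        long acc = 0;
        foreach (double v in _testValues)
            acc += (long)(v * 65536.0);
        return acc;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<FixedMulBenchmarks>();
}

BenchmarkDotNet then chooses iteration counts, warms the code up, and reports per-invocation statistics instead of a single wall-clock total.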
It looks like you got trapped by Tiered Compilation; see Run-time configuration options for compilation for how to disable it. For benchmarking it is advisable to use https://benchmarkdotnet.org/.
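For reference, tiered compilation can be disabled while measuring either per project or via an environment variable (these are standard runtime configuration options, not specific to this issue).

In the .csproj:

  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>

Or as an environment variable before running (the pre-.NET 5 prefix): COMPlus_TieredCompilation=0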
Thanks for reporting it @kaalus.
The code does tier up as the key methods are called in a loop, and the driver methods with loops will bypass Tier0. It takes about 200ms on my machine for Tier1 code to get installed. However, I also see apparent regressions with tiering disabled. I converted this over to BDN (keeping the code structure as is) and it also reports a regression from desktop. The 3.1 runs are quite a bit more varied and somewhat bimodal, and their "fast" time is closer to 4.8's.
So this merits a closer look. Note that runtimes seem pretty volatile -- if I reduce the value of
Probably not going to get to this for 5.0, so marking as future.
Running a further modified BDN version (no looping in the test method, no random values) still shows 5.0 regressions: (edit -- disregard this, see below)
In the above, all the time is spent constructing the
and still what looks like a clear regression.
Results still fluctuate quite a bit from run to run and sometimes only show a very small regression. ETL analysis of just benchmark intervals shows:
so this suggests several methods may have worse CQ; however, the codegen looks similar.
@kaalus do you have a more fully realized use case for this code that shows regressions?
@jakobbotsch is going to take a fresh look at this.
The code generated for .NET 6.0 seems to be almost identical for all but the main driving loop: https://www.diffchecker.com/lztPvQsp (.NET 4.8 on the left, .NET 6 on the right)
This seems to be related to method ordering and the addresses that the benchmark code ultimately ends up at. The order that .NET 4.8 ends up with is particularly good for the µop cache, it seems. I can regress the .NET 4.8 case by 15% simply by adding more code that is never called during the benchmark, causing the benchmark code to end up being allocated at different addresses. In the good case the BDN code looks like this and we get the following results:
In those cases, the method orderings and addresses look like this:
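(The address listings themselves are not reproduced in this copy of the thread. As a rough way to obtain comparable numbers, a method's entry point can be queried via reflection, as in the sketch below; BenchmarkClass comes from the benchmark source, while the AddressDump helper and the method name "MultiplyTest" are hypothetical, and the returned pointer can be a stub rather than the final tier-1 code.)

using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class AddressDump
{
    public static void Dump(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(
            methodName,
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.Static);
        // Force compilation so the handle points at generated code rather than a prestub.
        RuntimeHelpers.PrepareMethod(method.MethodHandle);
        Console.WriteLine($"{methodName}: 0x{(long)method.MethodHandle.GetFunctionPointer():X}");
    }
}

// Usage (hypothetical method name):
// AddressDump.Dump(typeof(BenchmarkClass), "MultiplyTest");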
These addresses seem consistent between runs. Now let's add some random code to the constructor of the benchmark class. Here the function Churn is added:

diff --git a/good.cs b/bad.cs
index c5c281a..20b96be 100644
--- a/good.cs
+++ b/bad.cs
@@ -44,16 +44,30 @@ public static class Program
     public BenchmarkClass()
     {
         double[] testValues = new double[iters];
         Random rand = new Random(3);
         for (int i = 0; i < testValues.Length; i++)
             testValues[i] = rand.NextDouble();
         _testValues = testValues;
+        Churn(18, new GenT<int> { Value = 3 });
+    }
+
+    struct GenT<T>
+    {
+        public T Value;
+    }
+
+    private int Churn<T>(int amount, GenT<T> value) where T : struct
+    {
+        if (amount == 0)
+            return 0;
+
+        return Churn<GenT<T>>(amount - 1, new GenT<GenT<T>> { Value = value }) + 1;
     }

Now the results look like this:
The benchmark regressed by 15%. The ordering and the deltas are the same, but the addresses are different:
We can add even more code to get fast benchmarks again, for example at n=30:
Now the question is if we can get the fast version in .NET 6. Without any
The first difference is that .NET 6 places the functions in a different order. The second is that the delta is larger for
From a code locality perspective this seems to be as good as we can do. Yet, even with varying the
I think there are a couple of actionable items here:
I don't think there's much we can do here for .NET 6 since CQ seems to be on par or better, so I will move this to .NET 7 for now. However, as a workaround you can mark the explicit double conversion with
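The attribute name is cut off here, but given the "inlining workaround" referenced in the closing comment, it is presumably MethodImplOptions.AggressiveInlining on the conversion operator. A sketch under that assumption, with a hypothetical Fixed64 type standing in for the fixed-point struct:

using System.Runtime.CompilerServices;

public readonly struct Fixed64
{
    public readonly long Raw;              // assumed Q32.32 raw representation
    public Fixed64(long raw) => Raw = raw;

    // Workaround: ask the JIT to inline the explicit double conversion at call sites.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static explicit operator Fixed64(double value)
        => new Fixed64((long)(value * 4294967296.0));   // 2^32 scaling for Q32.32
}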
I will close this given the inlining workaround above, and given that this was mainly micro-architectural. We also have the possibility of doing method ordering in R2R now which addresses some of the concerns around that. |
I was benchmarking the following code (64 bit fixed point type multiplication) under .NET 4.8 and .NET Core 3.1. Due to the jitter improvements made in .NET Core I expected the performance to be better. Unfortunately this code consistently runs about 12% slower than in .NET 4.8:
.NET 4.8: 2.66 sec
.NET Core 3.1: 2.99 sec
(i7 7700k, both projects in Release and "Prefer 32 bit" option disabled)
Please find attached VS solution with both projects: FixedBenchmark.zip
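(The attached solution is not reproduced inline in this thread. Purely for orientation, a 64-bit Q32.32 fixed-point multiply of the kind described above typically looks something like the sketch below; the Fixed64 name, layout, and omission of rounding/overflow handling are assumptions, not the reporter's actual code.)

public struct Fixed64
{
    public long Raw;   // assumed Q32.32 representation: value = Raw / 2^32

    public static Fixed64 operator *(Fixed64 a, Fixed64 b)
    {
        // Split each operand into 32-bit halves so the 128-bit product can be
        // assembled with 64-bit arithmetic, then shift the Q64.64 product back
        // down to Q32.32. Rounding and overflow checks are omitted.
        long aHi = a.Raw >> 32;
        ulong aLo = (uint)a.Raw;
        long bHi = b.Raw >> 32;
        ulong bLo = (uint)b.Raw;

        long result = (aHi * bHi) << 32;
        result += aHi * (long)bLo;
        result += (long)aLo * bHi;
        result += (long)((aLo * bLo) >> 32);
        return new Fixed64 { Raw = result };
    }
}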
category:cq
theme:needs-triage
skill-level:expert
cost:large