Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613
Comments
The extra no-ops should be gone in current master after dotnet/coreclr#26740.
cc @adamsitnik
We should verify perf is recovered. Let's take a look as part of .NET 5.
@erozenfeld Can you verify if any work remains here?
I still see a regression with a fairly recent P8 build:
@briansull if you haven't started looking at this, I'll take a look.
Sure, go ahead and take a look.
Suspect this is some residual overhead from tiered compilation. Disabling that gives
Differential profiles of 4.8 vs 5.0 show identical codegen in all the hot methods, e.g.:
;; 4.8
;; Program+PerformanceRegression+Tests+SimplePushStreamTest@42-6.Invoke(Int64)
00007ff8`f2816040 488b4908 mov rcx,qword ptr [rcx+8]
00007ff8`f2816044 48ffc2 inc rdx
00007ff8`f2816047 488b01 mov rax,qword ptr [rcx]
00007ff8`f281604a 488b4040 mov rax,qword ptr [rax+40h]
00007ff8`f281604e 488b4020 mov rax,qword ptr [rax+20h]
00007ff8`f2816052 48ffe0 jmp rax
;; 5.0
;; Assembly listing for method SimplePushStreamTest@42-6:Invoke(long):Unit:this
G_M35068_IG01:
;; bbWeight=1 PerfScore 0.00
G_M35068_IG02:
mov rcx, gword ptr [rcx+8]
inc rdx
mov rax, qword ptr [rcx]
mov rax, qword ptr [rax+64]
mov rax, qword ptr [rax+32]
;; bbWeight=1 PerfScore 8.25
G_M35068_IG03:
rex.jmp rax
cc @kouvel Going to relabel this as VM.
This seems to have something to do with JIT timing. I have seen cases before where different loop alignment led to noticeable regressions when tiering is enabled. Perf could just as easily come out better by chance, but regressions naturally get noticed more. Adjusting the timing of rejits by changing the call count threshold seems to change the perf significantly. Code locality may also be relevant: I noticed that using R2R'ed runtime binaries versus IL-only runtime binaries also affects perf significantly, regardless of tiering. Tiering timings have changed between 3.1 and 5.0. Typically that would be for the better, though there may be benchmarks where slight differences in timing show up as larger differences in performance. I suspect this is not a regression due to a bug but rather a regression due to chance. It would need more investigation to determine the root causes and, if my theory is right, how to reduce the chance factor and make perf more deterministic.
I see large differences in time spent in a function when the function's code crosses a cache line boundary. For small functions it may help to align them such that they would fit within a cache line.
#2249 added 32-byte alignment for tier1 methods with loops; might be interesting to try this for all tier1 methods.
I see, I was also thinking something like this:
Might allow for a bit more density, but #2249 also mentioned that crossing a 32-byte boundary has perf issues for loops, so 32-byte alignment may work better in some cases.
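To make the alignment trade-off in the comments above concrete, here is a minimal Python sketch (not runtime code). It checks whether a method body crosses a boundary and compares always aligning against padding only when the body would otherwise straddle a boundary. The helper names, the 64-byte cache line size, and the pad-only-when-crossing policy are assumptions for illustration, not something proposed in this thread; the start address and 21-byte size are read off the 4.8 listing earlier in the thread.

```python
CACHE_LINE = 64  # assumed cache line size in bytes (typical on x64)


def crosses_boundary(start: int, size: int, boundary: int = CACHE_LINE) -> bool:
    """Return True if the bytes [start, start + size) span a multiple of `boundary`."""
    if size <= 0:
        return False
    first_block = start // boundary
    last_block = (start + size - 1) // boundary
    return first_block != last_block


def padding_to_align(start: int, alignment: int) -> int:
    """Bytes of padding needed to round `start` up to a multiple of `alignment`."""
    return -start % alignment


def place(start: int, size: int, boundary: int = CACHE_LINE) -> int:
    """Hypothetical denser policy: pad only when a small body would cross `boundary`."""
    if size <= boundary and crosses_boundary(start, size, boundary):
        return start + padding_to_align(start, boundary)
    return start


# The 4.8 Invoke method above: starts at ...6040, last instruction ends at ...6055.
invoke_start = 0x7FF8F2816040
invoke_size = 0x7FF8F2816055 - invoke_start  # 21 bytes

print(crosses_boundary(invoke_start, invoke_size))  # fits within one cache line
print(crosses_boundary(48, invoke_size))            # same body at offset 48 straddles a line
print(padding_to_align(48, 32), place(48, invoke_size))
```

Under these assumptions, unconditional 32-byte alignment costs up to 31 bytes of padding per method, while the conditional policy pads only bodies that would otherwise straddle a boundary. Which one performs better in practice would have to be measured.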
Hi.
I saw a performance regression going from .NET 4.8 to dotnet core 3.1. It's small, so in practice this might not hurt most users, but I thought it's better to create an issue than to keep mum.
I noticed it when discussing my other issue, #2191, so the code will be similar. I don't think this is tail call related, but I don't know for sure.
When setting up a simple push stream pipeline
BenchmarkDotNet reports:
.NET 4.8 performs between 10% and 20% faster than dotnet core 3.1.
I dug a bit into the jitted assembly and found the following differences:
It seems that in dotnet core there's an extra nop at the start of each method. I suspected tiered compilation, but after much messing about trying to disable it, either it's unrelated or I wasn't able to disable it.
It surprises me that the nop adds this much overhead, but I can't spot anything else of significance.
The code is here: https://github.com/mrange/TryNewDisassembler/tree/fsharpPerformanceRegression
And here:
category:cq
theme:optimization
skill-level:intermediate
cost:medium