[Perf][Windows_NT][x86] Investigate ByteMark/BenchLUDecomp regression. #9833
Comments
Will investigate. |
For x86, I see 2.0 at about 1060, 2.1 at 1068. So maybe 1% slower. For x64, I see 2.0 at 714, 2.1 at 714. So I don't see any regression here. @jorive can you double-check what you saw? |
Oops, I was looking at SciMark's LU. Hold on a sec. |
Ok, for ByteMark LU, x86: 2.0 is around 2000, 2.1 is noisy but somewhere between 2060 and 2110. History is interesting here -- both master and 2.0 show a large impact from a CSE bug fix: dotnet/coreclr#15323 / dotnet/coreclr#15360 respectively. master has a couple of later smaller regressions that look to be harder to pin down. Here's master: [perf chart] and here's 2.0: [perf chart]. So it's probably worth looking back at that change too. Note x64 perf on the same test is comparable in 2.0 and master/2.1 and was not impacted by the CSE fix. |
Time is spent in `ludcmp`. Some experimentation shows cloning hurts perf quite a bit too (locally, in master, the score goes from 1730 with cloning to 1520 without). So I'll link this to the general item for re-examining cloning: #8558. |
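For readers following along: loop cloning duplicates a loop into a fast version guarded by up-front range checks, where per-iteration bounds checks can be dropped, plus a slow fallback that keeps them. A hand-written sketch of the shape on a simple indexed loop (illustrative only, not actual JIT output -- the real transform happens on the JIT's internal IR):

```cs
static double SumFirstN(double[] a, int n)
{
    double sum = 0;
    if (a != null && n <= a.Length)
    {
        // Fast (cloned) path: the guard above proves every a[k] is in
        // range, so the per-iteration bounds checks can be elided.
        for (int k = 0; k < n; k++)
            sum += a[k];
    }
    else
    {
        // Slow path: the original loop, bounds checks retained (so the
        // usual IndexOutOfRangeException behavior is preserved).
        for (int k = 0; k < n; k++)
            sum += a[k];
    }
    return sum;
}
```

The duplication roughly doubles the loop code and adds backedges, which is presumably part of how cloning drives up register pressure in this method (see the later comment on pressure and backedges).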
Backing out the changes from dotnet/coreclr#15323 gets me to around 1470 locally in master. Will look at diffs and see if we were cheating before or we lost a good optimization somehow. |
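For context on what that CSE change touches in this kernel: the inner loop (quoted in the next comment) recomputes the row reference `a[i]` on every iteration even though it is loop-invariant, and that is exactly the kind of redundant subexpression CSE keeps in a register. A hand-written equivalent of the optimization, as a sketch only (the JIT does this on its IR, not on source):

```cs
// Inner loop of the LU kernel with the CSE/hoisting written out by hand.
// Source form per iteration: sum -= a[i][k] * a[k][j];
static double InnerLoopAfterCse(double[][] a, int i, int j, double sum)
{
    double[] ai = a[i];          // a[i] is invariant across the k loop: hoist it
    for (int k = 0; k < i; k++)
    {
        double[] ak = a[k];      // one row-reference load per iteration
        sum -= ai[k] * ak[j];
    }
    return sum;
}
```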
Still not super-confident I understand why the newer version is slower. Inner loops look pretty comparable with one exception. I think this is the cloned version of the first deeply nested loop:

```cs
for (k = 0; k < i; k++)
    sum -= a[i][k] * a[k][j];
```

In the after version the loop body is disconnected and there is a spill/restore from `[ebp-38H]`:

```asm
;;; Before
G_M19375_IG39:
8B5D88 mov ebx, gword ptr [ebp-78H]
8BFB mov edi, ebx
C4E17B1054D708 vmovsd xmm2, qword ptr [edi+8*edx+8]
8B7C9108 mov edi, gword ptr [ecx+4*edx+8]
3B7704 cmp esi, dword ptr [edi+4]
0F8342080000 jae G_M19375_IG101
C4E16B5954F708 vmulsd xmm2, qword ptr [edi+8*esi+8]
C4E1735CCA vsubsd xmm1, xmm2
42 inc edx
3BD6 cmp edx, esi
895D88 mov gword ptr [ebp-78H], ebx
7CD3 jl SHORT G_M19375_IG39
;;; AFTER
G_M19375_IG37:
8B5D90 mov ebx, gword ptr [ebp-70H]
8BCB mov ecx, ebx
C4E17B1054D108 vmovsd xmm2, qword ptr [ecx+8*edx+8]
8B4DC8 mov ecx, gword ptr [ebp-38H]
8B449108 mov eax, gword ptr [ecx+4*edx+8]
3B7004 cmp esi, dword ptr [eax+4]
0F8346080000 jae G_M19375_IG102
C4E16B5954F008 vmulsd xmm2, qword ptr [eax+8*esi+8]
C4E1735CCA vsubsd xmm1, xmm2
42 inc edx
3BD6 cmp edx, esi
895D90 mov gword ptr [ebp-70H], ebx
7C0B jl SHORT G_M19375_IG38
8B5DEC mov ebx, dword ptr [ebp-14H]
897DD0 mov dword ptr [ebp-30H], edi
8B45CC mov eax, dword ptr [ebp-34H]
EB7B jmp SHORT G_M19375_IG48
G_M19375_IG38:
894DC8 mov gword ptr [ebp-38H], ecx
EBC0 jmp SHORT G_M19375_IG37 This method is probably a good stress test for spill placement as a whole, as there are large runs of blocks to spill and reload. So linking in dotnet/coreclr#16857. Likely cloning plays a big role in creating lots of pressure and backegdes. G_M19375_IG40:
8B5D88 mov ebx, gword ptr [ebp-78H]
EB2F jmp SHORT G_M19375_IG48
G_M19375_IG41:
8B5D88 mov ebx, gword ptr [ebp-78H]
EB2A jmp SHORT G_M19375_IG48
G_M19375_IG42:
895D88 mov gword ptr [ebp-78H], ebx
EB5E jmp SHORT G_M19375_IG49
G_M19375_IG43:
895D88 mov gword ptr [ebp-78H], ebx
EB59 jmp SHORT G_M19375_IG49
G_M19375_IG44:
8B55F0 mov edx, dword ptr [ebp-10H]
E972010000 jmp G_M19375_IG56
G_M19375_IG45:
8B5DC0 mov ebx, gword ptr [ebp-40H]
E9C0FEFFFF jmp G_M19375_IG35
G_M19375_IG46:
895DC0 mov gword ptr [ebp-40H], ebx
E9F5FEFFFF jmp G_M19375_IG36
G_M19375_IG47:
895DC0 mov gword ptr [ebp-40H], ebx
E9EDFEFFFF jmp G_M19375_IG36 So for the immediate 2.1 vs 2.0 issue I don't see anything fixable in the 2.1 timeframe. I am still going to try and understand the larger regression that both these branches saw a while back... |
Looking at the impact of dotnet/coreclr#15323 while jitting `ludcmp`: with the change, some now-dead statements are left in the IR. These statements all seem to sit in blocks that are on critical edges, e.g. BB119 below:
So these blocks and statements hang around a while. But the dead statements eventually get removed and the flowgraph gets compacted back into the same shape. However the block numbers are different (BB11 before becomes BB119 after) and in the after case the numbers no longer match the linear order:
This in turn changes the LSRA block sequence:
Note that in the before case BB11 is visited quite early, while in the after case the equivalent BB119 is visited quite late. And with this different allocation sequence we get different, and arguably worse, allocation overall. So it appears there is some sensitivity in LSRA to block IDs, and that (at a cursory glance anyway) things work better when block IDs and block physical order going into LSRA correspond one to one. It seems this might be commonly true, as the late dead code pass probably rarely allows for flowgraph simplification, but it's not guaranteed, and isn't true for this method after the change in dotnet/coreclr#15323. @CarolEidt any thoughts? Should I try ensuring that blocks are renumbered going into LSRA? Need to see if this somehow interferes with liveness. |
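To make the proposal concrete, here is a minimal sketch of the renumbering being suggested (hypothetical types; the JIT's own flowgraph renumbering is more involved, since it also has to keep any block-number-keyed side tables, like liveness sets, consistent):

```cs
// Minimal sketch: reassign block numbers in physical (layout) order so that
// number order and linear order agree going into LSRA. Illustrative only.
class BasicBlock
{
    public int Num;              // the ID LSRA's ordering is sensitive to
    public BasicBlock Next;      // successor in physical layout order
}

static void RenumberBlocks(BasicBlock first)
{
    int num = 1;
    for (var b = first; b != null; b = b.Next)
        b.Num = num++;           // afterwards, Num order == layout order
}
```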
Added a call to renumber the blocks going into LSRA. jit-diffs results for x86: |
@CarolEidt I'm going to assign this over to you for further evaluation -- let me know if you have any questions. I suspect we'll want to push this out of 2.1 but will let you make that call. |
@CarolEidt 2.1? |
@AndyAyersMS - I made some adjustments to the ordering that I thought ought to be better. It's a net win according to jit-diff, but shows just one regression- in ludcmp, even though jitting it results in a net improvement, with the loop you identify above not being split. There's more spill and more moves, but fewer split edges (21 instead of 43), and many fewer split backedges. Dumping disasm and then running jit-analyze on it (jitted version) shows:
However, running Release versions on my system shows that it is actually slower (though the slowdown is less than the standard deviation). I attempted to start a private perf run, but it's been "pending - Waiting for next available executor" for some time. At this point, I think it's best to choose discretion over valor, and defer this to "Future". |
There is ~4.5% regression from
release/2.0.0
torelease/2.1
category:cq
theme:benchmarks
skill-level:intermediate
cost:small
The text was updated successfully, but these errors were encountered: