-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Perf] Windows/x64: 4 Regressions on 5/2/2023 10:35:24 AM #85987
Comments
Commit range is 3e8f17a...4772b5d. Maybe #85620, @jakobbotsch? |
Very possible, I'll take a look |
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsRun Information
Regressions in System.Memory.ReadOnlySpan
ReproGeneral Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md Payloadsgit clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Memory.ReadOnlySpan*' PayloadsHistogramSystem.Memory.ReadOnlySpan.Trim(input: "")
Description of detection logic
JIT DisasmsDocsProfiling workflow for dotnet/runtime repository Run Information
Regressions in System.Tests.Perf_Boolean
ReproGeneral Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md Payloadsgit clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Tests.Perf_Boolean*' PayloadsHistogramSystem.Tests.Perf_Boolean.TryParse(value: "Bogus")
Description of detection logic
JIT DisasmsSystem.Tests.Perf_Boolean.Parse(value: " True ")
Description of detection logic
JIT DisasmsDocsProfiling workflow for dotnet/runtime repository Run Information
Regressions in System.Numerics.Tests.Perf_BigInteger
ReproGeneral Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md Payloadsgit clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_BigInteger*' PayloadsHistogramSystem.Numerics.Tests.Perf_BigInteger.Ctor_ByteArray(numberString: -2147483648)
Description of detection logic
JIT DisasmsDocsProfiling workflow for dotnet/runtime repository
|
I have missed a check for a local store in PR #85620 -- we leave [000092] -A-XG------ ▌ STORE_BLK struct<System.ReadOnlySpan`1, 16> (copy)
[000090] -----+----- ├──▌ LCL_VAR byref V00 RetBuf
[000091] ----G+----- └──▌ LCL_VAR struct<System.ReadOnlySpan`1, 16>(AX)(P) V11 tmp9
▌ byref V11.System.ReadOnlySpan`1[ushort]:_reference (offs=0x00) -> V40 tmp38
▌ int V11.System.ReadOnlySpan`1[ushort]:_length (offs=0x08) -> V41 tmp39 as a block copy after that PR and since the destination can be arbitrary heap memory that now requires a helper (that we didn't previously need). Well, not fundamentally, block copying in the backend could be smarter, but currently isn't. |
I think #80086 tracks making it smarter. The cases that are regressing here seem to end up with helper calls for |
Even with #80086 fixed we end up failling back to mov rdi, rbx
lea rsi, bword ptr [rsp+30H]
call CORINFO_HELP_ASSIGN_BYREF
movsq to diff (#80086 fixed) G_M48932_IG12: ;; offset=00B5H
mov rdi, rbx
lea rsi, bword ptr [rsp+30H]
movsq
movsq to diff2 (morphing to field-by-field copy, original codegen) mov rax, bword ptr [rsp+30H]
mov bword ptr [rsi], rax
mov eax, dword ptr [rsp+38H]
mov dword ptr [rsi+08H], eax gives me
The For the time being I will just revert parts of #85620 by allowing field-by-field morphing when the destination is potential heap and the source is a local. |
Keeping this open until I can verify the graphs are back to the old perf levels. |
The diff looks like what I would expect, it replaces field-by-field copies using GPR registers with a single SIMD copy. @EgorBo is it expected that this is so much slower than using multiple GPR registers? |
I don't see these regressions in the FullPGO win-x64 runs: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_Windows%2010.0.18362_PGOType%3Dfullpgo/AllTestindex.html so maybe it's just some intel erratum issue or something like that? |
I think you improved it subsequently, but you can see that my "fix" PR only was a minor improvement on the graph, and didn't exactly return it back to previous levels. My table above is from my own machine (5950X). I think it's likely the same kind of store-forwarding problem that @AndyAyersMS saw recently. Since the mov dword ptr [rsp+68H], eax and then vmovdqu xmm0, xmmword ptr [rsp+60H]
vmovdqu xmmword ptr [rsp+50H], xmm0 is significantly worse compared to the original mov rdx, bword ptr [rsp+60H]
mov bword ptr [rsp+50H], rdx
mov edx, dword ptr [rsp+68H]
mov dword ptr [rsp+58H], edx . The latter only reads the 4 bytes that were previously written, so there is no stall. Of course this problem is not limited to structures with GC pointers, so the heuristic in block morphing was really just getting lucky here... |
OTOH we do zero the full structure in the prolog, so I'm not sure if it's store-forwarding after all (can the CPU piece together two separate stores?). Will try to see if I can check some of the hardware counters. |
Maybe for 2 simd loads we recieve a worse penalty for crossing cache line boundary? Interesting, didn't realize stole-forwarding is such a problem (if it is) |
That's also possible, is the penalty supposed to be this large? I looked at the base/diff in vtune (couldn't get µProf to work). Base is before #85620, diff is the same commit but with #86246 manually applied. However, the "loads blocked by store forwarding" does not really show up where I would expect it, it shows up in The base has the same exact assembly but no "loads blocked by store forwarding", so maybe there's just some drift or misattribution going on by vtune: Seems odd... let me retry some runs with memory randomization. |
We had quite a few regressions in the past where we had no good explanation (becuase codegen was the same) and we ended up blaming code layout in the loader heap (how jitted functions located)/GC. |
Seems like with the block copy the CPU is stuck waiting to resolve that compare/conditional branch, the CPI is awful compared to the base. If I then choose to expand specifically that one with field-by-field copies, it improves, but then the following block copy (that I left) becomes the bottle neck. Sadly we don't really have the framework necessary to analyze this and make a smart decision. So I need to consider whether I should fully revert the change (and accept that we cannot really touch that heuristic) or not. |
You mean a pattern like (wide-store, narrow-store, wide-load)? It is possible the HW can merge stores I suppose, or maybe forward from multiple outstanding stores, if all this happens in a close sequence. But the commentary on https://stackoverflow.com/questions/46135766/can-modern-x86-implementations-store-forward-from-more-than-one-prior-store would suggest it is unlikely.
I wonder if you are seeing some kind of severe sample skid. Would not ever expect a narrow (byte) load to be impacted by a store forwarding stall. |
Yeah, that's what I meant.
Seems likely to me also. |
I'm going to call this last |
Run Information
Regressions in System.Memory.ReadOnlySpan
Test Report
Repro
General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md
Payloads
Baseline
Compare
Payloads
Baseline
Compare
Histogram
System.Memory.ReadOnlySpan.Trim(input: "")
Description of detection logic
JIT Disasms
Baseline
Compare
Diff
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
Run Information
Regressions in System.Tests.Perf_Boolean
Test Report
Repro
General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md
Payloads
Baseline
Compare
Payloads
Baseline
Compare
Histogram
System.Tests.Perf_Boolean.TryParse(value: "Bogus")
Description of detection logic
JIT Disasms
Baseline
Compare
Diff
System.Tests.Perf_Boolean.Parse(value: " True ")
Description of detection logic
JIT Disasms
Baseline
Compare
Diff
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
Run Information
Regressions in System.Numerics.Tests.Perf_BigInteger
Test Report
Repro
General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md
Payloads
Baseline
Compare
Payloads
Baseline
Compare
Histogram
System.Numerics.Tests.Perf_BigInteger.Ctor_ByteArray(numberString: -2147483648)
Description of detection logic
JIT Disasms
Baseline
Compare
Diff
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
The text was updated successfully, but these errors were encountered: