[Perf] Windows/x64: 4 Regressions on 5/2/2023 10:35:24 AM #85987

Closed
performanceautofiler bot opened this issue May 9, 2023 · 18 comments · Fixed by #86246
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows Priority:2 Work that is important, but not critical for the release runtime-coreclr specific to the CoreCLR runtime
Milestone

Comments

@performanceautofiler

Run Information

Name Value
Architecture x64
OS Windows 10.0.18362
Queue TigerWindows
Baseline da0aa0cb6944dd49d6c1d1859c4530fe7e38b76f
Compare c62f69be1405a8e41b56ffc05f22d791bf4c7d2d
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in System.Memory.ReadOnlySpan

| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trim - Duration of single invocation | 1.21 ns | 6.16 ns | 5.08 | 0.06 | False | 25.748831262174555 | 32.550956279423076 | 1.26417218505917 | Trace | Trace |

Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

Payloads

Baseline
Compare

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.Trim(input: "")


Description of detection logic

IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 6.1584372859667615 > 1.2741125892426048.
IsChangePoint: Marked as a change because one of 4/1/2023 8:54:51 PM, 5/2/2023 4:25:00 AM, 5/9/2023 7:24:34 AM falls between 4/30/2023 6:17:41 PM and 5/9/2023 7:24:34 AM.
IsRegressionStdDev: Marked as regression because -100.89420966987244 (T) = (0 -6.161853322392986) / Math.Sqrt((0.03688747735264881 / (16)) + (5.8223116837808174E-05 / (18))) is less than -2.03693334345674 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (16) + (18) - 2, .025) and -3.6894142377733004 = (1.3139921128654337 - 6.161853322392986) / 1.3139921128654337 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked not as a regression because Edge Detector said so.
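
The `IsRegressionStdDev` check above can be reproduced by hand. The following is a minimal Python sketch, not the autofiler's actual implementation: the function names are hypothetical, and the critical value is copied from the report rather than recomputed. One observation from working through the numbers: the reported statistic of about -100.89 appears to come from using the baseline mean minus the compare mean as the numerator (as in the second expression on the same line), not the `0 - compare` form printed in the message.

```python
import math


def t_statistic(base_mean, base_var, base_n, cmp_mean, cmp_var, cmp_n):
    """Two-sample t statistic with unpooled variances."""
    standard_error = math.sqrt(base_var / base_n + cmp_var / cmp_n)
    return (base_mean - cmp_mean) / standard_error


def relative_change(base_mean, cmp_mean):
    """Relative difference of the compare mean against the baseline mean."""
    return (base_mean - cmp_mean) / base_mean


# Values copied from the Trim report above. Which variance belongs to the
# baseline is an assumption (first term = baseline); the sum is symmetric,
# so the result does not depend on that choice.
BASE_MEAN, BASE_VAR, BASE_N = 1.3139921128654337, 0.03688747735264881, 16
CMP_MEAN, CMP_VAR, CMP_N = 6.161853322392986, 5.8223116837808174e-05, 18
T_CRITICAL = -2.03693334345674  # StudentT.InvCDF(0, 1, 16 + 18 - 2, .025), as reported
THRESHOLD = -0.05               # the 5% threshold, as reported

t = t_statistic(BASE_MEAN, BASE_VAR, BASE_N, CMP_MEAN, CMP_VAR, CMP_N)
change = relative_change(BASE_MEAN, CMP_MEAN)
is_regression = t < T_CRITICAL and change < THRESHOLD

print(f"t = {t:.3f}, relative change = {change:.3f}, regression = {is_regression}")
```

Both conditions have to hold for the benchmark to be flagged, so a statistically significant but tiny slowdown (under 5%) would not be reported by this check.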

JIT Disasms

Baseline
Compare
Diff

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository


Run Information

Name Value
Architecture x64
OS Windows 10.0.18362
Queue TigerWindows
Baseline da0aa0cb6944dd49d6c1d1859c4530fe7e38b76f
Compare c62f69be1405a8e41b56ffc05f22d791bf4c7d2d
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in System.Tests.Perf_Boolean

| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TryParse - Duration of single invocation | 8.23 ns | 16.09 ns | 1.95 | 0.08 | True | 107.86852800069764 | 103.31347656402268 | 0.9577721924911639 | Trace | Trace |
| Parse - Duration of single invocation | 11.15 ns | 23.30 ns | 2.09 | 0.04 | True | | | | | |

Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

Payloads

Baseline
Compare

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Tests.Perf_Boolean*'

Payloads

Baseline
Compare

Histogram

System.Tests.Perf_Boolean.TryParse(value: "Bogus")


Description of detection logic

IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 16.09071662999389 > 8.645237780024415.
IsChangePoint: Marked as a change because one of 5/2/2023 4:25:00 AM, 5/9/2023 7:24:34 AM falls between 4/30/2023 6:17:41 PM and 5/9/2023 7:24:34 AM.
IsRegressionStdDev: Marked as regression because -434.8770093633579 (T) = (0 -16.110760034502622) / Math.Sqrt((0.001710119303328648 / (16)) + (0.0039529119334182775 / (18))) is less than -2.03693334345674 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (16) + (18) - 2, .025) and -0.9521170175578498 = (8.252968387447188 - 16.110760034502622) / 8.252968387447188 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked as regression because Edge Detector said so.

JIT Disasms

Baseline
Compare
Diff

System.Tests.Perf_Boolean.Parse(value: " True ")


Description of detection logic

IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 23.3007016972274 > 11.713417691505466.
IsChangePoint: Marked as a change because one of 5/2/2023 4:25:00 AM, 5/9/2023 7:24:34 AM falls between 4/30/2023 6:17:41 PM and 5/9/2023 7:24:34 AM.
IsRegressionStdDev: Marked as regression because -237.4715523711942 (T) = (0 -23.51061790159726) / Math.Sqrt((0.001930512359178462 / (16)) + (0.04661613103783327 / (18))) is less than -2.03693334345674 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (16) + (18) - 2, .025) and -1.1090681140482666 = (11.147396210200926 - 23.51061790159726) / 11.147396210200926 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked as regression because Edge Detector said so.

JIT Disasms

Baseline
Compare
Diff

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository


Run Information

Name Value
Architecture x64
OS Windows 10.0.18362
Queue TigerWindows
Baseline da0aa0cb6944dd49d6c1d1859c4530fe7e38b76f
Compare c62f69be1405a8e41b56ffc05f22d791bf4c7d2d
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in System.Numerics.Tests.Perf_BigInteger

| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ctor_ByteArray - Duration of single invocation | 12.28 ns | 14.23 ns | 1.16 | 0.10 | False | 150.3472191466558 | 156.214019115072 | 1.0390216726435988 | Trace | Trace |

Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

Payloads

Baseline
Compare

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_BigInteger*'

Payloads

Baseline
Compare

Histogram

System.Numerics.Tests.Perf_BigInteger.Ctor_ByteArray(numberString: -2147483648)


Description of detection logic

IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 14.229877638173855 > 12.895417745614084.
IsChangePoint: Marked as a change because one of 5/2/2023 4:25:00 AM, 5/9/2023 7:24:34 AM falls between 4/30/2023 6:17:41 PM and 5/9/2023 7:24:34 AM.
IsRegressionStdDev: Marked as regression because -17.21045122210507 (T) = (0 -14.152620211094012) / Math.Sqrt((0.013736477857531142 / (16)) + (0.21019152484066211 / (21))) is less than -2.0301079282477414 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (16) + (21) - 2, .025) and -0.14517624476570484 = (12.358464712992314 - 14.152620211094012) / 12.358464712992314 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked not as a regression because Edge Detector said so.

JIT Disasms

Baseline
Compare
Diff

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@performanceautofiler performanceautofiler bot added arch-x64 os-windows runtime-coreclr specific to the CoreCLR runtime untriaged New issue has not been triaged by the area owner labels May 9, 2023
@cincuranet cincuranet removed the untriaged New issue has not been triaged by the area owner label May 9, 2023
@cincuranet cincuranet transferred this issue from dotnet/perf-autofiling-issues May 9, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 9, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 9, 2023
@cincuranet
Contributor

Commit range is 3e8f17a...4772b5d. Maybe #85620, @jakobbotsch?

@jakobbotsch
Member

Very possible, I'll take a look

@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 10, 2023
@ghost

ghost commented May 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: performanceautofiler[bot]
Assignees: jakobbotsch
Labels:

os-windows, arch-x64, area-CodeGen-coreclr, untriaged, runtime-coreclr, needs-area-label

Milestone: -

@vcsjones vcsjones removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 11, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone May 11, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label May 11, 2023
@jakobbotsch
Member

I have missed a check for a local store in PR #85620 -- we leave

               [000092] -A-XG------                           STORE_BLK struct<System.ReadOnlySpan`1, 16> (copy)
               [000090] -----+-----                         ├──▌  LCL_VAR   byref  V00 RetBuf       
               [000091] ----G+-----                         └──▌  LCL_VAR   struct<System.ReadOnlySpan`1, 16>(AX)(P) V11 tmp9         
                                                                   byref  V11.System.ReadOnlySpan`1[ushort]:_reference (offs=0x00) -> V40 tmp38        
                                                                   int    V11.System.ReadOnlySpan`1[ushort]:_length (offs=0x08) -> V41 tmp39        

as a block copy after that PR, and since the destination can be arbitrary heap memory, that now requires a helper call (which we didn't previously need). This isn't a fundamental requirement: block copying in the backend could be smarter, but currently isn't.

@jakobbotsch
Member

I think #80086 tracks making it smarter. The cases that are regressing here seem to end up with helper calls for Span<T>/ReadOnlySpan<T> for the byref field, even though that is unnecessary. Let me try to fix that instead.

@jakobbotsch
Member

jakobbotsch commented May 15, 2023

Even with #80086 fixed we end up falling back to movsq inside xarch's genCodeForCpObj, which appears to be very slow. Looking at the System.Memory.ReadOnlySpan benchmark and comparing base (current main):

       mov      rdi, rbx
       lea      rsi, bword ptr [rsp+30H]
       call     CORINFO_HELP_ASSIGN_BYREF
       movsq

to diff (#80086 fixed)

G_M48932_IG12:  ;; offset=00B5H
       mov      rdi, rbx
       lea      rsi, bword ptr [rsp+30H]
       movsq
       movsq

to diff2 (morphing to field-by-field copy, original codegen)

       mov      rax, bword ptr [rsp+30H]
       mov      bword ptr [rsi], rax
       mov      eax, dword ptr [rsp+38H]
       mov      dword ptr [rsi+08H], eax

gives me

| Method | Job | Toolchain | input | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trim | Job-FJFRPI | base\corerun.exe | | 4.2079 ns | 0.0115 ns | 0.0107 ns | 4.2093 ns | 4.1878 ns | 4.2257 ns | 1.00 | - | NA |
| Trim | Job-OPHUHH | diff\corerun.exe | | 4.4444 ns | 0.0294 ns | 0.0275 ns | 4.4376 ns | 4.4005 ns | 4.4931 ns | 1.06 | - | NA |
| Trim | Job-IQSHDD | diff2\corerun.exe | | 0.8768 ns | 0.0117 ns | 0.0104 ns | 0.8727 ns | 0.8663 ns | 0.8998 ns | 0.21 | - | NA |

The movsq inefficiency looks like #7469... we should probably reprioritize that issue given that Span<T>/ReadOnlySpan<T> falls into the category of "needs atomic field copies but does not require write barrier".

For the time being I will just revert parts of #85620 by allowing field-by-field morphing when the destination is potential heap and the source is a local.
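
To make the comparison in the table above concrete, here is a small Python sketch that recomputes the Ratio column from the reported means. It assumes Ratio is each job's mean divided by the baseline job's mean, rounded to two decimals, which is consistent with the values shown; the dictionary and function names are made up for illustration.

```python
def ratio(mean_ns, baseline_mean_ns):
    """Mean relative to the baseline mean, rounded like the table's Ratio column."""
    return round(mean_ns / baseline_mean_ns, 2)


# Mean column of the Trim table above, in nanoseconds.
TRIM_MEANS_NS = {"base": 4.2079, "diff": 4.4444, "diff2": 0.8768}

ratios = {
    name: ratio(mean, TRIM_MEANS_NS["base"])
    for name, mean in TRIM_MEANS_NS.items()
}
print(ratios)
```

Read this way, replacing the helper call with a second movsq (diff) is slightly slower than base on this machine, while the field-by-field copy (diff2) takes roughly a fifth of the time.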

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 15, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 15, 2023
@jakobbotsch
Member

Keeping this open until I can verify the graphs are back to the old perf levels.

@jakobbotsch jakobbotsch reopened this May 17, 2023
@jakobbotsch
Member

System.Memory.ReadOnlySpan.Trim is fixed, but there is still a significant regression for System.Tests.Perf_Boolean.TryParse. The codegen diff there is https://www.diffchecker.com/5Qg6aVNW/ which gives the following perf differences on my machine:

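| Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TryParse | Job-PWSREE | base\corerun.exe | Bogus | 4.382 ns | 0.0249 ns | 0.0233 ns | 4.385 ns | 4.353 ns | 4.427 ns | 1.00 | 0.00 | - | NA |
| TryParse | Job-QMXGIG | diff\corerun.exe | Bogus | 11.154 ns | 0.1018 ns | 0.0952 ns | 11.147 ns | 11.019 ns | 11.312 ns | 2.55 | 0.03 | - | NA |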

The diff looks like what I would expect, it replaces field-by-field copies using GPR registers with a single SIMD copy. @EgorBo is it expected that this is so much slower than using multiple GPR registers?

@EgorBo
Member

EgorBo commented May 19, 2023

> System.Memory.ReadOnlySpan.Trim is fixed, but there is still a significant regression for System.Tests.Perf_Boolean.TryParse. The codegen diff there is https://www.diffchecker.com/5Qg6aVNW/ which gives the following perf differences on my machine:
>
> | Method | Job | Toolchain | value | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | TryParse | Job-PWSREE | base\corerun.exe | Bogus | 4.382 ns | 0.0249 ns | 0.0233 ns | 4.385 ns | 4.353 ns | 4.427 ns | 1.00 | 0.00 | - | NA |
> | TryParse | Job-QMXGIG | diff\corerun.exe | Bogus | 11.154 ns | 0.1018 ns | 0.0952 ns | 11.147 ns | 11.019 ns | 11.312 ns | 2.55 | 0.03 | - | NA |
>
> The diff looks like what I would expect, it replaces field-by-field copies using GPR registers with a single SIMD copy. @EgorBo is it expected that this is so much slower than using multiple GPR registers?

I don't see these regressions in the FullPGO win-x64 runs: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_Windows%2010.0.18362_PGOType%3Dfullpgo/AllTestindex.html, so maybe it's just some Intel erratum or something like that?

@jakobbotsch
Member

jakobbotsch commented May 19, 2023

> I don't see these regressions in the FullPGO win-x64 runs: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/main_x64_Windows%2010.0.18362_PGOType%3Dfullpgo/AllTestindex.html, so maybe it's just some Intel erratum or something like that?

I think you improved it subsequently, but you can see that my "fix" PR was only a minor improvement on the graph and didn't return it to previous levels. My table above is from my own machine (5950X).

I think it's likely the same kind of store-forwarding problem that @AndyAyersMS saw recently. Since the Span<T> length is 4 bytes we have a previous store like:

mov      dword ptr [rsp+68H], eax

and then

vmovdqu  xmm0, xmmword ptr [rsp+60H]
vmovdqu  xmmword ptr [rsp+50H], xmm0

is significantly worse than the original:

       mov      rdx, bword ptr [rsp+60H]
       mov      bword ptr [rsp+50H], rdx
       mov      edx, dword ptr [rsp+68H]
       mov      dword ptr [rsp+58H], edx

The latter only reads the 4 bytes that were previously written, so there is no stall.

Of course this problem is not limited to structures with GC pointers, so the heuristic in block morphing was really just getting lucky here...
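
The hypothesis above can be illustrated with a deliberately simplified model. The Python sketch below assumes a load can be forwarded only when it is fully contained in a single prior store, and stalls when it overlaps that store only partially; real microarchitectures have more nuanced rules, and the `Outcome` and `classify` names are hypothetical. Offsets are relative to rsp, taken from the instructions quoted above.

```python
from enum import Enum


class Outcome(Enum):
    NO_OVERLAP = "no overlap with the store, load is served normally"
    FORWARDED = "load fully covered by the store, can be forwarded"
    STALL = "load only partially covered by the store, forwarding fails"


def classify(store_off, store_size, load_off, load_size):
    """Classify a load against one earlier store (simplified model)."""
    store_end = store_off + store_size
    load_end = load_off + load_size
    if load_end <= store_off or store_end <= load_off:
        return Outcome.NO_OVERLAP
    if store_off <= load_off and load_end <= store_end:
        return Outcome.FORWARDED
    return Outcome.STALL


# mov dword ptr [rsp+68H], eax  -> 4-byte store at offset 0x68
STORE = (0x68, 4)

# vmovdqu xmm0, xmmword ptr [rsp+60H] -> 16-byte load covering the store and more
print("block copy load:  ", classify(*STORE, 0x60, 16).name)
# mov rdx, bword ptr [rsp+60H] -> 8-byte load that does not touch the store
print("field load (ref): ", classify(*STORE, 0x60, 8).name)
# mov edx, dword ptr [rsp+68H] -> 4-byte load matching the store exactly
print("field load (len): ", classify(*STORE, 0x68, 4).name)
```

Under this model only the 16-byte SIMD load is at risk of a stall, which matches the intuition behind preferring field-by-field copies here; it says nothing about whether that is the actual cause of the measured slowdown.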

@jakobbotsch
Member

OTOH we do zero the full structure in the prolog, so I'm not sure if it's store-forwarding after all (can the CPU piece together two separate stores?). Will try to see if I can check some of the hardware counters.

@EgorBo
Member

EgorBo commented May 19, 2023

Maybe for 2 SIMD loads we receive a worse penalty for crossing a cache line boundary?

Interesting, I didn't realize store-forwarding is such a problem (if it is).

@jakobbotsch
Member

> Maybe for 2 SIMD loads we receive a worse penalty for crossing a cache line boundary?

That's also possible, is the penalty supposed to be this large?

I looked at the base/diff in VTune (couldn't get µProf to work). Base is before #85620, diff is the same commit but with #86246 manually applied.

The diff shows:
image

Compared to the base:
image

However, the "loads blocked by store forwarding" does not really show up where I would expect it, it shows up in System.Boolean.TrimWhiteSpaceAndNull:
image

The base has the same exact assembly but no "loads blocked by store forwarding", so maybe there's just some drift or misattribution going on by vtune:
image

Seems odd... let me retry some runs with memory randomization.

@EgorBo
Member

EgorBo commented May 19, 2023

We had quite a few regressions in the past where we had no good explanation (because codegen was the same) and we ended up blaming code layout in the loader heap (how jitted functions are located) or the GC.

@jakobbotsch
Member

jakobbotsch commented May 19, 2023

image

Seems like with the block copy the CPU is stuck waiting to resolve that compare/conditional branch; the CPI is awful compared to the base. If I then expand specifically that one with field-by-field copies, it improves, but then the following block copy (that I left) becomes the bottleneck.

Sadly we don't really have the framework necessary to analyze this and make a smart decision. So I need to consider whether I should fully revert the change (and accept that we cannot really touch that heuristic) or not.

@AndyAyersMS
Member

AndyAyersMS commented May 19, 2023

> OTOH we do zero the full structure in the prolog, so I'm not sure if it's store-forwarding after all (can the CPU piece together two separate stores?). Will try to see if I can check some of the hardware counters.

You mean a pattern like (wide-store, narrow-store, wide-load)? It is possible the HW can merge stores I suppose, or maybe forward from multiple outstanding stores, if all this happens in a close sequence. But the commentary on https://stackoverflow.com/questions/46135766/can-modern-x86-implementations-store-forward-from-more-than-one-prior-store would suggest it is unlikely.

> Seems odd... let me retry some runs with memory randomization.

I wonder if you are seeing some kind of severe sample skid. Would not ever expect a narrow (byte) load to be impacted by a store forwarding stall.

@jakobbotsch
Member

> You mean a pattern like (wide-store, narrow-store, wide-load)? It is possible the HW can merge stores I suppose, or maybe forward from multiple outstanding stores, if all this happens in a close sequence. But the commentary on https://stackoverflow.com/questions/46135766/can-modern-x86-implementations-store-forward-from-more-than-one-prior-store would suggest it is unlikely.

Yeah, that's what I meant.

> I wonder if you are seeing some kind of severe sample skid. Would not ever expect a narrow (byte) load to be impacted by a store forwarding stall.

Seems likely to me also.

@jakobbotsch
Member

I'm going to call this last System.Tests.Perf_Boolean.TryParse benchmark fixed by improvements elsewhere (in this case enabling of tiered PGO). I think unifying the logic for types with and without GC pointers is general goodness and if we want to be smarter we should explicitly try to model some of these concerns about overlapping field and block stores.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 29, 2023