Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf] Linux/x64: Regressions in System.Collections.Perf_LengthBucketsFrozenDictionary #105328

Open
performanceautofiler bot opened this issue Jul 23, 2024 · 11 comments
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) Priority:2 Work that is important, but not critical for the release runtime-coreclr specific to the CoreCLR runtime
Milestone

Comments

@performanceautofiler
Copy link

performanceautofiler bot commented Jul 23, 2024

Run Information

Name Value
Architecture x64
OS ubuntu 22.04
Queue TigerUbuntu
Baseline 078530371f31d23945990b47314f11d6e9f1f7c1
Compare a3fd0957239a6cf44edee68e9bc0a158f15425c9
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in System.Collections.Perf_LengthBucketsFrozenDictionary

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio
46.48 μs 50.63 μs 1.09 0.04 False
1.13 μs 1.99 μs 1.76 0.22 False
100.12 ns 181.76 ns 1.82 0.01 True
100.14 ns 180.20 ns 1.80 0.01 True
10.08 ns 19.76 ns 1.96 0.03 True
1.19 μs 1.98 μs 1.66 0.05 True
9.94 ns 20.07 ns 2.02 0.03 True

graph
graph
graph
graph
graph
graph
graph
Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Collections.Perf_LengthBucketsFrozenDictionary*'

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_True_FrozenDictionary(Count: 10000, ItemsPerBucket: 1)

ETL Files

Histogram

JIT Disasms

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 1000, ItemsPerBucket: 5)

ETL Files

Histogram

JIT Disasms

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 100, ItemsPerBucket: 1)

ETL Files

Histogram

JIT Disasms

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 100, ItemsPerBucket: 5)

ETL Files

Histogram

JIT Disasms

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 10, ItemsPerBucket: 1)

ETL Files

Histogram

JIT Disasms

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 1000, ItemsPerBucket: 1)

ETL Files

Histogram

JIT Disasms

System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 10, ItemsPerBucket: 5)

ETL Files

Histogram

JIT Disasms

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@performanceautofiler performanceautofiler bot added arch-x64 os-linux Linux OS (any supported distro) runtime-coreclr specific to the CoreCLR runtime untriaged New issue has not been triaged by the area owner labels Jul 23, 2024
@LoopedBard3 LoopedBard3 transferred this issue from dotnet/perf-autofiling-issues Jul 23, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 23, 2024
@LoopedBard3 LoopedBard3 changed the title [Perf] Linux/x64: 11 Regressions on 7/18/2024 10:05:20 PM [Perf] Linux/x64: Regressions in System.Collections.Perf_LengthBucketsFrozenDictionary Jul 23, 2024
@LoopedBard3
Copy link
Member

LoopedBard3 commented Jul 23, 2024

Likely caused by: #105131, linked improvements in PR.

@LoopedBard3
Copy link
Member

LoopedBard3 commented Jul 23, 2024

@jakobbotsch jakobbotsch self-assigned this Jul 23, 2024
@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 24, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 24, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jul 24, 2024
@jakobbotsch jakobbotsch added Priority:2 Work that is important, but not critical for the release and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 25, 2024
@jakobbotsch
Copy link
Member

Report for the regressions in the win-x64 issues above: https://github.com/jakobbotsch/perf-diff-finder/blob/main/strengthreduction/results.md

Of the 15 benchmarks mentioned in those issues only 3 of them have actual strength reductions in a hot method:

  • System.Tests.Perf_String.IndexerCheckBoundCheckHoist
  • System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 10000, ItemsPerBucket: 1)
  • System.Linq.Tests.Perf_Enumerable.SelectToList(input: IEnumerable)

@jakobbotsch
Copy link
Member

The regressions from 7/15 are likely caused by #104603 instead.

@jakobbotsch
Copy link
Member

The larger regressions in this issue (like System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 10, ItemsPerBucket: 5)) seem to be only on some particular Intel CPUs, so I am going to guess it is JCC erratum (indeed it looks like there is a cmp/jge pair straddling a 32 byte boundary in a hot block in the diff).

For IndexerCheckBoundCheckHoist I am seeing some AMD CPU regressions (not totally clear, I'm seeing some bimodality after the strength reduction) while I see improvements on Intel/ARM64 CPUs on the benchmark.

@stephentoub
Copy link
Member

so I am going to guess it is JCC erratum

Do we have a plan for when we're going to tackle that in general?

@jakobbotsch
Copy link
Member

so I am going to guess it is JCC erratum

Do we have a plan for when we're going to tackle that in general?

It was in our initial plans for .NET 9 (#93243), but ended up not making it. Personally I hope we can solve this with new perf lab hardware instead... I believe the CPUs on these queues are i7-8700's that are 7 years old at this point. I think it's been a while since the last Intel CPU with the problem was available for sale.

@jakobbotsch
Copy link
Member

jakobbotsch commented Aug 14, 2024

Oddly the IndexerCheckBoundCheckHoist regression does not reproduce on my own AMD CPU:

BenchmarkDotNet v0.13.13-nightly.20240311.145, Windows 10 (10.0.19045.4651/22H2/2022Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.6.24328.19
  [Host]     : .NET 9.0.0 (9.0.24.32707), X64 RyuJIT AVX2
  Job-QZGOXF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-VNPDCN : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Method Job Toolchain Mean Error StdDev Median Min Max Ratio Allocated Alloc Ratio
IndexerCheckBoundCheckHoist Job-QZGOXF \41e02e5549a7954b068f342cd9bd6611ef764cc3\corerun.exe 15.44 ns 0.018 ns 0.015 ns 15.44 ns 15.41 ns 15.47 ns 0.99 - NA
IndexerCheckBoundCheckHoist Job-VNPDCN \f9c08462ba81a2aacc9fd85d321e12080c17b9ee\corerun.exe 15.54 ns 0.029 ns 0.027 ns 15.53 ns 15.51 ns 15.60 ns 1.00 - NA

The multi-config graph looks like this:
image

Does not seem like there is anything actionable there.

@jakobbotsch
Copy link
Member

Here's a report for the 13 win-x64 regressions in dotnet/perf-autofiling-issues#38686 and dotnet/perf-autofiling-issues#38716 from July 15th around #104603: https://github.com/jakobbotsch/perf-diff-finder/blob/main/straighten/results.md

Lots of large-scale diffs there which makes it hard. Some notes from just looking at the codegen difference:

System.Reflection.Invoke.Property_Set_int

Large scale diffs that I haven't tried digging into yet.

System.Collections.IterateForEach<Int32>.ImmutableHashSet(Size: 512)

Lots of different block ordering, but haven't dug deeper.

System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>.Enumerate(Size: 10)

Diff here is a new IV widening leading to the removal of a zero extension. The regression is only seen on AMD EPYC CPUs it looks like.

Perf_Enumerable.Repeat

Only diff is removal of an unnecessary compare. Multi-config graph does not show a regression for any other config.

System.Linq.Tests.Perf_Enumerable.SelectToList(input: IEnumerable)

Same as above.

System.Linq.Tests.Perf_OrderBy.OrderByCustomComparer(NumberOfPeople: 512)

Lots of different block ordering, but haven't dug deeper.

System.Collections.CtorFromCollection<Int32>.LinkedList(Size: 512)

Only diff is removal of an unnecessary compare and some different registers allocated.

System.Text.Json.Serialization.Tests.ReadJson<MyEventsListerViewModel>.DeserializeFromReader(Mode: Reflection)

Returned back to normal.

Microsoft.Extensions.Primitives.StringSegmentBenchmark.Equals_Object_Valid

This looks bimodal.

System.Text.RegularExpressions.Tests.Perf_Regex_Common.ReplaceWords(Options: None)

Large scale differences in block layout / register allocation. Haven't dug deeper into why, but presumably that's just caused by different flow graph layouts caused by the PR. Doesn't look like there's a regression here from .NET 9.

System.Text.RegularExpressions.Tests.Perf_Regex_Common.Date_IsNotMatch(Options: None)

Like above.

System.Tests.Perf_String.Concat_CharEnumerable

Also large diffs here, but I don't see an overall .NET 9 regression.

@jakobbotsch
Copy link
Member

System.Reflection.Invoke.Property_Set_int has returned back to normal now. Based on the report above, I believe the only benchmarkss that require some more investigation at this point are System.Collections.IterateForEach<Int32>.ImmutableHashSet(Size: 512) and System.Linq.Tests.Perf_OrderBy.OrderByCustomComparer(NumberOfPeople: 512).
For ImmutableHashSet it looks like thebenchmark did not regress on the same CPU on Ubuntu:
image

The first regression around 10th of June is from enabling CET which is Windows only, so that makes sense. The regression around July 15th is the one that on first view would be caused by #104603, but given no regression on Ubuntu it does not seem like the regression is inherent to the codegen.

For OrderByCustomComparer I arrive at a similar conclusion... the same CPU on other OS's did not regress (and, in fact, somewhat improved):
image
So seems like this isn't an inherent problem with the codegen either.

I don't think there's much more actionable here... let's revisit in .NET 10 with more flowgraph work.

@jakobbotsch jakobbotsch modified the milestones: 9.0.0, 10.0.0 Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) Priority:2 Work that is important, but not critical for the release runtime-coreclr specific to the CoreCLR runtime
Projects
None yet
Development

No branches or pull requests

5 participants