[Perf -11%] System.Buffers.Tests.ReadOnlySequenceTests<Char>.IterateGetPositionTenSegments #47866

Closed
DrewScoggins opened this issue Feb 4, 2021 · 14 comments
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark

@DrewScoggins
Member

Run Information

Architecture x64
OS ubuntu 18.04
Baseline 2f2593177dafbe702407fe0b7ac156a7829b7ee6
Compare 6cf1b8ec012d52880d46fa4773f60ed52ddc9f3d
Diff Link

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

| Benchmark | Baseline | Test | Test/Base | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | 63.58 ns | 70.75 ns | 1.11 | | | | | |
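
As a quick sanity check on the table, the Test/Base column is simply the compare timing divided by the baseline timing. A minimal sketch of that arithmetic, using only the two numbers reported above (the variable names are made up for illustration):

```csharp
using System;

// Timings copied from the regression table above.
double baselineNs = 63.58;
double compareNs = 70.75;

double ratio = compareNs / baselineNs;   // the Test/Base column
double deltaNs = compareNs - baselineNs; // absolute slowdown per invocation

Console.WriteLine($"Test/Base = {ratio:F2}, delta = {deltaNs:F2} ns");
```

The ratio rounds to 1.11, which is the 11% in the issue title, and the absolute difference is a little over 7 ns per invocation.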

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Buffers.Tests.ReadOnlySequenceTests<Char>*'

Payloads

Baseline
Compare

Histogram

System.Buffers.Tests.ReadOnlySequenceTests.IterateGetPositionTenSegments

[61.840 ; 64.058) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[64.058 ; 65.542) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[65.542 ; 68.087) | @@@@@@@@@
[68.087 ; 69.574) | 
[69.574 ; 71.007) | @@@@@@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels Feb 4, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Threading untriaged New issue has not been triaged by the area owner labels Feb 4, 2021
@danmoseley
Member

No smoking gun but this change is perhaps the most likely relevant in the diff?

f6d8e88

@dotnet/jit-contrib thoughts?

@AndyAyersMS
Member

Could be. My guess is that that PR altered inlining and from there could be a number of things that impacted perf.

Someone on codegen should follow up.

@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Threading labels Feb 5, 2021
@danmoseley
Member

OK, I've moved it to that area.

@SingleAccretion
Contributor

SingleAccretion commented Feb 5, 2021

I will take a look. There were some positive diffs for SequenceReader on Windows x64...

@SingleAccretion
Contributor

SingleAccretion commented Feb 5, 2021

I studied this method, taken verbatim (except for the type parameter) from the benchmark, with the unix_x64_x64 AltJit:

[MethodImpl(MethodImplOptions.NoInlining)]
private int IterateGetPosition(ReadOnlySequence<char> sequence)
{
    int consume = 0;

    SequencePosition position = sequence.Start;
    int offset = (int)(sequence.Length / 10);
    SequencePosition end = sequence.GetPosition(0, sequence.End);

    while (!position.Equals(end))
    {
        position = sequence.GetPosition(offset, position);
        consume += position.GetInteger();
    }

    return consume;
}

The (new) folding for it kicks in 12 times during the compilation but does not affect the inlining, and the final assembly diff, while present, is decidedly uninformative about why the regression is there: https://www.diffchecker.com/aAIwTy3K.

Unfortunately, I do not have a Unix environment on which I could run the real benchmark (and get the actual assembly and perf numbers), so I will not be able to provide much more information on this. It looks like this could be related to alignment, but at the same time, the inner loop is very big.
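
For anyone who wants to step through the method outside the benchmark harness, here is a rough sketch of how a ten-segment `ReadOnlySequence<char>` could be built and fed to the same loop. The `CharSegment` class, the `BuildSequence` helper and the segment length are all assumptions made for illustration; the benchmark's actual setup code and sizes are not shown in this thread.

```csharp
using System;
using System.Buffers;

// Hypothetical sizes: the benchmark's real segment length is not stated in this thread.
const int SegmentCount = 10;
const int SegmentLength = 1_000;

ReadOnlySequence<char> sequence = BuildSequence(SegmentCount, SegmentLength);
Console.WriteLine($"Length = {sequence.Length}, single segment = {sequence.IsSingleSegment}");
Console.WriteLine($"consume = {IterateGetPosition(sequence)}");

// Chains segmentCount arrays together into one multi-segment sequence.
static ReadOnlySequence<char> BuildSequence(int segmentCount, int segmentLength)
{
    var first = new CharSegment(new char[segmentLength]);
    CharSegment last = first;
    for (int i = 1; i < segmentCount; i++)
    {
        last = last.Append(new char[segmentLength]);
    }
    return new ReadOnlySequence<char>(first, 0, last, last.Memory.Length);
}

// Same loop as the method quoted above.
static int IterateGetPosition(ReadOnlySequence<char> sequence)
{
    int consume = 0;

    SequencePosition position = sequence.Start;
    int offset = (int)(sequence.Length / 10);
    SequencePosition end = sequence.GetPosition(0, sequence.End);

    while (!position.Equals(end))
    {
        position = sequence.GetPosition(offset, position);
        consume += position.GetInteger();
    }

    return consume;
}

// Minimal linked segment; ReadOnlySequenceSegment<T> exposes protected setters for this purpose.
sealed class CharSegment : ReadOnlySequenceSegment<char>
{
    public CharSegment(ReadOnlyMemory<char> memory) => Memory = memory;

    public CharSegment Append(ReadOnlyMemory<char> memory)
    {
        // RunningIndex is the number of elements that precede this segment in the chain.
        var next = new CharSegment(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```

Because each `GetPosition` call advances by one tenth of the total length, the loop body runs a small, fixed number of times per invocation, so most of the measured time is spent in the seek logic rather than in the loop bookkeeping.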

@danmoseley
Member

Thank you @SingleAccretion .

I have gotten useful results doing perf measurements on WSL2, if you are interested in that option.

@SingleAccretion
Contributor

I will try that option and see how far I get. It may take a considerable amount of time 😄.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Feb 5, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Feb 5, 2021
@SingleAccretion
Contributor

SingleAccretion commented Feb 7, 2021

After having set up WSL2 (Ubuntu LTS 20.04), I have confirmed that the assembly I have obtained via the AltJit is exactly the same as the one that BDN's from-memory disassembler produces: see the diff.

To run the benchmarks, I used the following command line (note that I "restored" BDN's defaults to cut down on noise):

dotnet run -c Release -f net6.0 --iterationTime 500 --maxIterationCount 100  --statisticalTest 3ms --disasm --filter 'System.Buffers.Tests.ReadOnlySequenceTests<char>.IterateGetPositionTenSegments' --corerun "~/source/dotnet/runtime/" "~/source/dotnet/runtime-fix/"

Here the base, "runtime", is fc48ad5, and "runtime-fix" is f6d8e88. I ran a few benchmarks back to back; here are the results:

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 120.0 ns | 2.42 ns | 3.15 ns | 119.1 ns | 116.6 ns | 128.1 ns | 1.02 | Same | 0.03 | 778 B |
| IterateGetPositionTenSegments | /runtime/ | 117.2 ns | 2.26 ns | 2.42 ns | 116.6 ns | 114.0 ns | 122.3 ns | 1.00 | Base | 0.00 | 784 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 114.8 ns | 2.27 ns | 2.87 ns | 115.3 ns | 111.2 ns | 119.1 ns | 1.00 | Same | 0.04 | 778 B |
| IterateGetPositionTenSegments | /runtime/ | 114.1 ns | 2.29 ns | 3.14 ns | 113.6 ns | 110.6 ns | 121.8 ns | 1.00 | Base | 0.00 | 784 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 115.8 ns | 2.24 ns | 2.10 ns | 115.8 ns | 111.3 ns | 119.4 ns | 0.99 | Same | 0.03 | 778 B |
| IterateGetPositionTenSegments | /runtime/ | 116.4 ns | 2.33 ns | 2.68 ns | 116.2 ns | 113.4 ns | 122.1 ns | 1.00 | Base | 0.00 | 784 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 120.9 ns | 2.45 ns | 3.67 ns | 120.9 ns | 114.4 ns | 128.5 ns | 1.09 | Same | 0.04 | 778 B |
| IterateGetPositionTenSegments | /runtime/ | 112.2 ns | 2.26 ns | 2.69 ns | 110.7 ns | 109.8 ns | 118.7 ns | 1.00 | Base | 0.00 | 784 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 118.6 ns | 2.41 ns | 3.53 ns | 116.8 ns | 114.9 ns | 127.1 ns | 1.06 | Same | 0.02 | 778 B |
| IterateGetPositionTenSegments | /runtime/ | 112.7 ns | 2.20 ns | 2.86 ns | 111.2 ns | 110.6 ns | 119.3 ns | 1.00 | Base | 0.00 | 784 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 115.4 ns | 2.32 ns | 3.01 ns | 115.0 ns | 111.7 ns | 120.5 ns | 1.01 | Same | 0.04 | 778 B |
| IterateGetPositionTenSegments | /runtime/ | 112.8 ns | 2.28 ns | 4.16 ns | 110.1 ns | 108.9 ns | 121.9 ns | 1.00 | Base | 0.00 | 784 B |

As can be seen, the regression does not reproduce reliably, only sometimes. I've swapped these two lines in the benchmark code:

- int offset = (int)(sequence.Length / 10);
- SequencePosition end = sequence.GetPosition(0, sequence.End);
+ SequencePosition end = sequence.GetPosition(0, sequence.End);
+ int offset = (int)(sequence.Length / 10);

This stabilized things somewhat:

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 110.8 ns | 2.20 ns | 2.16 ns | 110.6 ns | 108.5 ns | 114.8 ns | 0.99 | Same | 0.03 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 108.5 ns | 2.19 ns | 4.05 ns | 106.3 ns | 104.6 ns | 117.2 ns | 1.00 | Base | 0.00 | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 104.3 ns | 0.61 ns | 0.51 ns | 104.3 ns | 103.5 ns | 105.6 ns | 0.99 | Same | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 105.3 ns | 0.49 ns | 0.46 ns | 105.2 ns | 104.6 ns | 106.1 ns | 1.00 | Base | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 109.3 ns | 0.63 ns | 0.49 ns | 109.2 ns | 108.7 ns | 110.3 ns | 1.01 | Same | 0.03 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 107.8 ns | 2.16 ns | 2.49 ns | 108.7 ns | 104.5 ns | 112.5 ns | 1.00 | Base | 0.00 | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 108.3 ns | 2.12 ns | 1.66 ns | 108.6 ns | 104.8 ns | 110.6 ns | 0.96 | Same | 0.03 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 113.0 ns | 2.27 ns | 2.33 ns | 113.6 ns | 109.7 ns | 117.3 ns | 1.00 | Base | 0.00 | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 108.8 ns | 2.10 ns | 3.07 ns | 109.3 ns | 105.1 ns | 115.3 ns | 1.01 | Same | 0.03 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 107.9 ns | 2.17 ns | 2.90 ns | 106.3 ns | 104.6 ns | 113.8 ns | 1.00 | Base | 0.00 | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 110.7 ns | 2.19 ns | 1.94 ns | 110.3 ns | 108.4 ns | 114.4 ns | 0.99 | Same | 0.04 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 108.2 ns | 2.19 ns | 3.83 ns | 105.5 ns | 104.3 ns | 116.0 ns | 1.00 | Base | 0.00 | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 111.2 ns | 2.23 ns | 2.48 ns | 110.4 ns | 108.3 ns | 116.6 ns | 1.03 | Same | 0.03 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 107.8 ns | 2.15 ns | 2.47 ns | 106.8 ns | 105.0 ns | 111.9 ns | 1.00 | Base | 0.00 | 767 B |

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(3ms) | RatioSD | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IterateGetPositionTenSegments | /runtime-fix/ | 106.2 ns | 2.09 ns | 2.64 ns | 104.9 ns | 103.4 ns | 111.6 ns | 0.99 | Same | 0.03 | 761 B |
| IterateGetPositionTenSegments | /runtime/ | 107.7 ns | 2.17 ns | 2.82 ns | 106.8 ns | 104.5 ns | 113.3 ns | 1.00 | Base | 0.00 | 767 B |

My conclusion, based on the above data and the fact that the benchmark code is strictly better as it has two fewer movs, is that this is not a real product regression and the issue can be closed.

This looks and feels like an alignment problem, but aligning a loop this big does not seem like a good idea for the code at large.
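
To put the run-to-run noise in perspective, the Mean columns from the two sets of tables above can be pooled. The sketch below is only a rough summary: it divides the reported means directly, so the per-run values need not match BDN's own Ratio column exactly, and the helper names are invented for illustration.

```csharp
using System;
using System.Linq;

// Mean column (ns) copied from the tables above, one entry per back-to-back run.
double[] fixBefore  = { 120.0, 114.8, 115.8, 120.9, 118.6, 115.4 };
double[] baseBefore = { 117.2, 114.1, 116.4, 112.2, 112.7, 112.8 };
double[] fixAfter   = { 110.8, 104.3, 109.3, 108.3, 108.8, 110.7, 111.2, 106.2 };
double[] baseAfter  = { 108.5, 105.3, 107.8, 113.0, 107.9, 108.2, 107.8, 107.7 };

Report("original order", fixBefore, baseBefore);
Report("swapped lines", fixAfter, baseAfter);

// Ratio of the average compare timing to the average baseline timing.
static double RatioOfMeans(double[] compare, double[] baseline) =>
    compare.Average() / baseline.Average();

static void Report(string label, double[] compare, double[] baseline)
{
    // Pair up run i of the compare toolchain with run i of the baseline.
    double[] perRun = compare.Zip(baseline, (c, b) => c / b).ToArray();
    Console.WriteLine(
        $"{label}: ratio of means {RatioOfMeans(compare, baseline):F3}, " +
        $"per-run min {perRun.Min():F3}, max {perRun.Max():F3}");
}
```

On these numbers the pooled ratio is about 1.03 for the original line order and about 1.00 after the swap, both well below the 1.11 in the original report.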

@danmoseley
Member

Cc @kunalspathak

@JulieLeeMSFT
Member

@kunalspathak please check the analysis from @SingleAccretion and see if we can close this issue.

@kunalspathak
Member

I will take a look sometime next week.

@kunalspathak
Member

I agree with @SingleAccretion . I am pasting the diff screenshot as the diff links above do not work.

image

We have 2 fewer movs after the change. These movs are not even in a loop, so it shouldn't matter much. This test is sensitive to data alignment because it operates on a char array that is allocated and passed as an input to the benchmark. Further, it seeks to various positions of the memory in the benchmark.

The benchmark's overall history shows a slight regression around that time, but the measurement is unstable, so I won't rely too much on the numbers. The diff is 7 ns, which is in the error range. Closing the issue.

image
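
As a rough illustration of the data-alignment point, the sketch below pins a few freshly allocated char arrays and prints where their first element falls relative to a 64-byte boundary. The 64-byte line size and the array length are assumptions chosen for the example, not values taken from the benchmark, and `OffsetInCacheLine` is a made-up helper.

```csharp
using System;
using System.Runtime.InteropServices;

// 64 bytes is an assumed cache-line size; it is typical for x64 but not guaranteed.
const int CacheLineBytes = 64;

for (int i = 0; i < 5; i++)
{
    char[] buffer = new char[10_000]; // hypothetical size, not the benchmark's
    int offset = OffsetInCacheLine(buffer, CacheLineBytes);
    Console.WriteLine($"allocation {i}: data starts {offset} bytes past a {CacheLineBytes}-byte boundary");
}

// Returns the address of the first element modulo cacheLineBytes.
static int OffsetInCacheLine(char[] array, int cacheLineBytes)
{
    // Pin the array so the GC cannot move it while its address is read.
    GCHandle handle = GCHandle.Alloc(array, GCHandleType.Pinned);
    try
    {
        long address = handle.AddrOfPinnedObject().ToInt64();
        return (int)(address % cacheLineBytes);
    }
    finally
    {
        handle.Free();
    }
}
```

The offsets can differ between allocations and between process launches, which is one way two runs of identical code could end up with different timings.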

@danmoseley
Member

> This test is sensitive to data alignment because it operates on a char array that is allocated and passed as an input to the benchmark.

@adamsitnik I am wondering whether this is still the case. I know you added memory randomization in BDN in January (dotnet/BenchmarkDotNet#1587) and I guess we have pulled this in since. I also see you did dotnet/performance#1587 to move the allocations in this test into GlobalSetup so it would work. Am I right in thinking this is solved now? I am not sure how to match the dates to @kunalspathak's graph above, though.

@AndyAyersMS
Member

We have not yet enabled data randomization -- I believe @DrewScoggins is about to turn it on for a few tests so we can get a feel for how it will impact our ability to understand perf in alignment-sensitive tests.

@ghost ghost locked as resolved and limited conversation to collaborators May 28, 2021