Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark Regressions from Profile Maintenance and Block Layout Changes #113108

Open
amanasifkhalid opened this issue Mar 4, 2025 · 21 comments
Open
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@amanasifkhalid
Copy link
Member

Part of #107749. I've collated all the regression reports assigned to me since #103450 was merged. For now, I'm excluding reports on Tiger machine configs to skip JCC errata-related regressions.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 4, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Mar 4, 2025
@amanasifkhalid amanasifkhalid added tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Mar 4, 2025
@amanasifkhalid amanasifkhalid self-assigned this Mar 4, 2025
@amanasifkhalid amanasifkhalid added this to the 10.0.0 milestone Mar 4, 2025
@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Mar 4, 2025

Notes Recent Score Orig Score arm64 ubuntu Surface Windows Viper Ubuntu Viper Windows Benchmark
#111736 2.65 2.67 1.76
1.79
3.99
3.98
System.Memory.Span(Byte).Reverse(Size: 512)
Unrelated 2.00 2.00 2.00
2.00
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj
#111736 1.97 1.95 1.43
1.42
2.72
2.68
System.Memory.Span(Char).Reverse(Size: 512)
#111736 1.84 1.85 1.68
1.68
2.02
2.04
System.Memory.Span(Int32).Reverse(Size: 512)
Unrelated 1.81 1.81 1.81
1.81
System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_FirstElementMatches(input: Array)
Noisy 1.70 2.03 1.70
2.03
System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: True, TestRandomAccess: True, TestCase: Json400KB)
#111736 1.56 1.06 1.56
1.06
System.Text.Json.Tests.Perf_Reader.ReadMultiSpanSequenceEmptyLoop(IsDataCompact: False, TestCase: DeepTree)
Unrelated 1.48 1.46 0.98
1.20
3.14
2.37
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512)
Noisy 1.45 1.12 1.45
1.12
System.Text.Json.Node.Tests.Perf_ParseThenWrite.ParseThenWrite(IsDataIndented: True, TestCase: Json400KB)
#111915 1.43 1.47 1.43
1.47
System.Reflection.Invoke.StaticMethod5_arrayNotCached_int_string_struct_class_bool
#111736 1.39 1.39 1.39
1.39
System.Memory.Span(Byte).IndexOfValue(Size: 512)
#111736 1.38 1.39 1.38
1.39
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Int32).BitwiseAnd_Scalar(BufferLength: 128)
#111736 1.35 1.31 1.35
1.31
System.Numerics.Tests.Perf_BigInteger.Equals(arguments: 259 bytes, Same)
#112041 1.33 1.33 1.33
1.33
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Pow_Vectors(BufferLength: 3079)
#112041 1.33 1.33 1.33
1.33
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Pow_ScalarExponent(BufferLength: 3079)
#112041 1.33 1.33 1.33
1.33
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Pow_ScalarBase(BufferLength: 3079)
#112041 1.33 1.32 1.33
1.32
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Pow_Vectors(BufferLength: 128)
#111736 1.33 1.33 1.33
1.33
Benchstone.BenchI.BubbleSort.Test
#112041 1.32 1.32 1.32
1.32
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Pow_ScalarExponent(BufferLength: 128)
#112041 1.32 1.32 1.32
1.32
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Pow_ScalarBase(BufferLength: 128)
#111736 1.32 1.33 1.32
1.33
System.Memory.Span(Int32).LastIndexOfValue(Size: 512)
1.32 1.26 1.32
1.26
System.Memory.Span(Int32).SequenceEqual(Size: 512)
1.31 1.30 1.31
1.30
SciMark2.kernel.benchmarkLU
#103450 1.30 1.29 1.30
1.29
System.Memory.Span(Int32).IndexOfAnyFourValues(Size: 512)
#111736 1.27 1.27 1.27
1.27
System.Memory.Span(Byte).StartsWith(Size: 512)
#112041 1.26 1.26 1.26
1.26
System.MathBenchmarks.Single.AtanPi
#111736 1.25 1.28 1.25
1.28
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Truncate(BufferLength: 128)
#111736 1.25 1.18 1.25
1.18
Microsoft.Extensions.Primitives.StringSegmentBenchmark.Equals_Object_Valid
#111736 1.25 1.24 1.41
1.42
1.11
1.09
System.Buffers.Text.Tests.Base64Tests.ConvertToBase64CharArray(NumberOfBytes: 1000)
Unrelated 1.25 1.60 1.25
1.60
System.Threading.Tasks.ValueTaskPerfTest.CreateAndAwait_FromResult
#110277 1.25 1.25 1.24
1.25
1.25
1.25
System.Collections.IterateForNonGeneric(String).ArrayList(Size: 512)
#111736 1.25 1.56 1.25
1.56
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Int32).Max_Scalar(BufferLength: 128)
#111915 1.25 1.22 1.25
1.22
System.Collections.Tests.Add_Remove_SteadyState(String).Queue(Count: 512)
#111736 1.24 1.24 1.24
1.24
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Single).Add_Scalar(BufferLength: 128)
#103450 1.24 1.25 1.24
1.25
System.Memory.Span(Int32).IndexOfAnyFiveValues(Size: 512)
#111915 1.24 1.23 1.24
1.24
1.23
1.23
System.Collections.IterateForNonGeneric(Int32).ArrayList(Size: 512)
#111736 1.24 1.23 1.24
1.23
System.Memory.Span(Byte).SequenceEqual(Size: 512)
#111736 1.23 1.28 1.23
1.28
System.Memory.Span(Char).SequenceEqual(Size: 512)
#111047 1.23 1.22 1.23
1.22
System.Reflection.Invoke.Field_SetStatic_int
#112041 1.22 1.22 1.22
1.22
System.MathBenchmarks.Single.AcosPi
Regressed by 3-opt and profile changes 1.21 1.31 1.13
1.13
1.29
1.52
Benchstone.BenchI.IniArray.Test
#110277 1.20 1.20 1.20
1.20
System.Tests.Perf_Char.Char_ToUpperInvariant(input: "Hello World!")
#110277 1.19 1.20 1.18
1.16
1.21
1.25
Span.Sorting.QuickSortArray(Size: 512)
#112041 1.19 1.19 1.19
1.19
System.MathBenchmarks.Double.AtanPi
#103450 1.18 1.33 1.19
1.11
1.17
1.60
Span.Sorting.BubbleSortSpan(Size: 512)
#111915 1.18 1.15 1.27
1.21
1.09
1.09
System.Collections.IterateForEachNonGeneric(String).ArrayList(Size: 512)
Unrelated 1.17 1.17 1.17
1.17
System.Numerics.Tests.Perf_VectorConvert.Convert_float_int
#110277 1.17 1.19 1.17
1.19
System.Collections.IterateForEach(String).ImmutableSortedSet(Size: 512)
#110277 1.17 1.16 1.15
1.14
1.19
1.18
System.Collections.IterateFor(Int32).IList(Size: 512)
#112041 1.16 1.16 1.16
1.16
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).AtanPi(BufferLength: 128)
1.16 1.87 1.16
1.87
System.Buffers.Tests.RentReturnArrayPoolTests(Byte).ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True)
#112041 1.16 1.16 1.16
1.16
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).AtanPi(BufferLength: 3079)
#110277 1.16 1.18 1.06
1.08
1.20
1.26
1.23
1.21
Span.Sorting.QuickSortSpan(Size: 512)
#111736 1.16 1.16 1.16
1.16
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Single).Divide_Vector(BufferLength: 128)
#111736 1.15 1.15 1.15
1.15
System.Text.Perf_Ascii.ToUpper_Bytes_Chars(Size: 128)
#111736 1.15 1.14 1.15
1.14
System.Tests.Perf_Int64.TryFormat(value: 12345)
1.15 1.14 1.15
1.14
System.Collections.CreateAddAndClear(String).ConcurrentQueue(Size: 512)
#111736 1.15 1.30 1.15
1.30
System.Memory.Span(Byte).IndexOfAnyThreeValues(Size: 33)
#111915 1.14 1.07 1.14
1.07
System.Collections.Tests.Perf_PriorityQueue(Int32, Int32).HeapSort(Size: 1000)
#111915 1.14 1.11 1.14
1.11
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 6, Options: NonBacktracking)
1.14 1.14 1.14
1.14
System.Text.Json.Document.Tests.Perf_EnumerateObject.EnumerateProperties(TestCase: StringProperties)
1.14 1.14 1.14
1.14
Span.IndexerBench.CoveredIndex2(length: 1024)
1.14 1.14 1.07
1.07
1.21
1.21
System.MathBenchmarks.Single.AsinPi
1.14 1.14 1.11
1.11
1.17
1.17
System.MathBenchmarks.Double.AsinPi
1.14 1.13 1.14
1.13
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).AtanPi(BufferLength: 128)
1.14 1.11 1.14
1.11
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 10, Options: NonBacktracking)
1.13 1.14 1.13
1.14
Benchstone.BenchF.Trap.Test
1.13 1.12 1.12
1.11
1.14
1.14
System.MathBenchmarks.Double.AcosPi
1.13 1.20 1.13
1.20
XmlDocumentTests.XmlNodeListTests.Perf_XmlNodeList.GetCount
1.12 1.12 1.12
1.12
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).AtanPi(BufferLength: 3079)
1.12 1.13 1.09
1.06
1.16
1.21
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Pow_Vectors(BufferLength: 3079)
1.12 1.12 1.12
1.12
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Distance(BufferLength: 3079)
1.11 1.11 1.11
1.11
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Single).Max(BufferLength: 3079)
1.11 1.11 1.11
1.11
System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_False_FrozenDictionary(Count: 10000, ItemsPerBucket: 5)
1.11 1.10 1.11
1.10
Benchmark.GetChildKeysTests.AddChainedConfigurationEmpty
1.11 1.20 1.05
1.06
1.17
1.36
Benchstone.BenchI.BubbleSort2.Test
1.11 1.11 1.11
1.11
System.Text.Perf_Ascii.ToLower_Bytes_Chars(Size: 128)
1.11 1.08 1.11
1.08
LinqBenchmarks.Count00LinqMethodX
1.11 1.09 1.11
1.09
PerfLabTests.DelegatePerf.MulticastDelegateInvoke(length: 1000)
1.10 1.09 1.10
1.09
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).FusedMultiplyAdd_ScalarMultiplier(BufferLength: 128)
1.10 1.52 1.06
1.71
1.15
1.35
System.Collections.IterateFor(String).ImmutableList(Size: 512)
1.10 1.10 1.10
1.10
System.Text.Json.Tests.Perf_Reader.ReadReturnBytes(IsDataCompact: False, TestCase: DeepTree)
1.10 1.20 1.10
1.20
System.Collections.TryGetValueTrue(SmallClass, SmallClass).ImmutableDictionary(Size: 512)
1.10 1.07 1.10
1.07
System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_True_FrozenDictionary(Count: 1000, ItemsPerBucket: 5)
1.10 1.14 1.01
1.13
1.15
1.16
1.14
1.14
SeekUnroll.Test(boxedIndex: 3)
1.10 1.07 1.10
1.07
Benchstone.MDBenchI.MDNDhrystone.Test
1.10 1.07 1.10
1.07
System.Tests.Perf_Version.Parse4
1.09 1.15 1.09
1.15
System.Tests.Perf_Enum.ToString_Format_NonFlags(value: Thursday, format: "f")
1.09 1.07 1.07
1.07
1.12
1.07
System.Tests.Perf_Enum.ToString_Flags(value: 32)
1.09 1.09 1.09
1.09
System.Globalization.Tests.StringSearch.IsPrefix_FirstHalf(Options: (, IgnoreCase, True))
1.09 1.09 1.09
1.09
System.Collections.IterateForEachNonGeneric(Int32).ArrayList(Size: 512)
1.09 1.10 1.09
1.10
Benchstone.BenchI.Array2.Test
1.09 1.05 1.09
1.05
System.IO.Tests.Perf_Path.GetDirectoryName
1.09 1.08 1.09
1.08
LinqBenchmarks.Where00ForX
1.09 1.19 1.20
1.20
0.99
1.17
Span.IndexerBench.CoveredIndex3(length: 1024)
1.09 1.11 1.09
1.11
System.Globalization.Tests.StringSearch.IsPrefix_FirstHalf(Options: (en-US, IgnoreCase, True))
1.09 1.09 1.09
1.09
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Single).BitwiseAnd_Vector(BufferLength: 128)
1.09 1.11 1.07
1.07
1.11
1.16
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Pow_ScalarExponent(BufferLength: 3079)
1.09 1.09 1.09
1.09
System.Linq.Tests.Perf_Enumerable.SelectToList(input: IList)
1.09 1.07 1.09
1.07
System.Collections.IndexerSet(Int32).SortedDictionary(Size: 512)
1.09 1.11 1.06
1.07
1.11
1.16
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Pow_ScalarBase(BufferLength: 128)
1.08 1.11 1.08
1.11
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).FusedMultiplyAdd_Vectors(BufferLength: 128)
1.08 1.09 1.08
1.09
LinqBenchmarks.Where00LinqQueryX
1.08 1.08 1.08
1.08
System.Memory.Span(Byte).LastIndexOfAnyValues(Size: 512)
1.08 1.09 1.08
1.09
LinqBenchmarks.Where00LinqMethodX
1.08 1.14 1.08
1.14
System.Buffers.Tests.ReadOnlySequenceTests(Byte).IterateGetPositionArray
1.08 1.13 1.08
1.13
System.Buffers.Tests.RentReturnArrayPoolTests(Byte).MultipleSerial(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: True)
1.08 1.12 1.08
1.12
System.Collections.Sort(IntStruct).List(Size: 512)
1.08 1.09 1.08
1.09
System.Collections.IterateForEach(Int32).FrozenDictionary(Size: 512)
1.08 1.08 1.08
1.08
Burgers.Test3
1.08 1.12 1.00
1.13
1.16
1.17
1.07
1.07
SeekUnroll.Test(boxedIndex: 1)
1.07 1.17 1.07
1.17
System.Threading.Tasks.ValueTaskPerfTest.CreateAndAwait_FromCompletedTask
1.07 1.18 1.07
1.18
System.Tests.Perf_Enum.ToString_Format_Flags_Large(value: All, format: "g")
1.07 1.08 1.07
1.08
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Single).AddMultiply_Vectors(BufferLength: 128)
1.07 1.09 1.07
1.09
System.Memory.Span(Int32).IndexOfAnyThreeValues(Size: 33)
1.07 1.09 1.07
1.09
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 3, Options: NonBacktracking)
1.07 1.10 1.08
1.11
1.07
1.09
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, IgnoreCase, False))
1.07 1.07 1.07
1.06
1.08
1.08
System.Text.Json.Document.Tests.Perf_ParseThenWrite.ParseThenWrite(IsDataIndented: True, TestCase: DeepTree)
1.07 1.08 1.07
1.08
System.Net.Tests.Perf_WebUtility.Decode_DecodingRequired
1.07 1.07 1.07
1.07
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Pow_ScalarBase(BufferLength: 3079)
1.07 1.07 1.07
1.07
System.Buffers.Tests.SearchValuesByteTests.ContainsAny(Values: "abcdefABCDEF0123456789")
1.07 1.11 1.07
1.11
System.Collections.ContainsTrueComparer(String).HashSet(Size: 512)
1.07 1.08 1.07
1.08
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: UnsafeRelaxed,no (escaping /) required,16)
1.07 1.06 1.07
1.06
System.Collections.TryGetValueTrue(BigStruct, BigStruct).ConcurrentDictionary(Size: 512)
1.07 1.07 1.07
1.07
Benchstone.BenchI.CSieve.Test
1.07 1.09 1.07
1.06
1.07
1.12
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Pow_Vectors(BufferLength: 128)
1.07 1.07 1.07
1.07
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8(arguments: UnsafeRelaxed,no (escaping /) required,512)
1.07 1.09 1.07
1.10
1.07
1.09
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (, IgnoreCase, False))
1.07 1.11 1.07
1.11
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, IgnoreCase, False))
1.07 1.06 1.07
1.06
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Pow_ScalarExponent(BufferLength: 128)
1.07 1.07 1.07
1.07
System.Tests.Perf_Version.TryParse4
1.07 1.27 1.24
1.29
0.92
1.26
System.Tests.Perf_Char.Char_IsLower(input: "Good afternoon, Constable!")
1.07 1.11 1.07
1.11
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreCase, False))
1.07 1.07 1.07
1.07
System.Collections.Tests.Perf_PriorityQueue(Int32, Int32).Dequeue_And_Enqueue(Size: 100)
1.06 1.06 1.06
1.06
System.Tests.Perf_Int128.TryParse(value: "-170141183460469231731687303715884105728")
1.06 1.06 1.06
1.06
System.Text.Perf_Ascii.Equals_Bytes_Chars(Size: 128)
1.06 1.06 1.06
1.06
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Double).Sqrt(BufferLength: 128)
1.06 1.06 1.07
1.06
1.06
1.06
System.Tests.Perf_Int128.ParseSpan(value: "-170141183460469231731687303715884105728")
1.06 1.08 1.06
1.08
System.Memory.Span(Byte).IndexOfAnyTwoValues(Size: 512)
1.06 1.06 1.05
1.06
1.07
1.07
System.Linq.Tests.Perf_Enumerable.Average(input: IEnumerable)
1.06 1.05 1.06
1.05
System.Reflection.Invoke.Property_Set_int
1.06 1.15 1.00
1.08
1.00
1.08
1.20
1.32
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, OrdinalIgnoreCase, False))
1.06 1.15 1.06
1.15
System.Text.Json.Tests.Perf_Reader.ReadReturnBytes(IsDataCompact: True, TestCase: LotsOfStrings)
1.06 1.06 1.06
1.07
1.05
1.06
System.Tests.Perf_Int128.Parse(value: "-170141183460469231731687303715884105728")
1.06 1.06 1.06
1.06
System.MathBenchmarks.Single.ExpM1
1.06 1.06 1.06
1.06
1.05
1.06
System.Tests.Perf_Int128.TryParseSpan(value: "-170141183460469231731687303715884105728")
1.06 1.14 1.06
1.14
System.Tests.Perf_Int32.TryParse(value: "2147483647")
1.06 1.06 1.06
1.06
System.IO.Tests.Perf_StreamWriter.WritePartialCharArray(writeLength: 2)
1.06 1.06 1.06
1.06
System.Text.Json.Tests.Perf_Guids.WriteGuids(Formatted: False, SkipValidation: False)
1.06 1.13 1.06
1.09
1.05
1.17
MicroBenchmarks.Serializers.Xml_ToStream(IndexViewModel).XmlSerializer_
1.06 1.06 1.06
1.06
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Single).IndexOfMax(BufferLength: 128)
1.05 1.05 1.05
1.05
Layout.SearchLoops.LoopReturn
1.05 1.05 1.05
1.05
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Int32).MaxMagnitude_Vector(BufferLength: 128)
1.05 1.05 1.05
1.05
System.MathBenchmarks.Double.ExpM1
1.05 1.10 1.05
1.10
System.Collections.CreateAddAndClear(String).ImmutableArray(Size: 512)
1.05 1.21 1.08
1.08
1.11
1.09
0.97
1.51
System.Memory.Span(Int32).IndexOfAnyFiveValues(Size: 33)
1.04 1.10 1.04
1.10
System.Collections.TryGetValueTrue(SmallClass, SmallClass).SortedList(Size: 512)
1.04 1.08 1.07
1.09
1.02
1.07
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 4, Options: NonBacktracking)
1.04 1.20 1.04
1.20
System.Collections.TryGetValueTrue(SmallClass, SmallClass).ImmutableSortedDictionary(Size: 512)
1.04 1.37 0.98
1.63
1.12
1.32
1.02
1.19
System.Collections.IterateFor(Int32).ImmutableList(Size: 512)
#111736 1.03 1.05 1.03
1.05
System.Memory.Span(Char).Clear(Size: 512)
1.03 1.08 1.03
1.08
System.Linq.Tests.Perf_Enumerable.AppendPrepend(input: IEnumerable)
1.03 1.80 1.03
1.80
System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_LastElementMatches(input: Array)
1.03 1.07 1.03
1.07
System.Diagnostics.Perf_Process.GetCurrentProcess
1.03 1.09 1.03
1.09
System.Text.Json.Serialization.Tests.WriteJson(ClassRecord).SerializeToUtf8Bytes(Mode: Reflection)
1.03 1.11 1.03
1.11
System.Collections.IndexerSet(String).SortedList(Size: 512)
1.03 1.09 1.03
1.09
System.Collections.CreateAddAndRemove(String).List(Size: 512)
1.02 2.08 1.02
2.08
System.Threading.Tests.Perf_Interlocked.CompareExchange_int
1.02 1.10 1.02
1.10
System.Numerics.Tensors.Tests.Perf_NumberTensorPrimitives(Int32).AddMultiply_Vectors(BufferLength: 128)
1.02 1.09 1.02
1.09
Span.Sorting.BubbleSortArray(Size: 512)
1.02 1.22 1.02
1.22
System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_FirstBitDifferent
1.02 1.08 1.02
1.08
MicroBenchmarks.Serializers.Json_ToStream(Location).DataContractJsonSerializer_
1.02 1.20 1.02
1.20
System.Linq.Tests.Perf_Enumerable.SelectToList(input: Array)
1.02 1.17 1.02
1.17
System.Buffers.Tests.RentReturnArrayPoolTests(Object).ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False)
1.02 1.22 1.02
1.22
System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_Equal
1.02 1.22 1.02
1.22
System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_VersusZero
1.02 1.22 1.02
1.22
System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_LastBitDifferent
1.02 1.22 1.02
1.22
System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_AllBitsDifferent
1.02 1.22 1.02
1.22
System.Security.Cryptography.Primitives.Tests.Performance.Perf_FixedTimeEquals.FixedTimeEquals_256Bit_CascadingErrors
1.01 1.29 1.04
1.60
1.00
1.18
1.01
1.14
System.Collections.IterateFor(Int32).ImmutableSortedSet(Size: 512)
1.01 1.08 1.01
1.08
SciMark2.kernel.benchMonteCarlo
1.01 1.10 1.01
1.10
System.Collections.IterateForEach(String).SortedSet(Size: 512)
1.01 1.09 1.01
1.09
System.Collections.Perf_LengthBucketsFrozenDictionary.TryGetValue_True_FrozenDictionary(Count: 10, ItemsPerBucket: 5)
1.01 1.13 1.00
1.19
1.02
1.07
PerfLabTests.LowLevelPerf.ForeachOverList100Elements
1.01 1.54 1.02
1.48
1.00
1.87
1.00
1.33
Struct.SpanWrapper.WrapperSum

Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Member Author

Starting from the top, the System.Memory.Span.Reverse regressions are quite large, and were likely introduced by #111736. The only interesting change in that PR is we no longer mark blocks unreachable via normal flow as rarely run in optSetBlockWeights. As far as I can tell, this only makes a difference for handler regions.

Currently, profile synthesis will set the weight of handler region entries to something close to zero, but it won't mark them as rarely run. We probably don't want to change this, because marking them as rarely run would block inlining within handlers, which matters for finally regions. If we keep handler regions from being marked as rarely run after optSetBlockWeights, these blocks will be considered by a handful of later phases, in particular CSE. Considering CSEs in handler blocks is probably a poor use of our CSE budget, and we already treat handler regions as cold elsewhere, but #111736 had a decent number of improvements too, so I'm hesitant to reintroduce the previous behavior.

I wasn't able to repro the regression with EgorBot, but I'll take a look locally.

@amanasifkhalid
Copy link
Member Author

I wasn't able to repro the regression with EgorBot, but I'll take a look locally.

I'm able to repro the regression on my Windows AMD machine, but the difference is quite inconsistent. From run to run, the difference is either just a nanosecond, or over 2x. For example:

| Method  | Job        | Toolchain                                                                         | Size | Mean     | Error     | StdDev    | Median   | Min      | Max      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------- |----------- |---------------------------------------------------------------------------------- |----- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| Reverse | Job-IGPZVU | \runtime2\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\corerun.exe | 512  | 8.496 ns | 0.2086 ns | 0.1742 ns | 8.452 ns | 8.308 ns | 8.898 ns |  1.00 |    0.03 |         - |          NA |
| Reverse | Job-DKUPJI | \runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\corerun.exe  | 512  | 7.777 ns | 0.1223 ns | 0.1085 ns | 7.746 ns | 7.630 ns | 8.008 ns |  0.92 |    0.02 |         - |          NA |

| Method  | Job        | Toolchain                                                                         | Size | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | RatioSD | Allocated | Alloc Ratio |
|-------- |----------- |---------------------------------------------------------------------------------- |----- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|----------:|------------:|
| Reverse | Job-SIJWDS | \runtime2\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\corerun.exe | 512  | 21.037 ns | 0.2629 ns | 0.2459 ns | 20.927 ns | 20.858 ns | 21.567 ns |  1.00 |    0.02 |         - |          NA |
| Reverse | Job-MXGQIK | \runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\corerun.exe  | 512  |  8.090 ns | 0.1682 ns | 0.1405 ns |  8.051 ns |  7.973 ns |  8.485 ns |  0.38 |    0.01 |         - |          NA |

I'm failing to find any codegen diffs along the call stack of this benchmark, and the potential fix would likely have its own slew of regressions, so the return on pursuing these regressions likely aren't as high as others that are directly related to layout churn. I'll keep this one in mind for later...

@amanasifkhalid
Copy link
Member Author

PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj regressed and then later improved due to flowgraph/profile churn. Its current regressed status is from #112060, and is tracked in #112657. This is also true for System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_FirstElementMatches(input: Array). I'll remove these from the above chart.

@amanasifkhalid
Copy link
Member Author

System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) improved on both platforms once late profile repair was enabled. On Viper Windows, it regressed again around 2/27/2025. I can't find an issue tracking this regression, but nothing in the commit range is flowgraph-related.

@amanasifkhalid
Copy link
Member Author

I've annotated the most-regressed benchmarks with culprit PRs based on when the last notable regression occurred.

  • JIT: Fix profile maintenance in optSetBlockWeights, funclet creation #111736 comes up much more often than I expected it to; I suspect it's pessimizing CSE quite a bit.
  • There are a handful of PRs that regressed from the initial 3-opt or 4-opt PRs, and didn't budge all that much since. These likely expose the limitations of our cost model's ability to reflect actual performance.
  • Unsurprisingly, several regressions are attributed to changes in profile data. These could be regressions in block layout as well, and/or regressions elsewhere.

I looked at a few of the regressions from #103450, and for these cases, 3-opt is keen to make bottom-tested loops top-tested. For example, here's the initial layout for System.SpanHelpers:IndexOfAny, which is the hot call in System.Memory.Span.IndexOfAnyFourValues and System.Memory.Span.IndexOfAnyFiveValues:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      18 [000..011)-> BB03(0.2),BB02(0.8)     ( cond )                     i LIR IBC
BB02 [0032]  1       BB01                  0.80   14 [000..001)-> BB03(1)                 (always)                     i LIR IBC hascall gcsafe
BB03 [0033]  2       BB01,BB02             1      18 [000..022)-> BB05(0.2),BB04(0.8)     ( cond )                     i LIR IBC
BB04 [0037]  1       BB03                  0.80   14 [011..012)-> BB05(1)                 (always)                     i LIR IBC hascall gcsafe
BB05 [0038]  2       BB03,BB04             1      18 [011..025)-> BB06(0.99),BB20(0.01)   ( cond )                     i LIR IBC
BB06 [0002]  1       BB05                  0.99   18 [027..03C)-> BB20(0.00595),BB17(0.994)   ( cond )                     i LIR IBC
BB17 [0043]  2       BB06,BB19            37.70  679 [03C..04D)-> BB19(0.195),BB18(0.805) ( cond )                     i LIR IBC bwd
BB18 [0044]  2       BB10,BB17           152.87 2752 [04D..062)-> BB10(0.995),BB09(0.005) ( cond )                     i LIR IBC bwd
BB10 [0007]  1       BB18                152.11 2738 [064..06C)-> BB18(0.805),BB19(0.195) ( cond )                     i LIR IBC bwd
BB19 [0045]  2       BB10,BB17            36.93  665 [06C..074)-> BB17(0.994),BB20(0.00595)   ( cond )                     i LIR IBC bwd
BB09 [0006]  1       BB18                  0.76   14 [062..064)                           (return)                     i LIR IBC
BB20 [0046]  3       BB06,BB05,BB19        0.24    4 [025..114)                           (return)                     i LIR IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Both loops are bottom-tested, and we fall out of the inner loop into the outer loop. 3-opt transforms this into the following:

Initial layout cost: 3037.297884
Running greedy 3-opt pass.
Swapping partitions [BB17, BB10] and [BB19, BB19] (cost change = -106.408088)
Swapping partitions [BB19, BB10] and [BB20, BB09] (cost change = -0.105988)
Final layout cost: 2930.783808
...

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      18 [000..011)-> BB03(0.2),BB02(0.8)     ( cond )                     i LIR IBC
BB02 [0032]  1       BB01                  0.80   14 [000..001)-> BB03(1)                 (always)                     i LIR IBC hascall gcsafe
BB03 [0033]  2       BB01,BB02             1      18 [000..022)-> BB05(0.2),BB04(0.8)     ( cond )                     i LIR IBC
BB04 [0037]  1       BB03                  0.80   14 [011..012)-> BB05(1)                 (always)                     i LIR IBC hascall gcsafe
BB05 [0038]  2       BB03,BB04             1      18 [011..025)-> BB06(0.99),BB20(0.01)   ( cond )                     i LIR IBC
BB06 [0002]  1       BB05                  0.99   18 [027..03C)-> BB20(0.00595),BB17(0.994)   ( cond )                     i LIR IBC
BB20 [0046]  3       BB06,BB05,BB19        0.24    4 [025..114)                           (return)                     i LIR IBC
BB09 [0006]  1       BB18                  0.76   14 [062..064)                           (return)                     i LIR IBC
BB19 [0045]  2       BB10,BB17            36.93  665 [06C..074)-> BB17(0.994),BB20(0.00595)   ( cond )                     i LIR IBC bwd
BB17 [0043]  2       BB06,BB19            37.70  679 [03C..04D)-> BB19(0.195),BB18(0.805) ( cond )                     i LIR IBC bwd
BB18 [0044]  2       BB10,BB17           152.87 2752 [04D..062)-> BB10(0.995),BB09(0.005) ( cond )                     i LIR IBC bwd
BB10 [0007]  1       BB18                152.11 2738 [064..06C)-> BB18(0.805),BB19(0.195) ( cond )                     i LIR IBC bwd
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

It's the same case for Span.Sorting.BubbleSortSpan:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        100 [000..00A)-> BB02(1)                 (always)                     i LIR IBC
BB02 [0001]  2       BB01,BB11           168.07   16807 [00A..010)-> BB11(0.00597),BB10(0.994)   ( cond )                     i LIR IBC bwd bwd-target
BB10 [0009]  2       BB02,BB05           27969. 2796903 [010..026)-> BB05(0.0455),BB04(0.954)  ( cond )                     i LIR IBC bwd
BB04 [0003]  1       BB10                26696. 2669566 [026..052)-> BB05(1)                 (always)                     i LIR IBC bwd
BB05 [0004]  2       BB04,BB10           27969. 2796903 [052..05A)-> BB10(0.994),BB11(0.00597)   ( cond )                     i LIR IBC bwd
BB11 [0010]  2       BB02,BB05           168.07   16807 [05A..061)-> BB02(0.994),BB08(0.00595)   ( cond )                     i LIR IBC bwd
BB08 [0007]  1       BB11                  1.00     100 [061..062)                           (return)                     i LIR IBC
BB12 [0011]  0                             0            [???..???)                           (throw )                     i LIR rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Initial layout cost: 2924440.995680
Running greedy 3-opt pass.
Swapping partitions [BB10, BB04] and [BB05, BB05] (cost change = -77218.213563)
Swapping partitions [BB02, BB04] and [BB11, BB11] (cost change = -16506.722689)
Final layout cost: 2830716.059428

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1        100 [000..00A)-> BB02(1)                 (always)                     i LIR IBC
BB11 [0010]  2       BB02,BB05           168.07   16807 [05A..061)-> BB02(0.994),BB08(0.00595)   ( cond )                     i LIR IBC bwd
BB02 [0001]  2       BB01,BB11           168.07   16807 [00A..010)-> BB11(0.00597),BB10(0.994)   ( cond )                     i LIR IBC bwd bwd-target
BB05 [0004]  2       BB04,BB10           27969. 2796903 [052..05A)-> BB10(0.994),BB11(0.00597)   ( cond )                     i LIR IBC bwd
BB10 [0009]  2       BB02,BB05           27969. 2796903 [010..026)-> BB05(0.0455),BB04(0.954)  ( cond )                     i LIR IBC bwd
BB04 [0003]  1       BB10                26696. 2669566 [026..052)-> BB05(1)                 (always)                     i LIR IBC bwd
BB08 [0007]  1       BB11                  1.00     100 [061..062)                           (return)                     i LIR IBC
BB12 [0011]  0                             0            [???..???)                           (throw )                     i LIR rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

For hot loops, 3-opt finds it massively profitable to create fallthrough along a loop backedge, since the entry edges are considerably lower in weight, but this gives a worse loop shape overall. I think we'd be justified in inhibiting the creation of fallthrough along loop backedges if it would make the loop top-tested. A targeted implementation would be to detect loop backedges in ThreeOptLayout::ConsiderEdge, and if the edge is a conditional jump, we don't consider it. I'd prefer ConsiderEdge continue to only check layout invariants and not introduce its own profitability checks, so we could also try tweaking the cost model to give loop backedges a cost of zero such that we don't try to create fallthrough along them, but I don't know all of the ramifications of that just yet.

@amanasifkhalid
Copy link
Member Author

By restricting 3-opt from creating fallthrough along loop backedges, and restricting hot jump compaction from considering loop exit edges (#113101), layout is looking better. For Span.Sorting.BubbleSortSpan:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      18 [000..011)-> BB03(0.2),BB02(0.8)     ( cond )                     i LIR IBC
BB02 [0032]  1       BB01                  0.80   14 [000..001)-> BB03(1)                 (always)                     i LIR IBC hascall gcsafe
BB03 [0033]  2       BB01,BB02             1      18 [000..022)-> BB05(0.2),BB04(0.8)     ( cond )                     i LIR IBC
BB04 [0037]  1       BB03                  0.80   14 [011..012)-> BB05(1)                 (always)                     i LIR IBC hascall gcsafe
BB05 [0038]  2       BB03,BB04             1      18 [011..025)-> BB06(0.99),BB20(0.01)   ( cond )                     i LIR IBC
BB06 [0002]  1       BB05                  0.99   18 [027..03C)-> BB20(0.00595),BB17(0.994)   ( cond )                     i LIR IBC
BB17 [0043]  2       BB06,BB19            37.95  683 [03C..04D)-> BB19(0.196),BB18(0.804) ( cond )                     i LIR IBC bwd
BB18 [0044]  2       BB10,BB17           152.57 2746 [04D..062)-> BB10(0.995),BB09(0.005) ( cond )                     i LIR IBC bwd
BB10 [0007]  1       BB18                151.81 2733 [064..06C)-> BB18(0.804),BB19(0.196) ( cond )                     i LIR IBC bwd
BB19 [0045]  2       BB10,BB17            37.19  669 [06C..074)-> BB17(0.994),BB20(0.00595)   ( cond )                     i LIR IBC bwd
BB20 [0046]  3       BB06,BB05,BB19        0.24    4 [025..114)                           (return)                     i LIR IBC
BB09 [0006]  1       BB18                  0.76   14 [062..064)                           (return)                     i LIR IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

I'm sure these two changes will incur pretty big diffs, but I think they're focused enough that (hopefully) the effects will be predictable. Checking if an edge is a loop exit/backedge requires iterating up to every loop edge in the method, so we don't want to do this lookup every time in ConsiderEdge. We should have space to cache this information in FlowEdge without inflating its size.

@amanasifkhalid
Copy link
Member Author

With #113101, some of the 4-opt regressions have layouts that are further away from a locally-minimal cost. For example, without the change, this is what the layout of Span.Sorting.QuickSortSpan looks like:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0016]  1                             1     2667265 [???..???)-> BB07(1)                 (always)                     i LIR IBC internal
BB05 [0003]  1       BB07                  6.51 17354584 [028..02C)-> BB06(1)                 (always)                     i LIR IBC bwd bwd-target
BB06 [0004]  2       BB05,BB12             8.64 23046467 [02C..030)-> BB08(0.0478),BB07(0.952)  ( cond )                     i LIR IBC bwd bwd-target
BB07 [0005]  2       BB06,BB01             9.20 24549367 [030..03D)-> BB05(0.707),BB08(0.293) ( cond )                     i LIR IBC bwd bwd-src osr-entry
BB08 [0006]  3       BB06,BB07,BB09       10.38 27679929 [03D..047)-> BB14(0.0918),BB10(0.908)  ( cond )                     i LIR IBC bwd
BB10 [0009]  1       BB08                  9.43 25140181 [047..054)-> BB09(0.774),BB12(0.226) ( cond )                     i LIR IBC bwd bwd-src
BB09 [0007]  1       BB10                  7.29 19448297 [03F..043)-> BB08(1)                 (always)                     i LIR IBC bwd bwd-target
BB12 [0011]  1       BB10                  2.13  5691884 [058..080)-> BB06(1)                 (always)                     i LIR IBC bwd
BB14 [0013]  1       BB08                  0.95  2539748 [084..088)-> BB16(0.212),BB15(0.788) ( cond )                     i LIR IBC
BB15 [0014]  1       BB14                  0.75  2002351 [088..0A9)-> BB16(1)                 (always)                     i LIR IBC
BB16 [0015]  2       BB14,BB15             0.95  2539748 [0A9..0AA)-> BB17(1),BB20(0)         ( cond )                     i LIR IBC
BB17 [0019]  1       BB16                  0.95  2539748 [0A9..0AA)-> BB19(0.2),BB18(0.8)     ( cond )                     i LIR IBC
BB18 [0024]  1       BB17                  0.76  2031798 [0A9..0AA)-> BB19(1)                 (always)                     i LIR IBC hascall gcsafe
BB19 [0025]  2       BB17,BB18             0.95  2539748 [0A9..0B8)-> BB21(1),BB20(0)         ( cond )                     i LIR IBC hascall gcsafe
BB21 [0029]  1       BB19                  0.95  2539748 [0B7..0B8)-> BB23(0.2),BB22(0.8)     ( cond )                     i LIR IBC
BB22 [0034]  1       BB21                  0.76  2031798 [0B7..0B8)-> BB23(1)                 (always)                     i LIR IBC hascall gcsafe
BB23 [0035]  2       BB21,BB22             0.95  2539748 [00A..0C7)                           (return)                     i LIR IBC hascall gcsafe
BB20 [0028]  2       BB16,BB19             0           0 [0B7..0B8)                           (throw )                     i LIR IBC rare hascall gcsafe
BB24 [0037]  0                             0           0 [???..???)                           (throw )                     i LIR IBC rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

With it, instead of BB07 falling into the loop dominated by BB08, BB06 falls into it, despite the latter's edge into BB08 being colder:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0016]  1                             1     2696193 [???..???)-> BB07(1)                 (always)                     i LIR IBC internal
BB07 [0005]  2       BB06,BB01             9.76 26319504 [030..03D)-> BB05(0.707),BB08(0.293) ( cond )                     i LIR IBC bwd bwd-src osr-entry
BB05 [0003]  1       BB07                  6.90 18595325 [028..02C)-> BB06(1)                 (always)                     i LIR IBC bwd bwd-target
BB06 [0004]  2       BB05,BB12             9.20 24799553 [02C..030)-> BB08(0.0443),BB07(0.956)  ( cond )                     i LIR IBC bwd bwd-target
BB09 [0007]  1       BB10                  7.68 20706461 [03F..043)-> BB08(1)                 (always)                     i LIR IBC bwd bwd-target
BB08 [0006]  3       BB06,BB07,BB09       10.92 29447769 [03D..047)-> BB14(0.0862),BB10(0.914)  ( cond )                     i LIR IBC bwd
BB10 [0009]  1       BB08                  9.98 26910689 [047..054)-> BB09(0.769),BB12(0.231) ( cond )                     i LIR IBC bwd bwd-src
BB12 [0011]  1       BB10                  2.30  6204229 [058..080)-> BB06(1)                 (always)                     i LIR IBC bwd
BB14 [0013]  1       BB08                  0.94  2537079 [084..088)-> BB16(0.231),BB15(0.769) ( cond )                     i LIR IBC
BB15 [0014]  1       BB14                  0.72  1949882 [088..0A9)-> BB16(1)                 (always)                     i LIR IBC
BB16 [0015]  2       BB14,BB15             0.94  2537079 [0A9..0AA)-> BB17(1),BB20(0)         ( cond )                     i LIR IBC
BB17 [0019]  1       BB16                  0.94  2537079 [0A9..0AA)-> BB19(0.2),BB18(0.8)     ( cond )                     i LIR IBC
BB18 [0024]  1       BB17                  0.75  2029663 [0A9..0AA)-> BB19(1)                 (always)                     i LIR IBC hascall gcsafe
BB19 [0025]  2       BB17,BB18             0.94  2537079 [0A9..0B8)-> BB21(1),BB20(0)         ( cond )                     i LIR IBC hascall gcsafe
BB21 [0029]  1       BB19                  0.94  2537079 [0B7..0B8)-> BB23(0.2),BB22(0.8)     ( cond )                     i LIR IBC
BB22 [0034]  1       BB21                  0.75  2029663 [0B7..0B8)-> BB23(1)                 (always)                     i LIR IBC hascall gcsafe
BB23 [0035]  2       BB21,BB22             0.94  2537079 [00A..0C7)                           (return)                     i LIR IBC hascall gcsafe
BB20 [0028]  2       BB16,BB19             0           0 [0B7..0B8)                           (throw )                     i LIR IBC rare hascall gcsafe
BB24 [0037]  0                             0           0 [???..???)                           (throw )                     i LIR IBC rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

I probably shouldn't be trying to artificially inhibit 3-opt's cost model for such cases, like I am in #113101. For other 4-opt regressions, we have loops with unconditional back edges, and there doesn't seem to be some obviously optimal rotation of the loop. For example, from System.Collections.IterateForNonGeneric<String>.ArrayList:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       7736 [000..00D)-> BB15(0.01),BB27(0.99)   ( cond )                     i LIR IBC
BB27 [0029]  1       BB01                  0.99    7659 [???..???)-> BB15(0.01),BB14(0.99)   ( cond )                     LIR IBC internal
BB14 [0016]  2       BB06,BB27           496.69 3842423 [019..022)-> BB02(0.998),BB30(0.00197)   ( cond )                     i LIR IBC bwd
BB02 [0001]  1       BB14                495.71 3834841 [00D..00E)-> BB31(0),BB04(1)         ( cond )                     i LIR IBC nullcheck bwd bwd-target
BB04 [0011]  1       BB02                495.71 3834841 [00D..00E)-> BB06(1),BB31(0)         ( cond )                     i LIR IBC bwd
BB06 [0013]  1       BB04                495.71 3834841 [00D..019)-> BB14(1)                 (always)                     i LIR IBC idxlen nullcheck bwd
BB15 [0017]  2       BB01,BB27             0.02     154 [???..???)-> BB16(1)                 (always)                     LIR IBC internal
BB16 [0018]  2       BB15,BB25            10.08   78017 [019..019)-> BB17(0),BB18(1)         ( cond )                     i LIR IBC bwd bwd-src
BB18 [0020]  1       BB16                 10.08   78017 [???..???)-> BB19(1)                 (always)                     i LIR IBC internal bwd
BB19 [0021]  2       BB17,BB18            10.08   78017 [019..022)-> BB20(0.998),BB30(0.00197)   ( cond )                     i LIR IBC internal bwd bwd-src
BB20 [0022]  1       BB19                 10.06   77863 [00D..00D)-> BB21(0),BB22(1)         ( cond )                     i LIR IBC bwd bwd-target
BB22 [0024]  1       BB20                 10.06   77863 [00D..00E)-> BB31(0),BB23(1)         ( cond )                     i LIR IBC nullcheck bwd
BB23 [0025]  1       BB22                 10.06   77863 [00D..00E)-> BB24(1),BB31(0)         ( cond )                     i LIR IBC bwd
BB24 [0026]  1       BB23                 10.06   77863 [00D..00E)-> BB25(1)                 (always)                     i LIR IBC idxlen nullcheck bwd
BB25 [0027]  2       BB21,BB24            10.06   77863 [00D..019)-> BB16(1)                 (always)                     i LIR IBC internal bwd
BB30 [0032]  2       BB14,BB19             1.00    7736 [022..024)                           (return)                     i LIR IBC
BB17 [0019]  1       BB16                  0          0 [???..???)-> BB19(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB21 [0023]  1       BB20                  0          0 [???..???)-> BB25(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB31 [0033]  4       BB02,BB04,BB22,BB23   0          0 [00D..00E)                           (throw )                     i LIR IBC rare gcsafe newobj bwd
BB32 [0034]  0                             0          0 [???..???)                           (throw )                     i LIR IBC rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In this case, both loop heads BB14 and BB16 are reached by unconditional loop back edges. It might be worthwhile reviving #109346, though since the diffs from that were hard to triage, it may be easier to reintroduce the unconditional-to-conditional branch opts we used to run with the old layout. They sort of duplicate loop inversion, but because they run late in the frontend, diffs should be easier to parse.

amanasifkhalid added a commit that referenced this issue Mar 7, 2025
Part of #113108. When converting an unconditional jump to a test block into a duplicate test block, fgOptimizeBranch flips the condition (unnecessarily -- I plan to fix this someday), which means we need to derive the new true edge's likelihood from the cloned false edge, and vice versa. Now that late profile repair is enabled, if we get this wrong, the flipped edge likelihoods will flatten loop weights, and inflate non-loop weights. I suspect fixing this will also revert some of the regressions from late profile repair.
amanasifkhalid added a commit that referenced this issue Mar 8, 2025
Part of #107749. This is part of my broader goal of phasing out fgReorderBlocks entirely, but there are a few transformations we may want to do right before block layout. In particular, I noticed in #113108 (comment) that some regressed benchmarks have loops with unconditional backedges; in the absence of better loop inversion, we might want to run our unconditional-to-conditional branch duplication optimization right before layout to improve branch behavior in loops. We can put such optimizations in this pre-layout phase.
@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Mar 11, 2025

System.Collections.IterateForNonGeneric<T>:ArrayList(Size: 512) is another case where top-tested loops trip 3-opt up, and these benchmarks regressed on arm64, which has shown to be more resilient to layout changes than x64:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       7355 [000..00D)-> BB15(0.01),BB27(0.99)   ( cond )                     i LIR IBC
BB27 [0029]  1       BB01                  0.99    7281 [???..???)-> BB15(0.01),BB14(0.99)   ( cond )                     LIR IBC internal
BB14 [0016]  2       BB06,BB27           498.77 3668423 [019..022)-> BB02(0.998),BB30(0.00197)   ( cond )                     i LIR IBC bwd
BB02 [0001]  1       BB14                497.79 3661215 [00D..00E)-> BB31(0),BB04(1)         ( cond )                     i LIR IBC nullcheck bwd bwd-target
BB04 [0011]  1       BB02                497.79 3661215 [00D..00E)-> BB06(1),BB31(0)         ( cond )                     i LIR IBC bwd
BB06 [0013]  1       BB04                497.79 3661215 [00D..019)-> BB14(1)                 (always)                     i LIR IBC idxlen nullcheck bwd
BB15 [0017]  2       BB01,BB27             0.02     146 [???..???)-> BB16(1)                 (always)                     LIR IBC internal
BB20 [0022]  1       BB19                 10.11   74337 [00D..00D)-> BB21(0),BB22(1)         ( cond )                     i LIR IBC bwd bwd-target
BB22 [0024]  1       BB20                 10.11   74337 [00D..00E)-> BB31(0),BB23(1)         ( cond )                     i LIR IBC nullcheck bwd
BB23 [0025]  1       BB22                 10.11   74337 [00D..00E)-> BB24(1),BB31(0)         ( cond )                     i LIR IBC bwd
BB24 [0026]  1       BB23                 10.11   74337 [00D..00E)-> BB25(1)                 (always)                     i LIR IBC idxlen nullcheck bwd
BB25 [0027]  2       BB21,BB24            10.11   74337 [00D..019)-> BB16(1)                 (always)                     i LIR IBC internal bwd
BB16 [0018]  2       BB15,BB25            10.13   74484 [019..019)-> BB17(0),BB18(1)         ( cond )                     i LIR IBC bwd bwd-src
BB18 [0020]  1       BB16                 10.13   74484 [???..???)-> BB19(1)                 (always)                     i LIR IBC internal bwd
BB19 [0021]  2       BB17,BB18            10.13   74484 [019..022)-> BB20(0.998),BB30(0.00197)   ( cond )                     i LIR IBC internal bwd bwd-src
BB30 [0032]  2       BB14,BB19             1.00    7355 [022..024)                           (return)                     i LIR IBC
BB17 [0019]  1       BB16                  0          0 [???..???)-> BB19(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB21 [0023]  1       BB20                  0          0 [???..???)-> BB25(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB31 [0033]  4       BB02,BB04,BB22,BB23   0          0 [00D..00E)                           (throw )                     i LIR IBC rare gcsafe newobj bwd
BB32 [0034]  0                             0          0 [???..???)                           (throw )                     i LIR IBC rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note that for this benchmark, and the above ones with unconventional loop rotations, the external TSP optimizer failed to improve upon 3-opt's layout, so this isn't an algorithmic limitation we're hitting, but a limit with the cost model. I haven't yet tried tweaking the cost model by integrating control penalty constants, though this could be one way forward. Young et al's Near-optimal Intraprocedural Branch Alignment derives constants from the pipeline penalties from a particular ISA -- we don't have an obvious equivalent for our purposes. And if we did, changing the cost model would likely incur significant churn.

I'm increasingly motivated by these examples to implement some late form of loop inversion, where for each unconditional backedge, we consider cloning the loop's condition and making the backedge conditional. fgOptimizeBranch can already do this transformation; we just need to remove its requirement that one of the loop header's targets is right after the back edge's source block.

As a side note, it seems odd that fgOptimizeBranch duplicates much of loop inversion's functionality. I was thinking of fgOptimizeBranch being something we only run right before layout, but a longer-term goal might be for it to replace our current loop inversion implementation so that top-tested loops have an opportunity to match the bottom-tested patterns expected by loop opts.

@amanasifkhalid
Copy link
Member Author

Adding a pre-layout pass of fgOptimizeBranch (with some tweaks) seems to improve loop layout for the 4-opt regressions. Consider this snippet from System.Collections.IterateForNonGeneric<T>:ArrayList(Size: 512), before we've run any pre-layout flow opts:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       6993 [000..00D)-> BB15(0.01),BB27(0.99)   ( cond )                     i IBC
BB27 [0029]  1       BB01                  1       6993 [???..???)-> BB15(0.01),BB14(0.99)   ( cond )                     IBC internal
BB14 [0016]  1       BB27                  1       6993 [???..???)-> BB09(1)                 (always)                     IBC internal
BB02 [0001]  1       BB09                496.22 3470100 [00D..00E)-> BB31(0),BB04(1)         ( cond )                     i IBC nullcheck bwd bwd-target
BB04 [0011]  1       BB02                496.22 3470100 [00D..00E)-> BB06(1),BB31(0)         ( cond )                     i IBC bwd
BB06 [0013]  1       BB04                496.22 3470100 [00D..019)-> BB09(1)                 (always)                     i IBC idxlen nullcheck bwd
BB09 [0002]  2       BB06,BB14           497.21 3477024 [019..022)-> BB02(0.998),BB30(0.00199)   ( cond )                     i IBC bwd bwd-src

Instead of entering the loop from BB14 to check if it will execute at all, fgOptimizeBranch can clone the condition in BB09 and convert BB14 into a pre-loop test. What fgOptimizeBranch currently misses is the newfound opportunity to compact BB06 and BB09. It's a trivial transformation, but it denies 3-opt the opportunity to align the loop with an unconditional jump at the bottom. This incentivizes 3-opt to keep the loop bottom-tested, and move the exit edge's target up as well. Here's the new layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       6993 [000..00D)-> BB15(0.01),BB27(0.99)   ( cond )                     i LIR IBC
BB27 [0029]  1       BB01                  0.99    6923 [???..???)-> BB15(0.01),BB14(0.99)   ( cond )                     LIR IBC internal
BB14 [0016]  1       BB27                  0.98    6854 [???..???)-> BB30(0.00199),BB02(0.998)   ( cond )                     LIR IBC internal
BB02 [0001]  2       BB06,BB14           491.26 3435399 [00D..00E)-> BB31(0),BB04(1)         ( cond )                     i LIR IBC nullcheck bwd bwd-target
BB04 [0011]  1       BB02                491.26 3435399 [00D..00E)-> BB06(1),BB31(0)         ( cond )                     i LIR IBC bwd
BB06 [0013]  1       BB04                491.26 3435399 [00D..022)-> BB02(0.998),BB30(0.00199)   ( cond )                     i LIR IBC idxlen nullcheck bwd
BB30 [0032]  3       BB06,BB14,BB19        1.00    6993 [022..024)                           (return)                     i LIR IBC
BB15 [0017]  2       BB01,BB27             0.02     139 [???..???)-> BB16(1)                 (always)                     LIR IBC internal
BB16 [0018]  2       BB15,BB25             9.99   69892 [019..019)-> BB17(0),BB18(1)         ( cond )                     i LIR IBC bwd bwd-src
BB18 [0020]  1       BB16                  9.99   69892 [???..???)-> BB19(1)                 (always)                     i LIR IBC internal bwd
BB19 [0021]  2       BB17,BB18             9.99   69892 [019..022)-> BB20(0.998),BB30(0.00199)   ( cond )                     i LIR IBC internal bwd bwd-src
BB20 [0022]  1       BB19                  9.97   69753 [00D..00D)-> BB21(0),BB22(1)         ( cond )                     i LIR IBC bwd bwd-target
BB22 [0024]  1       BB20                  9.97   69753 [00D..00E)-> BB31(0),BB23(1)         ( cond )                     i LIR IBC nullcheck bwd
BB23 [0025]  1       BB22                  9.97   69753 [00D..00E)-> BB24(1),BB31(0)         ( cond )                     i LIR IBC bwd
BB24 [0026]  1       BB23                  9.97   69753 [00D..00E)-> BB25(1)                 (always)                     i LIR IBC idxlen nullcheck bwd
BB25 [0027]  2       BB21,BB24             9.97   69753 [00D..019)-> BB16(1)                 (always)                     i LIR IBC internal bwd
BB17 [0019]  1       BB16                  0          0 [???..???)-> BB19(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB21 [0023]  1       BB20                  0          0 [???..???)-> BB25(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB31 [0033]  4       BB02,BB04,BB22,BB23   0          0 [00D..00E)                           (throw )                     i LIR IBC rare gcsafe newobj bwd
BB32 [0034]  0                             0          0 [???..???)                           (throw )                     i LIR IBC rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

System.Collections.IterateFor<Int32>.IList(Size: 512) looks similar:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       7015 [000..00C)-> BB14(0.01),BB25(0.99)   ( cond )                     i LIR IBC
BB25 [0027]  1       BB01                  0.99    6945 [???..???)-> BB14(0.01),BB13(0.99)   ( cond )                     LIR IBC internal
BB13 [0015]  1       BB25                  0.98    6875 [???..???)-> BB28(0.00198),BB02(0.998)   ( cond )                     LIR IBC internal
BB02 [0001]  2       BB05,BB13           495.09 3473035 [00C..00D)-> BB05(1),BB29(0)         ( cond )                     i LIR IBC bwd bwd-target
BB05 [0012]  1       BB02                495.09 3473035 [00C..021)-> BB02(0.998),BB28(0.00198)   ( cond )                     i LIR IBC idxlen bwd
BB28 [0030]  3       BB05,BB13,BB18        1.00    7015 [021..023)                           (return)                     i LIR IBC
BB14 [0016]  2       BB01,BB25             0.02     140 [???..???)-> BB15(1)                 (always)                     LIR IBC internal
BB15 [0017]  2       BB14,BB23            10.07   70656 [018..018)-> BB16(0),BB17(1)         ( cond )                     i LIR IBC bwd bwd-src
BB17 [0019]  1       BB15                 10.07   70656 [???..???)-> BB18(1)                 (always)                     i LIR IBC internal bwd
BB18 [0020]  2       BB16,BB17            10.07   70656 [018..021)-> BB19(0.998),BB28(0.00198)   ( cond )                     i LIR IBC internal bwd bwd-src
BB19 [0021]  1       BB18                 10.05   70517 [00C..00C)-> BB20(0),BB21(1)         ( cond )                     i LIR IBC bwd bwd-target
BB21 [0023]  1       BB19                 10.05   70517 [00C..00D)-> BB22(1),BB29(0)         ( cond )                     i LIR IBC bwd
BB22 [0024]  1       BB21                 10.05   70517 [00C..00D)-> BB23(1)                 (always)                     i LIR IBC idxlen bwd
BB23 [0025]  2       BB20,BB22            10.05   70517 [00C..018)-> BB15(1)                 (always)                     i LIR IBC internal bwd
BB16 [0018]  1       BB15                  0          0 [???..???)-> BB18(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB20 [0022]  1       BB19                  0          0 [???..???)-> BB23(1)                 (always)                     i LIR IBC rare internal hascall gcsafe bwd
BB29 [0031]  2       BB02,BB21             0          0 [00C..00D)                           (throw )                     i LIR IBC rare gcsafe bwd
BB30 [0032]  0                             0          0 [???..???)                           (throw )                     i LIR IBC rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

amanasifkhalid added a commit that referenced this issue Mar 19, 2025
Part of #107749. fgReorderBlocks runs fgOptimizeBranch when it decides not to reorder a particular block. Turning off the old layout in favor of RPO layout in .NET 9 had the unintended consequence of also disabling the later fgOptimizeBranch pass the JIT used to do. After finding cases in #113108 that benefit from cloning and hoisting loop tests, I decided to reintroduce this late pass. These regressions also highlight that fgOptimizeBranch can create compaction opportunities inside loop bodies that we don't currently take advantage of; I've added a check to handle this.
@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Mar 21, 2025

#113491 is in, but it's a bit too early to see churn in the benchmarks. I've recomputed the above regression table with a few modifications to help track down the remaining interesting regressions:

  • I've excluded all regression reports related to JIT: Fix profile maintenance in optSetBlockWeights, funclet creation #111736 since I wasn't able to find any codegen changes for the most impacted benchmarks with that change included/reverted (comment).
  • I've also excluded regressions that I previously determined were caused by unrelated changes. For example, the math and Tensor primitive regressions in the original table surfaced from an intrinsic change that coincided with one of my flowgraph changes. These benchmarks are not layout-sensitive and can be ignored for our purposes.
  • For brevity's sake, I'm only looking at benchmarks that remain regressed by 10% or more from their baseline -- we have a pretty long tail of regressions within the margin of noise. Also, some benchmarks are noisy enough that they dip above and below the 10% threshold from run to run, and cannot be easily attributed to any change of mine; I've excluded these as well.

This yields a much shorter table, consisting of regressions smaller in magnitude than the original triage suggested.

@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Mar 21, 2025

Notes Recent Score Orig Score arm64 Ubuntu Surface Windows Viper Ubuntu Viper Windows Benchmark
#111873 1.32 1.31 1.13
1.13
1.55
1.52
Benchstone.BenchI.IniArray.Test
#103450 1.30 1.29 1.30
1.29
System.Memory.Span(Int32).IndexOfAnyFourValues(Size: 512)
#111915 1.27 1.22 1.27
1.22
System.Collections.Tests.Add_Remove_SteadyState(String).Queue(Count: 512)
#103450 1.25 1.27 1.27
1.29
System.Tests.Perf_Char.Char_IsLower(input: "Good afternoon, Constable!")
#103450 1.24 1.25 1.24
1.25
System.Memory.Span(Int32).IndexOfAnyFiveValues(Size: 512)
#111047 1.23 1.22 1.23
1.22
System.Reflection.Invoke.Field_SetStatic_int
#110277 1.21 1.13 1.08
1.09
1.35
1.17
MicroBenchmarks.Serializers.Xml_ToStream(IndexViewModel).XmlSerializer_
#110277 1.19 1.20 1.19
1.20
System.Tests.Perf_Char.Char_ToUpperInvariant(input: "Hello World!")
#111915 1.18 1.15 1.27
1.21
1.09
1.09
System.Collections.IterateForEachNonGeneric(String).ArrayList(Size: 512)
#110277 1.15 1.19 1.15
1.19
System.Collections.IterateForEach(String).ImmutableSortedSet(Size: 512)
#110277 1.18 1.20 1.15
1.16
1.20
1.25
Span.Sorting.QuickSortArray(Size: 512)
#111915 1.14 1.11 1.14
1.11
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 6, Options: NonBacktracking)
#111915 1.14 1.07 1.14
1.07
System.Collections.Tests.Perf_PriorityQueue(Int32, Int32).HeapSort(Size: 1000)
#110277 1.14 1.07 1.14
1.07
System.Collections.IndexerSet(Int32).SortedDictionary(Size: 512)
#103450 1.14 1.14 1.14
1.14
Span.IndexerBench.CoveredIndex2(length: 1024)
#110277 1.14 1.18 1.07
1.08
1.21
1.26
1.14
1.21
Span.Sorting.QuickSortSpan(Size: 512)
#110277 1.13 1.14 1.13
1.14
Benchstone.BenchF.Trap.Test
#111915 1.12 1.11 1.12
1.11
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 10, Options: NonBacktracking)
#111047 1.11 1.12 1.11
1.12
System.Collections.Sort(IntStruct).List(Size: 512)
#110277 1.10 1.10 1.10
1.10
Benchmark.GetChildKeysTests.AddChainedConfigurationEmpty
#110277 1.10 1.14 1.01
1.13
1.16
1.16
1.13
1.14
SeekUnroll.Test(boxedIndex: 3)
#111915 1.10 1.07 1.10
1.07
1.09
1.07
System.Tests.Perf_Enum.ToString_Flags(value: 32)

@amanasifkhalid
Copy link
Member Author

For System.Reflection.Invoke.StaticMethod5_arrayNotCached_int_string_struct_class_bool, I'm not seeing any codegen diffs locally with profile repair disabled. The only interesting behavior I've noticed is sometimes we never make it from Tier1-OSR to Tier1. Perhaps fluctuation in time to Tier1 explains some of the spikiness of this benchmark on Viper Windows.

I looked at the ADX query for this benchmark to see if profile repair regressed it on other platforms, and I didn't see anything interesting. However, the win-amd configuration improved around the time #113491 went in, so I suspect this regression will be closed soon:

Image

@amanasifkhalid
Copy link
Member Author

As I anticipated, the System.Collections.IterateForNonGeneric<T>.ArrayList(Size: 512) regressions look better now that #113491 is in. I'm going to consider those regressions fixed.

Image

@amanasifkhalid
Copy link
Member Author

#113491 resolves System.Collections.IterateFor<Int32>.IList(Size: 512) on Viper Windows (bottom), and almost closes the gap on Linux Ampere (top):

Image

In the latter case, it looks like 3-opt is failing to model some architecture-specific behavior, since I don't see similarly stubborn regressions on other platforms. This seems to be a theme for layout regressions: one platform regresses while another improves. Perhaps we could leverage hardware vendors' optimization guides to develop platform-specific cost models, though that's not in the cards for .NET 10.

For what it's worth, Tiger machines improved from #113491 as well:

Image

@amanasifkhalid
Copy link
Member Author

System.Collections.IterateForEach<Int32>.ImmutableSortedSet(Size: 512) was improved by 3-opt, but enabling 4-opt destabilized it. It recently stabilized at a worse baseline, but the commit range doesn't show anything layout-related. I don't see anything actionable here.

@amanasifkhalid
Copy link
Member Author

#111873 destabilized Benchstone.BenchI.IniArray.Test by changing how fgCalledCount is computed. For this particular benchmark, fgCalledCount looks about the same as before, but subtle rounding errors can sometimes change relative block weight computations during CSE. When this rounding matters, we see the benchmark regress, and when it doesn't, we see it return to its baseline.

Either way, the layout for this benchmark looks good:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight      IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       100 [000..00C)-> BB02(1)                 (always)                     i LIR IBC hascall newarr
BB02 [0001]  2       BB01,BB20           158.67  15867 [00C..010)-> BB03(1)                 (always)                     i LIR IBC idxlen bwd bwd-target
BB03 [0002]  2       BB02,BB03            2605. 260544 [010..01E)-> BB03(0.94),BB20(0.0597) ( cond )                     i LIR IBC idxlen bwd bwd-target
BB20 [0019]  1       BB03                157.09  15709 [01E..02A)-> BB02(0.994),BB13(0.00595)   ( cond )                     i LIR IBC bwd
BB13 [0012]  1       BB20                  0.99     99 [02A..02C)                           (return)                     i LIR IBC
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

#112151 may address the CSE regressions; I'll continue this work with that PR.

@amanasifkhalid
Copy link
Member Author

System.Reflection.Invoke.Field_SetStatic_int also regressed due to the JIT choosing different CSEs after the profile maintenance logic added in #111047 changed the CSE cost computations. Introducing a post-morph profile repair phase would level the playing field a bit by ensuring the CSE phase is working with a consistent profile relatively often; regressions remaining after this can be attributed to the mentioned flaws in CSE's cost model.

@amanasifkhalid
Copy link
Member Author

This leaves about 20 benchmarks that meaningfully regressed either due to 3-opt (#103450, #110277) or late profile repair (#111915), and many of the regressions stemming from the latter are layout-related. After analyzing the most significant regressions, it seems 3-opt successfully optimizes for its cost model often enough that we don't see cases in the perf lab where it clearly misses a globally-optimal solution. However, these remaining regressions show 3-opt's cost model does not always capture the nuances of the target architecture. @AndyAyersMS has proposed some ideas for building a more sophisticated model: Aside from trying to differentiate the costs of different types of branches, we could also try to model how each partition swap changes the lengths of the jumps within each partition. I still don't know of a way to compute this cheaply, but a brute-force solution could be useful for determining if we can gain anything from this.

For .NET 10, I don't plan on significantly changing 3-opt's cost model at this point. Thus, I'm going to start closing tracking issues for layout-related regressions, though I'll keep the profile/CSE ones open for now since I have some current/planned PRs that might help them. Before doing anything, I'll build a table of improvements corresponding to the PRs analyzed above so we can get a sense of what we're gaining from this tradeoff.

@amanasifkhalid
Copy link
Member Author

Notes Recent Score Orig Score arm64 Ubuntu Surface Windows Viper Ubuntu Viper Windows Benchmark
0.90 0.89 0.92
0.90
0.89
0.89
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 4, Options: NonBacktracking)
0.90 0.92 0.88
0.91
0.92
0.92
0.91
0.93
System.Collections.ContainsFalse(Int32).FrozenSet(Size: 512)
0.90 0.81 0.90
0.81
System.Collections.ContainsTrueComparer(Int32).ImmutableHashSet(Size: 512)
0.90 0.87 0.90
0.87
Microsoft.Extensions.Configuration.Xml.XmlConfigurationProviderBenchmarks.Load(FileName: "deep.xml")
0.90 0.91 0.90
0.91
MicroBenchmarks.Serializers.Json_ToStream(LoginViewModel).SystemTextJson_Reflection_
0.90 0.86 0.90
0.86
System.Collections.TryGetValueFalse(String, String).ImmutableDictionary(Size: 512)
0.90 0.65 0.83
0.64
0.97
0.67
Benchstone.BenchI.BubbleSort.Test
0.90 0.90 0.83
0.84
0.93
0.93
0.94
0.94
System.Collections.Perf_Frozen(NotKnownComparable).Contains_True(Count: 64)
0.90 0.88 0.88
0.87
0.91
0.89
System.Collections.IndexerSet(Int32).SortedList(Size: 512)
0.90 0.94 0.90
0.94
System.Buffers.Tests.ReadOnlySequenceTests(Byte).IterateGetPositionSingleSegment
0.90 0.94 0.90
0.94
System.Buffers.Tests.RentReturnArrayPoolTests(Byte).ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True)
0.90 0.90 0.89
0.91
0.89
0.89
0.92
0.91
System.Collections.ContainsTrueComparer(Int32).FrozenSet(Size: 512)
0.89 0.87 0.89
0.87
System.Tests.Perf_Random.Next_long_long
0.89 0.84 0.89
0.84
0.89
0.84
System.Collections.IterateForEach(Int32).Dictionary(Size: 512)
0.89 0.90 0.90
0.92
0.88
0.88
Benchstone.BenchI.Midpoint.Test
0.89 0.88 0.89
0.88
Span.Sorting.BubbleSortArray(Size: 512)
0.89 0.89 0.89
0.89
SeekUnroll.Test(boxedIndex: 3)
0.89 0.85 0.89
0.85
System.Memory.Span(Byte).IndexOfAnyFiveValues(Size: 33)
0.89 0.88 0.83
0.83
0.95
0.94
PerfLabTests.LowLevelPerf.ForeachOverList100Elements
0.89 0.88 0.89
0.88
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, None, False))
0.89 0.88 0.87
0.88
0.90
0.89
System.Collections.IterateForEach(Int32).ConcurrentStack(Size: 512)
0.89 0.89 0.89
0.89
System.Linq.Tests.Perf_Enumerable.WhereSelect(input: Array)
0.88 0.81 0.88
0.81
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 5, Options: NonBacktracking)
0.88 0.90 0.88
0.90
System.Numerics.Tensors.Tests.Perf_FloatingPointTensorPrimitives(Single).Ieee754Remainder_Vector(BufferLength: 128)
0.88 0.88 0.88
0.88
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, None, False))
0.88 0.85 0.88
0.85
System.Collections.Sort(BigStruct).Array_ComparerClass(Size: 512)
0.88 0.89 0.88
0.89
SeekUnroll.Test(boxedIndex: 1)
0.88 0.89 0.88
0.89
SeekUnroll.Test(boxedIndex: 19)
0.88 0.85 0.88
0.85
System.Collections.ContainsKeyTrue(Int32, Int32).SortedList(Size: 512)
0.87 0.89 0.87
0.89
System.Tests.Perf_Random.Next_long
0.87 0.86 0.87
0.86
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_BoostDocs_Simple.IsMatch(Id: 3, Options: NonBacktracking)
0.87 0.88 0.87
0.88
System.Collections.ContainsTrue(Int32).FrozenSet(Size: 512)
0.87 0.87 0.83
0.84
0.91
0.90
System.Tests.Perf_Random.ctor_seeded
0.87 0.89 0.89
0.90
0.84
0.89
System.Collections.ContainsKeyTrue(Int32, Int32).SortedDictionary(Size: 512)
0.87 0.90 0.87
0.90
System.Collections.Sort(BigStruct).Array_Comparison(Size: 512)
0.86 0.85 0.86
0.85
System.Collections.ContainsKeyFalse(Int32, Int32).SortedList(Size: 512)
0.86 0.88 0.86
0.88
System.Collections.IterateForEachNonGeneric(Int32).ArrayList(Size: 512)
0.86 0.87 0.92
0.92
0.80
0.82
System.Collections.TryGetValueTrue(Int32, Int32).SortedDictionary(Size: 512)
0.86 0.85 0.86
0.85
System.Collections.ContainsKeyTrue(String, String).ImmutableDictionary(Size: 512)
0.86 0.87 0.88
0.88
0.84
0.87
System.Collections.IndexerSet(Int32).SortedDictionary(Size: 512)
0.85 0.86 0.85
0.86
System.Text.Perf_Ascii.ToLower_Chars(Size: 128)
0.85 0.86 0.85
0.86
System.Tests.Perf_Char.Char_IsLower(input: "Good afternoon, Constable!")
0.85 0.79 0.85
0.79
System.Collections.ContainsTrue(String).Queue(Size: 512)
0.84 0.89 0.84
0.89
System.Collections.TryGetValueFalse(Int32, Int32).SortedList(Size: 512)
0.83 0.83 0.83
0.83
System.Collections.IterateForEach(Int32).List(Size: 512)
0.83 0.82 0.83
0.82
System.Net.Tests.Perf_WebUtility.Decode_NoDecodingRequired
0.83 0.87 0.83
0.87
System.Collections.ContainsTrue(Int32).ImmutableHashSet(Size: 512)
0.83 0.86 0.87
0.90
0.78
0.82
System.Linq.Tests.Perf_Enumerable.LastWithPredicate_FirstElementMatches(input: IOrderedEnumerable)
0.82 0.82 0.83
0.83
0.82
0.82
System.Collections.Perf_Frozen(Int16).Contains_True(Count: 64)
0.82 0.81 0.84
0.83
0.80
0.80
System.Collections.TryGetValueTrue(BigStruct, BigStruct).SortedList(Size: 512)
0.82 0.72 0.85
0.74
0.79
0.71
System.Collections.ContainsKeyFalse(Int32, Int32).SortedDictionary(Size: 512)
0.81 0.80 0.81
0.80
Benchstone.BenchI.XposMatrix.Test
0.81 0.79 0.81
0.79
System.Collections.ContainsFalse(Int32).ImmutableHashSet(Size: 512)
0.81 0.83 0.81
0.80
0.80
0.86
System.MathBenchmarks.Double.ScaleB
0.80 0.80 0.80
0.80
System.Collections.ContainsTrueComparer(Int32).SortedSet(Size: 512)
0.80 0.79 0.80
0.79
System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!")
0.79 0.77 0.79
0.77
System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 4)
0.79 0.86 0.79
0.86
Span.Sorting.BubbleSortSpan(Size: 512)
0.79 0.78 0.79
0.78
System.Collections.CreateAddAndClear(String).ConcurrentQueue(Size: 512)
0.78 0.75 0.64
0.60
0.85
0.83
0.87
0.84
System.Collections.IterateFor(Int32).ImmutableList(Size: 512)
0.78 0.74 0.65
0.60
0.84
0.84
0.87
0.79
System.Collections.IterateFor(String).ImmutableSortedSet(Size: 512)
0.78 0.83 0.78
0.83
System.Collections.ContainsTrue(String).ImmutableHashSet(Size: 512)
0.77 0.82 0.76
0.81
0.79
0.84
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512)
0.77 0.77 0.80
0.80
0.73
0.75
System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000)
0.76 0.73 0.67
0.63
0.87
0.84
System.Collections.IterateFor(Int32).ImmutableSortedSet(Size: 512)
0.76 0.90 0.79
0.89
0.73
0.92
System.Collections.IterateForEach(String).ImmutableHashSet(Size: 512)
0.76 0.73 0.76
0.73
System.Collections.TryGetValueFalse(Int32, Int32).SortedDictionary(Size: 512)
0.75 0.75 0.75
0.75
System.Collections.IterateForEach(Int32).FrozenSet(Size: 512)
0.74 0.78 0.53
0.64
0.84
0.82
0.91
0.91
System.Collections.ContainsKeyTrue(Int32, Int32).ImmutableDictionary(Size: 512)
0.73 0.73 0.69
0.69
0.75
0.75
0.74
0.74
Struct.SpanWrapper.WrapperSum
0.72 0.80 0.72
0.80
System.Memory.Span(Byte).IndexOfAnyFiveValues(Size: 4)
0.71 0.61 0.71
0.61
System.Collections.ContainsFalse(String).Queue(Size: 512)
0.71 0.75 0.80
0.74
0.62
0.76
System.Memory.Span(Int32).BinarySearch(Size: 4)
0.71 0.61 0.75
0.75
0.67
0.50
PerfLabTests.LowLevelPerf.ObjectStringIsString
0.69 0.81 0.69
0.81
System.Collections.IterateForEach(Int32).ImmutableHashSet(Size: 512)
0.69 0.55 0.67
0.67
1.00
0.50
0.50
0.50
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj
0.69 0.70 0.69
0.70
System.Collections.TryGetValueFalse(SmallClass, SmallClass).ImmutableDictionary(Size: 512)
0.67 0.73 0.84
0.79
0.62
0.72
0.57
0.68
System.Memory.Span(Int32).BinarySearch(Size: 33)
0.66 0.62 0.58
0.58
0.76
0.67
System.Collections.TryGetValueTrue(Int32, Int32).ImmutableDictionary(Size: 512)
0.66 0.61 0.66
0.61
System.Collections.ContainsTrue(String).List(Size: 512)
0.65 0.70 0.83
0.82
0.63
0.76
0.51
0.56
System.Memory.Span(Int32).BinarySearch(Size: 512)
0.64 0.65 0.68
0.67
System.Tests.Perf_Char.Char_IsUpper(input: "Good afternoon, Constable!")
0.64 0.69 0.64
0.69
System.Collections.IterateFor(String).ImmutableList(Size: 512)
0.60 0.70 0.60
0.70
System.Collections.ContainsFalse(String).ImmutableArray(Size: 512)
0.54 0.62 0.54
0.62
GuardedDevirtualization.TwoClassInterface.Call(testInput: pB = 0.50)
0.50 0.50 0.50
0.50
0.50
0.50
PerfLabTests.CastingPerf2.CastingPerf.IntObj
0.43 0.48 0.43
0.48
GuardedDevirtualization.TwoClassInterface.Call(testInput: pB = 0.40)
0.36 0.36 0.36
0.36
Struct.GSeq.FilterSkipMapSum
0.26 0.26 0.26
0.26
Benchstone.BenchF.NewtR.Test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

1 participant