Dynamic PGO #43618

AndyAyersMS · 2020-10-19T23:12:28Z

Epic for improving how the jit produces and consumes profile data, with an emphasis on the "dynamic" scenario where everything happens in-process.

Much of the work is also applicable to AOT PGO scenarios.

All non-stretch items are completed for .NET 6. We'll open a follow-on issue to capture the stretch items below and new work envisioned for .NET 7.

Link to related github project

Overview document: Dynamic PGO

(intro from that doc)

Profile based optimization relies heavily on the principle that past behavior is a good predictor of future behavior. Thus observations about past program behavior can steer optimization decisions in profitable directions, so that future program execution is more efficient.

These observations may come from the recent past, perhaps even from the current execution of a program, or from the distant past. Observations can be from the same version of the program or from different versions.

Observations are most often block counts, but can cover many different aspects of behavior; some of these are sketched below.

A number of important optimizations are really only practical when profile feedback is available. Key among these is aggressive inlining, but many other speculative, time-consuming, or size-expanding optimizations fall in this category.

Profile feedback is especially crucial in JIT-based environments, where compile time is at a premium. Indeed, one can argue that the performance of modern Java and Javascript implementations hinges crucially on effective leverage of profile feedback.

Profile guided optimization benefits both JIT and AOT compilation. While this document focuses largely on the benefits to JIT compilation, but much of what follows is also applicable to AOT. The big distinction is ease of use -- in a jitted environment profile based optimization can be done automatically, and so can be offered as a platform feature without requiring any changes to applications.

.NET currently has a somewhat arm's-length approach to profile guided optimization, and does not obtain much benefit from it. Significant opportunity awaits us if we can tap into this technology

.NET 6 Scenarios

[Dynamic PGO] Users can opt-into a dynamic PGO mode where their application is automatically profiled and optimized as methods are tiered up. Full benefit may require enabling other non-default behaviors (eg enabling QuickJitForLoops, disabling ReadyToRun).
[Static PGO] Users can collect profile data created by jit applied instrumentation and apply it to future runs of their application, both for jitted and prejitted code. This mode also replaces the current IBC technology and is used internally in our own builds.
[Sample based PGO] Users can provide sample data from performance profilers that the jit can ultimately leverage to improve code generation.

Work items

(stretch) indicates things that are not going make it into .NET 6.0.

Representation of Profile Data

use floating point for block weights (Float weights #44983)
(stretch) remove edge weight and min/max in favor of successor likelihood (in progress; JIT: refactor edge weight representation #46885)
(stretch) normalize counts from the start (JIT: use normalized counts throughout #46883)

Incorporation of profile data

ensure that 'fgComputeBlockAndEdgeWeights' is not overly pessimistic, or, remove it all together. (JIT: tolerate edge profile inconsistencies better. #50213, JIT: don't allow negative edge weights #52884)
(stretch) support for user annotations (RyuJIT: Allow developers to provide Branch Prediction Information #6225)
(stretch) move building of pred lists earlier
(stretch) implement profile synthesis
(stretch) implement blended/hedged profiles
make sure jit does reasonable things when caller has no PGO, but callee does (JIT: fixes for mixed PGO/nonPGO compiles #50633) (see this note). Common now even with dynamic PGO, now that we've enabled minimal probing and so won't have profile data for simple "wrapper" methods.

Heuristics and Optimization

Instrumentation

(stretch) handle loops to offset 0 better (JIT: use normalized counts throughout #46883)
minimum weight spanning tree instrumentation (JIT: efficient profiling schemes #46882, via JIT: refactor instrumentation code #47509, JIT: run instrumentation phase just after importing #47476, JIT: split up some parts of flowgraph.cpp #47072, JIT: let instrumentor decide which blocks to process #47597, JIT: fix issues with profile incorporation phase #47723, JIT: fix interaction of PGO and jitstress #47876 and finally Spanning tree instrumentation #47959)
omit probes for single-block methods (JIT: update jit config defaults for PGO #49267)
(stretch) better tier0 codegen for probe sequences
probes for class types (Initial version of class profiling for PGO #45133)
(stretch) Fine tuning of class probes (PGO: class profile details we need to get right #48549)
(stretch) consider devirtualizing in tier0 to reduce class probe overhead (need to revisit this)
(stretch) Complications with instrumentation and OSR (JIT: resolve issues with OSR and PGO #47942)

Sample Based PGO

Prototype the ability for the jit to consume sampled profile data (Add SPGO support and MIBC comparison in dotnet-pgo #52765)
- Validate accuracy of sampled profiles on JIT benchmarks
- Validate accuracy of sampled profiles on training scenarios
(stretch) Consider strategies for incorporating stale profile data
(stretch) Store debug annotations on the side
Track IL offsets when inlining (Start tracking debug info for inlined statements #61220)
Improve accuracy of IL offsets in optimized code (Some more precise debug info improvements #61419)

Runtime

enable reading of prejit profile data when jitting (Provide block counts in tiered compilation from R2R images #13672. Support via crossgen2 R2R is in Pgo phase3 #47558; data flowing into assembly prejitting is in Enable the latest managed pgo data #49793)
(stretch) time and space efficient type probes
(stretch) other heuristics to drive jit speculation on types

Maintenance

Add profile data consistency checker (JIT: initial version of a profile checker #42481)
(stretch) Implement profile reconstruction scheme
Uncover and fix significant maintenance issues (JIT: make profile data available to inlinees #42277, JIT: some small profile related fixes #43408, JIT: change loop inversion edge weight updates and add phase #48364, PGO considerations for finally cloning #48925, loop cloning and pgo #48850, JIT: profile updates for finally optimizations #49139)
Consider allowing inlinees to "scale up" their counts if call site count is greater than inlinee entry count (JIT: allow inlinee profile scale-up #48280)

Debugging and Diagnostics

Create modes where PGO data is readily available
Create tools to analyze profile data
- virtual and interface call analysis tool
Ensure Dynamic PGO works properly with SPMI (in progress; Fix SPMI to handle replays of BBINSTR jit method contexts #41386, SPMI: make method identity dependent on jit flags and isa flags #48082, SPMI: tolerate null pResolvedToken #48208, SPMI: adjust near differ offset compare logic #48245, Add MCS verb to dump jit flags histogram #48281)
Graphical dump of flow graph with profile data (JIT: show profile data for dot flowgraph dumps #42657)
Tooling for verifying PGO is working as expected (as in Fix weight computation in jit #47470)
(stretch) Methodology for asm diffs with (dynamic) PGO

Testing and CI

Add suitable testing to inner and outerloop CI (Add pgo testing to outerloop #53301)
Add SPMI collections that use dynamic PGO (not yet automated)
Cross-verify that devirtualization implies 100% likely class profiles (JIT: pgo/devirt diagnostic improvements #53247)

Performance

Look at performance on realistic apps (PGO measurements on TE)
Look at overhead of instrumentation

Related issues:

Also of note:

Dynamic PGO should provide one route to enable Guarded Devirtualization
The analysis of the experiment to develop inlining heuristics via machine learning suggests it would have been more successful if there was profile feedback.

category:planning
theme:planning
skill-level:expert
cost:large

JulieLeeMSFT · 2020-11-09T19:57:48Z

Bolded "in progress" work item.

EgorBo · 2020-11-10T00:10:02Z

Quick idea: don't expand x / c or x % c into set of magic operations in cold blocks, e.g.
x % 10 is currently optimized into:

       BA67666666           mov      edx, 0x66666667
       8BC2                 mov      eax, edx
       F7E9                 imul     edx:eax, ecx
       8BC2                 mov      eax, edx
       C1E81F               shr      eax, 31
       C1FA02               sar      edx, 2
       03C2                 add      eax, edx
       8D0480               lea      eax, [rax+4*rax]
       03C0                 add      eax, eax
       2BC8                 sub      ecx, eax
       8BC1                 mov      eax, ecx

while could be just

       8BC1                 mov      eax, ecx
       99                   cdq      
       41F7F8               idiv     edx:eax, 10
       8BC2                 mov      eax, edx

benaadams · 2021-04-13T23:49:17Z

Anecdotal evidence of PGO making SIMD worse:

However, @aalmada hasn't been able to isolate it for a repo/issue https://twitter.com/AntaoAlmada/status/1382033309052588036

AndyAyersMS · 2021-04-14T01:23:49Z

Happy to investigate if there's a repro.

Couple random thoughts:

I've seen BDN struggle at times to get the right number of iterations with TieredPGO enabled. At the very least don't try a fixed iteration strategy, it may take longer to reach peak perf with PGO (let me know, if so)
Depending on how the code is written, it's possible to get stuck in Tier0 codegen -- or at least spend more time there than is healthy --though less likely with BDN. You can try COMPlus_TC_OnStackReplacement=1 to see if it counteracts this.
We are still sorting through PGO issues, so it's also possible you are running into one of those. 6P3 may be pretty rough in some spots. From my standpoint, regressions seen with current main builds will be more interesting.

aalmada · 2021-04-27T09:26:11Z

@AndyAyersMS I submitted an issue with more information #51915

AndyAyersMS · 2021-06-09T20:54:47Z

Here's a current comparison of roughly 3400 microbenchmarks running with a variety of PGO configurations on Windows x64.

In .NET 6.0, the default behavior is to use the Static PGO data available in the framework assemblies. This configuration is the
baseline run for all data in the table below.

All measurements are done via Benchmark.NET, which (in principle) should be measuring the performance of Tier1 jitted code.

The configurations measured are:

No PGO inhibits the JIT from using the Static PGO data when generating Tier1 code. This configuration is only interesting in that it allows us to gauge the impact of Static PGO.
Dynamic PGO enables the JIT to do in-process profiling of methods that pass through Tier0, and then use this data at Tier1. If a method bypasses Tier0, the Static PGO data (if any) is used instead.
Full PGO forces all methods to pass through Tier0, so all methods at Tier1 can leverage the in-process profile data. It imposes a startup premium on an application in order to obtain the best steady-state performance.

The data below shows crude histograms of the ratios of baseline to configuration performance for the microbenchmarks. Values less than 1.0 mean that the baseline is running faster than the configuration; values larger than 1.0 mean that the configuration is running faster than the baseline.

The last entry shows the geometric mean of the ratios; this gives a rough figure of merit for the entire configuration.

From the No PGO data, we can see that Static PGO (the default configuration) provide an overall improvement of around 1.5% on microbenchmark performance.

Dynamic PGO offers roughly 1% improvement over default (so a 2.5% impact over no PGO).

Full PGO offers roughly 6% improvement over default (so 7.5% impact over no PGO).

Base/Diff Ratio	No Pgo	Dynamic Pgo	Full Ppgo
< 0.5	16	38	53
0.500	1	1	3
0.525	1	1	3
0.550	2	2	6
0.575	2	3	2
0.600	1	2	12
0.625	6	2	10
0.650	3	3	5
0.675	5	1	7
0.700	2	2	6
0.725	3	3	5
0.750	12	3	18
0.775	11	13	17
0.800	16	9	18
0.825	23	25	35
0.850	32	33	38
0.875	47	46	43
0.900	82	45	85
0.925	140	82	135
0.950	272	175	195
0.975	1317	933	664
1.000	1065	1032	610
1.025	164	374	287
1.050	82	216	206
1.075	61	136	159
1.100	30	64	116
1.125	13	52	81
1.150	6	31	55
1.175	12	21	41
1.200	11	26	31
1.225	6	14	53
1.250	2	7	13
1.275	5	11	20
1.300	0	7	7
1.325	4	2	7
1.350	2	5	8
1.375	1	4	6
1.400	3	1	13
1.425	0	0	13
1.450	0	5	5
1.475	2	3	9
>= 1.5	11	38	210
Geomean	0.984	1.012	1.062

AndyAyersMS · 2021-06-10T01:16:03Z

No PGO: Extreme Results

Taking a look at the best and worst results for No PGO, here's what we see.

Note the comparison sense here is opposite, lower ratios are better for these results, as the baseline is Static PGO (default), and the configuration is no PGO. So the test that benefits most from static PGO is System.Memory.ReadOnlySpan.StringAsSpan.

~~Bottom~~ Top 20 results

Base	Diff	Base/Diff	Test
0.012	0.33	0.036	System.Memory.ReadOnlySpan.StringAsSpan
0.025	0.65	0.038	System.Memory.Constructors(Byte).ArrayAsSpan
0.027	0.66	0.041	System.Memory.Constructors(Byte).SpanImplicitCastFromArray
0.028	0.65	0.043	System.Memory.Constructors(Byte).ReadOnlySpanImplicitCastFromArray
0.022	0.34	0.064	System.Memory.MemoryMarshal(Int32).GetReference
0.03	0.33	0.089	System.Memory.MemoryMarshal(Byte).AsBytes
0.042	0.36	0.116	System.Memory.Constructors(Byte).SpanFromArray
0.042	0.33	0.129	System.Memory.MemoryMarshal(Int32).CastToByte
0.16	1.2	0.135	System.Numerics.Tests.Perf_Vector4.MaxBenchmark
0.05	0.36	0.139	System.Memory.MemoryMarshal(Byte).GetReference
0.06	0.37	0.162	System.Memory.Constructors(Byte).ReadOnlySpanFromArray
0.39	2.2	0.179	System.Numerics.Tests.Perf_Vector4.MultiplyOperatorBenchmark
0.078	0.24	0.324	System.Numerics.Tests.Perf_Vector4.SquareRootBenchmark
0.31	0.66	0.473	System.Tests.Perf_Char.GetUnicodeCategory(c: 'a')
0.31	0.66	0.475	System.Tests.Perf_Char.GetUnicodeCategory(c: '.')
2.2	4.6	0.477	System.Tests.Perf_Int32.ToString(value: 4)
0.34	0.67	0.518	System.Memory.MemoryMarshal(Int32).AsBytes
0.36	0.68	0.533	System.Memory.MemoryMarshal(Byte).CastToByte
2.1	3.8	0.559	System.Tests.Perf_Int16.ToString(value: 0)
2.5	4.4	0.562	System.Tests.Perf_String.TrimStart(s: "Test")

The charts for the top 5 results show these look like consistent wins for Static PGO (blue is baseline == Static PGO, purple is No PGO, comparison above is just the last two data points).

Looking at the test history for StringAsSpan we can see a clear perf drop when Static PGO data was first included in our assemblies:

~~Top~~ Bottom 20 results

Base	Diff	Base/Diff	Test
1.3E+06	4.9E+05	2.588	System.Diagnostics.Perf_Process.GetProcessesByName
1.6E+03	6.1E+02	2.571	System.Collections.CtorDefaultSize(Int32).ConcurrentBag
0.63	0.27	2.288	System.Numerics.Tests.Perf_Vector2.MultiplyByScalarOperatorBenchmark
0.82	0.41	2.004	System.Tests.Perf_Type.GetTypeFromHandle
0.31	0.16	1.926	System.Numerics.Tests.Perf_Vector4.AddOperatorBenchmark
5.7E+05	3.3E+05	1.730	System.Collections.Sort(String).Array_ComparerStruct(Size: 512)
6.7E+03	4.1E+03	1.633	System.Collections.ContainsKeyFalse(Int32, Int32).Dictionary(Size: 512)
2.5	1.5	1.617	System.Tests.Perf_Boolean.ToString(value: False)
8.8E+03	5.6E+03	1.576	System.Buffers.Tests.RentReturnArrayPoolTests(Object).SingleParallel(RentalSize: 4096, ManipulateArray: True, Async: True, UseSharedPool: True)
0.91	0.6	1.522	System.Numerics.Tests.Constructor.ConstructorBenchmark_UInt64
5.7E+03	3.8E+03	1.515	System.Collections.Sort(Int32).Array(Size: 512)
1E+05	6.9E+04	1.491	System.Collections.CreateAddAndClear(String).ConcurrentDictionary(Size: 512)
1.8E+05	1.2E+05	1.483	System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 1)
1.5E+03	1.1E+03	1.419	System.Collections.CtorDefaultSize(String).ConcurrentBag
0.01	0.0072	1.410	System.Numerics.Tests.Perf_Quaternion.EqualityOperatorBenchmark
1.9E+03	1.4E+03	1.402	System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, OrdinalIgnoreCase))
9E+04	6.5E+04	1.388	System.Collections.TryAddDefaultSize(String).ConcurrentDictionary(Count: 512)
4	2.9	1.371	System.Numerics.Tests.Perf_Plane.TransformByQuaternionBenchmark
4.4	3.2	1.358	PerfLabTests.CastingPerf.CheckArrayIsArrayByVariance
1.1E+03	8.1E+02	1.342	System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 100, TestCase: LongSingleLine)

The 5 "worst" results here are on fairly noisy tests (blue is baseline, purple is No PGO, comparison is just the last two data points). I don't have a good way of controlling for noise just yet, so one might in fact question the summary analysis above, but one hopes over such a large number of tests these noise issues average out.

So it is going to take some more detailed analysis to track down tests that clearly regress with the Static PGO data.

AndyAyersMS · 2021-06-10T01:29:30Z

Dynamic PGO: Extreme Results

Here Static PGO is the baseline, and Dynamic PGO the diff. So higher is better for Dynamic PGO, and the "bottom" results are tests that fare poorly with Dynamic PGO.

We suspect that some of these very poor results are cases where BenchmarkDotNet doesn't run enough iterations to get all the key methods running Tier1 code, but this needs more investigation.

Bottom 20 results

Base	Diff	Base/Diff	Test
0.011	0.33	0.034	System.Memory.Constructors_ValueTypesOnly(Int32).SpanFromPointerLength
0.02	0.35	0.055	System.Memory.Constructors(Byte).ReadOnlySpanFromArrayStartLength
0.021	0.33	0.064	System.Memory.Constructors_ValueTypesOnly(Byte).ReadOnlyFromPointerLength
0.023	0.33	0.070	System.Memory.Constructors_ValueTypesOnly(Byte).SpanFromPointerLength
1.2E+03	1.1E+04	0.110	MicroBenchmarks.Serializers.Json_ToStream(LoginViewModel).DataContractJsonSerializer_
0.042	0.36	0.115	System.Memory.Constructors(Byte).SpanFromArray
0.069	0.53	0.130	System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark
0.042	0.31	0.137	System.Memory.MemoryMarshal(Int32).CastToByte
0.074	0.53	0.139	System.Numerics.Tests.Perf_Quaternion.NegateBenchmark
0.05	0.35	0.141	System.Memory.MemoryMarshal(Byte).GetReference
0.077	0.53	0.145	System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark
0.059	0.37	0.162	System.Memory.Constructors(Byte).MemoryMarshalCreateReadOnlySpan
0.06	0.37	0.163	System.Memory.Constructors(Byte).ReadOnlySpanFromArray
2.7E+03	1.6E+04	0.175	MicroBenchmarks.Serializers.Json_ToStream(Location).DataContractJsonSerializer_
0.094	0.51	0.184	System.Numerics.Tests.Perf_Plane.CreateFromScalarXYZDBenchmark
0.095	0.48	0.198	System.Numerics.Tests.Perf_Plane.CreateFromVector3WithScalarDBenchmark
0.12	0.42	0.279	System.Numerics.Tests.Perf_Vector2.CreateFromScalarXYBenchmark
0.16	0.55	0.283	System.Numerics.Tests.Perf_Vector4.CreateFromScalar
0.12	0.42	0.284	System.Numerics.Tests.Perf_Vector2.UnitYBenchmark
0.12	0.42	0.288	System.Numerics.Tests.Perf_Vector2.UnitXBenchmark

Top 20 results

Base	Diff	Base/Diff	Test
0.67	8.5E-05	7840.619	Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Count_Array
0.31	0.00019	1639.858	System.Tests.Perf_Lazy.ValueFromAlreadyInitialized
0.33	0.00058	570.942	System.Memory.Constructors(Byte).ArrayAsSpanStartLength
0.33	0.0059	55.997	System.Memory.MemoryMarshal(Int32).CastToInt
0.36	0.014	26.559	System.Memory.MemoryMarshal(Byte).CastToByte
0.37	0.026	14.201	System.Memory.Constructors(Byte).SpanFromArrayStartLength
0.0054	0.00058	9.315	Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Ctor_String
0.53	0.073	7.231	System.Numerics.Tests.Perf_Quaternion.SubtractBenchmark
0.52	0.078	6.610	System.Numerics.Tests.Perf_Quaternion.DivideBenchmark
0.0049	0.00093	5.250	System.Collections.Concurrent.IsEmpty(Int32).Stack(Size: 0)
0.52	0.11	4.588	System.Numerics.Tests.Perf_Quaternion.AddOperatorBenchmark
0.52	0.12	4.402	System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark
0.52	0.12	4.279	System.Numerics.Tests.Perf_Quaternion.AddBenchmark
0.37	0.089	4.189	System.Numerics.Tests.Perf_Vector4.NegateOperatorBenchmark
1.4E+04	3.5E+03	4.015	System.Collections.Concurrent.Count(String).Dictionary(Size: 512)
0.4	0.1	3.776	System.Numerics.Tests.Perf_Vector2.NegateBenchmark
0.4	0.11	3.736	System.Numerics.Tests.Perf_Vector2.NegateOperatorBenchmark
0.42	0.11	3.711	Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Indexer_FirstElement_Array
0.37	0.12	2.988	System.Numerics.Tests.Perf_Vector4.NegateBenchmark
1.6E+03	6E+02	2.599	System.Buffers.Tests.RentReturnArrayPoolTests(Object).SingleParallel(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False)

Not clear yet what's going on with some of these tests with outsized gains. Will fill in with analysis when I have it. Suspect the running time of these tests is so short that the measurement is below BDN's noise floor.

AndyAyersMS · 2021-06-10T01:30:44Z

Full PGO: Extreme Results

(more details as I have time to fill them in)

Bottom 20 results

Base	Diff	Base/Diff	Test
0.011	0.32	0.034	System.Memory.Constructors_ValueTypesOnly(Int32).SpanFromPointerLength
0.022	0.61	0.036	System.Memory.MemoryMarshal(Int32).GetReference
2.1E+08	5.5E+09	0.037	Burgers.Test1
2.6E+08	5.5E+09	0.047	Burgers.Test0
0.013	0.15	0.091	System.Collections.Concurrent.IsEmpty(String).Stack(Size: 0)
0.03	0.32	0.094	System.Memory.MemoryMarshal(Byte).AsBytes
0.05	0.5	0.099	System.Memory.MemoryMarshal(Byte).GetReference
0.0049	0.047	0.104	System.Collections.Concurrent.IsEmpty(Int32).Stack(Size: 0)
0.042	0.3	0.139	System.Memory.Constructors(Byte).SpanFromArray
9.7E+07	6.9E+08	0.142	LinqBenchmarks.Count00ForX
0.0038	0.025	0.155	Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Ctor_Array
0.0054	0.032	0.166	Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Ctor_String
0.078	0.46	0.170	System.Numerics.Tests.Perf_Vector3.ZeroBenchmark
5.6E+07	3.2E+08	0.174	Benchstone.BenchI.IniArray.Test
0.042	0.23	0.181	System.Memory.MemoryMarshal(Int32).CastToByte
0.06	0.31	0.196	System.Memory.Constructors(Byte).ReadOnlySpanFromArray
5E+08	2.5E+09	0.203	Benchstone.BenchI.Array2.Test
0.082	0.38	0.213	System.Numerics.Tests.Perf_Vector3.CreateFromScalarXYZBenchmark
1.6E+08	7.5E+08	0.213	LinqBenchmarks.Where00ForX
0.074	0.34	0.217	System.Numerics.Tests.Perf_Vector4.UnitZBenchmark

Top 20 results

Base	Diff	Base/Diff	Test
0.52	0.00018	2904.092	System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark
0.26	0.00023	1135.963	System.Numerics.Tests.Perf_Vector2.LengthSquaredBenchmark
0.88	0.0045	196.242	System.Numerics.Tests.Perf_Vector4.EqualsBenchmark
0.39	0.0024	163.275	System.Numerics.Tests.Perf_Vector4.SubtractOperatorBenchmark
0.057	0.0017	33.583	System.Memory.Constructors(Byte).ReadOnlySpanImplicitCastFromSpan
0.42	0.013	32.425	Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Indexer_FirstElement_Array
0.18	0.0067	26.641	System.Numerics.Tests.Perf_Vector3.OneBenchmark
0.92	0.035	26.273	System.Numerics.Tests.Perf_Vector2.EqualityOperatorBenchmark
0.4	0.019	21.239	System.Numerics.Tests.Perf_Vector2.NegateOperatorBenchmark
0.39	0.023	17.374	System.Numerics.Tests.Perf_Vector4.MultiplyOperatorBenchmark
0.16	0.0093	17.101	System.Numerics.Tests.Perf_Vector4.MaxBenchmark
0.33	0.022	14.821	System.Memory.Constructors(Byte).ArrayAsSpanStartLength
0.63	0.049	12.907	System.Numerics.Tests.Perf_Vector2.LerpBenchmark
0.64	0.068	9.424	System.Numerics.Tests.Perf_Vector2.TransformByQuaternionBenchmark
0.29	0.032	9.018	System.Numerics.Tests.Perf_Vector4.LengthSquaredBenchmark
0.43	0.051	8.551	System.Numerics.Tests.Perf_Vector2.MultiplyFunctionBenchmark
0.36	0.05	7.145	System.Memory.MemoryMarshal(Byte).CastToByte
0.6	0.085	7.073	XmlDocumentTests.XmlNodeTests.Perf_XmlNode.GetValue
0.44	0.066	6.621	System.Numerics.Tests.Perf_Vector2.AddOperatorBenchmark
0.45	0.069	6.445	System.Numerics.Tests.Perf_Vector2.DivideByVector2Benchmark

AndyAyersMS · 2021-06-10T21:13:54Z

TechEmpower Results (mean RPS)

EgorBo · 2021-06-11T15:19:03Z

@AndyAyersMS the latest round of TE benchmark for the inliner: #52708 (comment)

I'm also watching other metrics like time-to-first response, latency, memory, etc. I wonder how fast we can go so I'm testing a more aggressive version at the moment.

AndyAyersMS · 2021-07-08T21:40:28Z

Updated TechEmpower Results (mean RPS)

AndyAyersMS · 2021-07-12T18:02:27Z

Closing per updated top comment:

All non-stretch items are completed for .NET 6. We'll open a follow-on issue to capture the stretch items below and new work envisioned for .NET 7.

AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Team Epic labels Oct 19, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 19, 2020

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Oct 19, 2020

BruceForstall added this to the 6.0.0 milestone Oct 20, 2020

BruceForstall assigned AndyAyersMS Oct 20, 2020

JulieLeeMSFT mentioned this issue Nov 16, 2020

Developers can progressively optimize their .NET apps at runtime, using dynamic PGO data dotnet/core#5494

Closed

jeffschwMSFT added User Story A single user-facing feature. Can be grouped under an epic. Bottom Up Work Not part of a theme, epic, or user story and removed Team Epic labels Jan 20, 2021

JulieLeeMSFT mentioned this issue Jan 28, 2021

What's new in .NET 6 Preview 1 dotnet/core#5853

Closed

JulieLeeMSFT mentioned this issue Feb 24, 2021

What's new in .NET 6 Preview 2 dotnet/core#5889

Closed

JulieLeeMSFT mentioned this issue Mar 26, 2021

What's new in .NET 6 Preview 3 dotnet/core#5890

Closed

JulieLeeMSFT mentioned this issue Apr 27, 2021

What's new in .NET 6 Preview 4 dotnet/core#6098

Closed

JulieLeeMSFT mentioned this issue Jun 8, 2021

What's new in .NET 6 Preview 5 dotnet/core#6099

Closed

jakobbotsch mentioned this issue Jun 23, 2021

.NET Core 3.1 integer performance 12% slower than .NET 4.8 #34414

Closed

JulieLeeMSFT mentioned this issue Jun 29, 2021

What's new in .NET 6 Preview 6 [WIP] dotnet/core#6325

Closed

AndyAyersMS closed this as completed Jul 12, 2021

JulieLeeMSFT mentioned this issue Aug 10, 2021

What's new in .NET 6 Preview 7 dotnet/core#6444

Closed

ghost locked as resolved and limited conversation to collaborators Aug 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dynamic PGO #43618

Dynamic PGO #43618

AndyAyersMS commented Oct 19, 2020 •

edited

Loading

JulieLeeMSFT commented Nov 9, 2020

EgorBo commented Nov 10, 2020

benaadams commented Apr 13, 2021

AndyAyersMS commented Apr 14, 2021

aalmada commented Apr 27, 2021

AndyAyersMS commented Jun 9, 2021 •

edited

Loading

AndyAyersMS commented Jun 10, 2021

AndyAyersMS commented Jun 10, 2021

AndyAyersMS commented Jun 10, 2021

AndyAyersMS commented Jun 10, 2021 •

edited

Loading

EgorBo commented Jun 11, 2021 •

edited

Loading

AndyAyersMS commented Jul 8, 2021

AndyAyersMS commented Jul 12, 2021

Dynamic PGO #43618

Dynamic PGO #43618

Comments

AndyAyersMS commented Oct 19, 2020 • edited Loading

Overview document: Dynamic PGO

.NET 6 Scenarios

Work items

Representation of Profile Data

Incorporation of profile data

Heuristics and Optimization

Instrumentation

Sample Based PGO

Runtime

Maintenance

Debugging and Diagnostics

Testing and CI

Performance

Related issues:

Also of note:

JulieLeeMSFT commented Nov 9, 2020

EgorBo commented Nov 10, 2020

benaadams commented Apr 13, 2021

AndyAyersMS commented Apr 14, 2021

aalmada commented Apr 27, 2021

AndyAyersMS commented Jun 9, 2021 • edited Loading

AndyAyersMS commented Jun 10, 2021

No PGO: Extreme Results

AndyAyersMS commented Jun 10, 2021

Dynamic PGO: Extreme Results

AndyAyersMS commented Jun 10, 2021

Full PGO: Extreme Results

AndyAyersMS commented Jun 10, 2021 • edited Loading

TechEmpower Results (mean RPS)

EgorBo commented Jun 11, 2021 • edited Loading

AndyAyersMS commented Jul 8, 2021

Updated TechEmpower Results (mean RPS)

AndyAyersMS commented Jul 12, 2021

AndyAyersMS commented Oct 19, 2020 •

edited

Loading

AndyAyersMS commented Jun 9, 2021 •

edited

Loading

AndyAyersMS commented Jun 10, 2021 •

edited

Loading

EgorBo commented Jun 11, 2021 •

edited

Loading