
Pgo phase3 #47558

Merged
merged 9 commits into dotnet:master on Feb 2, 2021

Conversation

davidwrighton
Member

  • Fix class type probes (the increment of the count wasn't correct)
  • Use the 64bit integer encoders for all the pgo data
  • New section for R2R PE Files containing instrumentation data
  • R2RDump functionality to display all the data embedded in the file
  • Enable BBOPT for optimized builds in a more unconditional fashion

Future PGO work will include

  • Move Pgo type handle histogram processing into JIT (which will make type guessing work in crossgen2 as well as in the runtime)
  • Triggers for controlling Pgo data extraction
  • Size control for pgo instrumentation data

With this checkin, the feature is functional from a crossgen2.exe point of view, but it's not polished, and it cannot be used from the Crossgen2 SDK integration (as the SDK does not have the ability to pass an extra pair of arguments to the compiler). That said, the following script can be used to demonstrate the feature.

The example expects that the following environment variables are set

CORE_ROOT = Points to the normal CORE_ROOT
DOTNET_PGO_PATH = Points to the location of the dotnet-pgo binary in the artifacts directory
__TestDotNetCmd = Points at the dotnet.cmd in the root of the enlistment

And that the application is called pgotest.csproj and is in the current directory. Assuming all of that is set up, the following script will compile the application, run it under instrumentation, feed the results back into crossgen2, and then run the application again, this time with the profile-guided optimizations applied.

setlocal

rem These environment variables cause code not compiled by default to be compiled with instrumentation enabled, and keep the instrumented code running for some time before rejit.
set COMPLUS_TieredPGO=1
set COMPLUS_TC_QuickJitForLoops=1
set COMPLUS_TC_CallCountThreshold=10000

rem comment this out to skip the pgo driven devirtualization optimization
set COMPLUS_JitClassProfiling=1

rem Uncomment the line below to generate instrumentation data for all methods, not just the non-R2R methods. Unfortunately, this will also impact dotnet-trace.
rem set COMPLUS_ZapDisable=1

dotnet build -p:Configuration=Release
dotnet-trace collect --providers Microsoft-Windows-DotNETRuntime:0x1E000080018:5 -- %CORE_ROOT%\corerun bin\Release\net5.0\pgotest.dll
%DOTNET_PGO_PATH%\dotnet-pgo.exe --trace-file trace.nettrace --output-file-name trace.mibc --uncompressed --pgo-file-type mibc

set COMPLUS_TieredPGO=
set COMPLUS_TC_QuickJitForLoops=
set COMPLUS_TC_CallCountThreshold=
set COMPLUS_JitClassProfiling=
set COMPLUS_ZapDisable=

rem Produce new output binary with crossgen2, passing the --mibc switch and --embed-pgo-data switch
md crossgenoutput
call %__TestDotNetCmd% %CORE_ROOT%\crossgen2\crossgen2.dll --map -O -o crossgenoutput\pgotest.dll -r %CORE_ROOT%\*.dll --mibc trace.mibc --embed-pgo-data bin\Release\net5.0\pgotest.dll

rem Run the application
call %CORE_ROOT%\corerun crossgenoutput\pgotest.dll

Loop implemented but still chasing bugs

Handle volatility issues

End to end type handle processing working

R2R dump support for pgo instrumentation data
- Fix type histogram processing to properly handle unknown type handles in the histogram
@@ -701,6 +701,22 @@ ReadyToRunInfo::ReadyToRunInfo(Module * pModule, PEImageLayout * pLayout, READYT
m_availableTypesHashtable = NativeHashtable(parser);
}

// For format version 4.1 and later, there is an optional table of instrumentation data
Member

Suggested change
// For format version 4.1 and later, there is an optional table of instrumentation data
// For format version 5.2 and later, there is an optional table of instrumentation data

Member

@AndyAyersMS AndyAyersMS left a comment

Looked over the jit interface and pgo manager changes.

This should address #13672, right?

definedSymbols: new ISymbolDefinitionNode[] { this });
}

public override int ClassCode => 1887299452;
Member

Is there some process by which these values are chosen?

Member Author

It is a random number; it simply needs to not match any of the other ClassCode values, and having a bunch of entropy in the number makes a few things faster. This is (poorly) documented where the base ClassCode property is defined.
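The role a high-entropy ClassCode plays can be sketched roughly like this (a minimal illustration with made-up types; crossgen2's real comparer and node hierarchy look different):

```cpp
#include <cassert>
#include <cstring>

// Sketch of why a random, high-entropy class code helps: nodes are ordered by
// comparing the class code first, so distinct codes let the comparison return
// immediately without falling back to a slower per-node comparison.
// These names are illustrative, not crossgen2's actual types.
struct Node
{
    int classCode;     // arbitrary, but must not collide with other node classes
    const char* name;  // stands in for node-specific comparison state
};

inline int CompareNodes(const Node& a, const Node& b)
{
    if (a.classCode != b.classCode)
        return a.classCode < b.classCode ? -1 : 1;  // fast path: codes differ
    return strcmp(a.name, b.name);                  // slow path: same node class
}
```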

#endif // DACCESS_COMPILE

HRESULT PgoManager::getPgoInstrumentationResultsInstance(MethodDesc* pMD, SArray<ICorJitInfo::PgoInstrumentationSchema>* pSchema, BYTE**pInstrumentationData)
HRESULT PgoManager::getPgoInstrumentationResultsInstance(MethodDesc* pMD, BYTE** pAllocatedData, ICorJitInfo::PgoInstrumentationSchema** ppSchema, UINT32 *pCountSchemaItems, BYTE**pInstrumentationData)
Member

Should this prefer "live" data over R2R data?

For instance, will R2R assemblies loaded under ZapDisable lookup embedded profile data for their methods even though they may have gone through Tier0?

Member Author

Umm, sure. I feel that ideally we would merge the data, but I don't really want to write that code right now, especially as it would be a duplication of the managed merger algorithm. Now that I think about it a bit: if we don't prefer in-memory data, we'll lose all the potential for tiered pgo improving over static pgo, so... yeah, I'll swap it around.
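The lookup order being settled on here can be sketched as follows (illustrative types and helper, not the runtime's actual API):

```cpp
#include <cassert>

// Stand-in for a resolved PGO schema blob; 'source' only exists for this sketch.
struct SchemaData { int source; };  // 1 = live tiered data, 2 = embedded R2R data

// Sketch of the preference order from the discussion above: prefer live
// (tiered) PGO data collected during this run, and fall back to profile data
// embedded in the R2R image. Either input may be null.
inline SchemaData* GetPgoResults(SchemaData* liveData, SchemaData* r2rData)
{
    if (liveData != nullptr)
        return liveData;  // in-memory data can improve on static pgo
    return r2rData;       // may also be null: no data at all for this method
}
```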

maxCount = h.m_histogram[m].m_count;
}
}

if (maxCount > 0)
UINT32 maxKnownLikelihood = (100 * maxKnownCount) / h.m_totalCount;
if ((maxKnownCount > 0) && ((maxKnownCount == maxCount) || (maxKnownLikelihood > 33)))
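For readers following along, the condition in the hunk above amounts to something like the following standalone sketch (made-up histogram types, not the runtime's):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical histogram entry; names are illustrative, not the runtime's types.
struct HistogramEntry { bool knownType; uint32_t count; };

// Sketch of the filter being reviewed: pick the most frequent known type, but
// only report it when it dominates the histogram outright or its likelihood
// across all samples exceeds 33%.
inline bool ShouldReportLikelyClass(const HistogramEntry* entries, size_t n,
                                    uint32_t totalCount)
{
    uint32_t maxCount = 0;       // largest count, known or unknown type
    uint32_t maxKnownCount = 0;  // largest count among known type handles
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].count > maxCount)
            maxCount = entries[i].count;
        if (entries[i].knownType && entries[i].count > maxKnownCount)
            maxKnownCount = entries[i].count;
    }
    uint32_t maxKnownLikelihood = (100 * maxKnownCount) / totalCount;
    return (maxKnownCount > 0) &&
           ((maxKnownCount == maxCount) || (maxKnownLikelihood > 33));
}
```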
Member

I don't think we should filter results here. The jit can decide if likelihood warrants testing things (currently we have different thresholds for virtual and interface calls, for instance).

Member Author

Ok. Will do.

@davidwrighton
Member Author

@AndyAyersMS This won't quite address #13672 as we haven't yet actually enabled this logic in our build systems, but yes, this is much of the technical work to get us there. Currently the system puts a bit too much of the pgo data into the file, causing too much size bloat to be something we can enable by default, and the dotnet-optimization repo isn't yet generating the mibc data, but we're getting there.

@davidwrighton
Member Author

@AndyAyersMS I just wanted to make sure you were aware of the change to enable the BBOPT mode unconditionally when optimizing. It seemed like a good enough idea to me, but it is a tad scary.

@AndyAyersMS
Member

AndyAyersMS commented Jan 29, 2021

Currently the system puts a bit too much of the pgo data into the file,

Can you say more here? #46882 (which is a few days out) plus not instrumenting single-block methods should reduce the number of count records by roughly a factor of two. The latter may be problematic if we plan to use PGO for partial prejitting or global importance, but otherwise single-block method counts are uninteresting.

the change to enable the BBOPT mode unconditionally

I think it's fine. The gating factor is the availability of PGO data, so for now this shouldn't have any impact (unless you are also making legacy IBC data available at runtime, but I didn't see any code that would do that).

We will probably want to introduce a jit or runtime mode to suppress BBOPT once PGO data becomes more commonplace.

@davidwrighton
Member Author

@AndyAyersMS well, the current implementation puts 100% of the data into the file. While the data is actually pretty well compressed, between the data and the bookkeeping to find the appropriate data there is a measurable increase in the size of the final file that is beyond the current desired level. (I don't have good numbers except for my really small test app, where the increase in size is comparable to the size of the generated code. If that holds for larger binaries, we're talking a 30% increase in file size, which is utterly unacceptable.) My goal is to get numbers out of our optimization tests, see how it affects corelib, etc., and see what is to be done.

I have a number of ideas on how to reduce the size impact once it becomes usefully measurable, but there are likely quite a few more tweaks to come. Some ideas include:

  • A post-processed version of the type histogram that is generated during the merge process. This will be optimized for the actual use case of running getLikelyClass. This will strictly be a size improvement with probably no reduction in meaningful carried data. This could either come in a fairly general purpose form, or we could encode the getLikelyClass behavior directly into the data stream.
  • Don't carry data for methods which have insignificant amounts of execution in the test traces
  • Consider dropping data for methods with only a single block. There are a great many of these, and I strongly suspect that the bookkeeping cost for carrying the data is out of proportion to the value of having instrumentation information. As you note, it may be a better idea to not even bother to collect data for single block methods.
  • Consider holding data in the R2R image at methoddef granularity instead of working with instantiated methods when the results are similar, or use some algorithm to compare the various instantiation data items and conclude they are similar enough.
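The first idea in the list above, collapsing a full type histogram into just what getLikelyClass consumes, could look roughly like this (all names and the record layout are illustrative, not the product's encoding):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compact record holding only what a getLikelyClass-style query needs:
// the dominant type and its likelihood. Illustrative sketch only.
struct LikelyClassRecord
{
    intptr_t typeHandle;  // 0 when no known type dominates
    uint8_t  likelihood;  // percentage of samples hitting typeHandle
};

struct HistogramSample { intptr_t typeHandle; uint32_t count; };

// Post-process a raw histogram into the compact record during merge,
// discarding everything but the most frequent type and its share.
inline LikelyClassRecord CompressHistogram(const HistogramSample* samples, size_t n)
{
    uint32_t total = 0, best = 0;
    intptr_t bestHandle = 0;
    for (size_t i = 0; i < n; i++)
    {
        total += samples[i].count;
        if (samples[i].count > best)
        {
            best = samples[i].count;
            bestHandle = samples[i].typeHandle;
        }
    }
    LikelyClassRecord rec = {0, 0};
    if (total > 0 && bestHandle != 0)
    {
        rec.typeHandle = bestHandle;
        rec.likelihood = (uint8_t)((100 * best) / total);
    }
    return rec;
}
```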

There isn't an official goal yet for exactly the size we're looking for, but my personal goal is that we shouldn't increase the size of our generated assemblies by more than 1% with this work.

@davidwrighton
Member Author

@dotnet/crossgen-contrib

@AndyAyersMS
Member

FYI @dotnet/jit-contrib for the "always set BBOPT when optimizing" bit of this change

@AndyAyersMS
Member

My goal is to get numbers out of our optimization tests, and see how it affects corelib, etc, and see what is to be done.

If you mean "perf numbers" be aware there are still a lot of rough edges in the jit.

... we shouldn't increase the size of our generated assemblies by more than 1%

That seems ambitious.

We should split out the impact of the extra profile payload from the impact of the codegen changes it will engender.

Ideally the jit would be smart about producing compact code for things that are unlikely hot paths, and so "buy room" for the profile payload and/or more ambitious codegen on hot paths, but I don't think we're there yet, save for a few special cases.

@davidwrighton
Member Author

@AndyAyersMS For numbers in this context, I'm more concerned about size numbers than wallclock-style perf. And my opinion on size is that I know we don't actually execute all that much of most of our binaries in the vast majority of applications, so most methods won't have any data at all. That will apply a substantial factor to reduce the size cost of carrying instrumentation data. As I said, I don't want to put down any strong statements about what we'll do until we have some form of numbers, but in the static pgo story (at least for our product-shipped images) we have a much smaller size budget to work with than I'd like.

@@ -236,6 +238,232 @@ private int GetSize()
}
}

public class PgoInfoKey : IEquatable<PgoInfoKey>
Contributor

Move these classes to their own file?

_pgoInfo = _readyToRunReader.GetPgoInfoByKey(PgoInfoKey.FromReadyToRunMethod(this));
if (_pgoInfo == null)
{
_pgoInfo = new object();
Contributor

Why do we init to new object() if we have no pgoInfo for a key?

Member Author

Mmm... my thought was to avoid a bunch of manual null checks, but I'll have to take a look and see if I was thinking straight at the time.
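The pattern being described, caching a sentinel meaning "looked up, nothing found" instead of null, sketches out like this (hypothetical names; the actual code under review is R2RDump's C#):

```cpp
#include <cassert>

struct PgoData { int schemaCount; };

// Dedicated sentinel object meaning "lookup already performed, nothing found".
static PgoData s_noPgoData;

struct MethodEntry
{
    PgoData* cached = nullptr;  // nullptr means "not looked up yet"

    // Look up at most once; cache a sentinel on a miss so callers can
    // distinguish "never searched" from "searched and absent" without
    // repeating the lookup.
    PgoData* GetPgoData()
    {
        if (cached == nullptr)
        {
            cached = LookupPgoData();
            if (cached == nullptr)
                cached = &s_noPgoData;  // cache the miss
        }
        return cached == &s_noPgoData ? nullptr : cached;
    }

    // Stand-in for the real reader lookup; always misses in this sketch.
    PgoData* LookupPgoData() { return nullptr; }
};
```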

Contributor

@nattress nattress left a comment

LGTM thanks David!

@davidwrighton davidwrighton merged commit afa50f9 into dotnet:master Feb 2, 2021
@AndyAyersMS AndyAyersMS mentioned this pull request Feb 4, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 4, 2021
@davidwrighton davidwrighton deleted the pgo_phase3 branch April 20, 2021 17:45