Multicorejit unification #48326

Merged: 7 commits, May 14, 2021

Member

@gbalykov gbalykov commented Feb 16, 2021

This PR unifies the handling of non-generic/shared-generic methods and generic methods in multicorejit, which makes it possible to preserve the original JIT order. It is required as part of the #45748 change (specifically, #45748 (comment)), whose goal is to enable background type preloading using multicorejit.

Generic methods are stored in the profile using their binary signature, and non-generic methods are stored using method tokens. However, the record size for non-generic methods has doubled (it is now 8 bytes), which made room for additional data. As a result, there is no longer a limit on the method token (previously it was limited by METHODINDEX_MASK). The limit on the module index has also increased from 512 (MAX_MODULES) to 4096. Additional info, such as JIT_BY_APP_THREAD_TAG or NO_JIT_TAG, can be stored for each method (NO_JIT_TAG will be added in further changes).

Additionally, NDirect methods are removed because their current implementation leads to asserts and exceptions. Also, a few checks on the number of CPUs are now configurable with environment variables, which makes it possible to use multicorejit on arm with CPU hotplug.

cc @alpencolt

@gbalykov gbalykov force-pushed the multicorejit-unification branch 4 times, most recently from 9d4d725 to 956aa05 Compare February 16, 2021 10:30
Base automatically changed from master to main March 1, 2021 09:07
@mangod9
Member

mangod9 commented Mar 29, 2021

@gbalykov I assume this is ready to merge? @noahfalk is this something you would be able to review?

@noahfalk
Member

It may take me a couple days, but yes I'll take a look : ) Also @kouvel might be a good person to give it a look. Tiered compilation has some overlap with multicore JIT.

@noahfalk
Member

noahfalk commented Apr 1, 2021

Spent some time tonight learning about this code. Not done yet but haven't forgotten : )

@gbalykov
Member Author

gbalykov commented Apr 1, 2021

@mangod9 yes, this is ready for review and merge

Member

@noahfalk noahfalk left a comment

This looks good to me. Most of the comments are stylistic suggestions, but I do think removing the requirement for multiple processors should be evaluated by others.

@mangod9 - Ultimately, if a bug was uncovered I wouldn't be the dev who owned fixing it. Whoever that dev would be, it's probably good if they get a chance to look at this before it is merged : )


// 4. Maximum number of modules supported is MAX_MODULES
// 5. Maximum number of methods supported is MAX_METHODS
// 5. Simple module name stored
Member

Nit:

Suggested change
- // 5. Simple module name stored
+ // 6. Simple module name stored

m_ModuleList[moduleIndex].methodCount ++;
DWORD dwLength;
BYTE * pBlob = (BYTE*)sigBuilder.GetSignature(&dwLength);
_ASSERTE(dwLength < MAX_SIGNATURE_LENGTH);
Member

What enforces that this length assertion will be true?

Member

It was an assertion before, but the length limit was higher. I don't know how long signatures can get; this should probably be a real check that skips the method in that case.
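A minimal sketch of what such a "real check" could look like, as opposed to a debug-only assertion. `SignatureRecorder`, `TryRecord`, and `kMaxSignatureLength` are hypothetical names introduced for illustration; they are not the recorder's actual types, and the limit value is only assumed to match the 2^16 figure discussed in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed limit, matching the 2^16 value discussed in this PR.
constexpr std::size_t kMaxSignatureLength = 0x10000;

// Hypothetical stand-in for the recorder's list of generic method records.
struct SignatureRecorder
{
    std::vector<std::vector<std::uint8_t>> recorded;
    unsigned skipped = 0;

    // Returns true if the signature was recorded, false if it was skipped.
    // Unlike an assertion, this check also runs in release builds.
    bool TryRecord(const std::uint8_t* blob, std::size_t length)
    {
        if (blob == nullptr || length >= kMaxSignatureLength)
        {
            ++skipped; // too long to encode: skip the method instead of asserting
            return false;
        }
        recorded.emplace_back(blob, blob + length);
        return true;
    }
};
```

With this shape, an oversized signature costs one missed method in the profile rather than a corrupted record or a debug-build assert.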

DWORD sigSize = data2 & (MAX_SIGNATURE_LENGTH - 1);
DWORD dataSize = sigSize + sizeof(DWORD) * 2;
DWORD dwSize = ((DWORD)(dataSize + sizeof(DWORD) - 1) / sizeof(DWORD)) * sizeof(DWORD);
_ASSERTE(dwSize < MAX_SIGNATURE_LENGTH);
Member

Earlier the code enforced that sigSize < MAX_SIGNATURE_LENGTH, but does anything enforce that AlignUp(sigSize + 8, 4) < MAX_SIGNATURE_LENGTH?
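The arithmetic behind this question can be checked directly. The sketch below assumes MAX_SIGNATURE_LENGTH is 0x10000 (as in this revision) and uses a local `AlignUpTo` helper with the same rounding as the expression in the diff; it is not the runtime's own AlignUp.

```cpp
#include <cstdint>

constexpr std::uint32_t kMaxSignatureLength = 0x10000; // value used in this revision
constexpr std::uint32_t kDwordSize = 4;

// Local helper with the same rounding as the expression in the diff.
constexpr std::uint32_t AlignUpTo(std::uint32_t value, std::uint32_t alignment)
{
    return ((value + alignment - 1) / alignment) * alignment;
}

// Record size: two DWORDs of header plus the signature, rounded up to a DWORD.
constexpr std::uint32_t RecordSize(std::uint32_t sigSize)
{
    return AlignUpTo(sigSize + 2 * kDwordSize, kDwordSize);
}

static_assert(RecordSize(0xFFF4) == 0xFFFC, "largest signature whose record stays under the limit");
static_assert(RecordSize(0xFFF5) == 0x10000, "record size reaches the limit");
static_assert(RecordSize(0xFFFF) == 0x10008, "record size exceeds the limit");
```

So a signature can satisfy sigSize < MAX_SIGNATURE_LENGTH while the aligned record size does not, for any sigSize from 0xFFF5 through 0xFFFF.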

}
DWORD sigSize = data2 & (MAX_SIGNATURE_LENGTH - 1);
DWORD dataSize = sigSize + sizeof(DWORD) * 2;
DWORD dwSize = ((DWORD)(dataSize + sizeof(DWORD) - 1) / sizeof(DWORD)) * sizeof(DWORD);
Member

Nit:

Suggested change
- DWORD dwSize = ((DWORD)(dataSize + sizeof(DWORD) - 1) / sizeof(DWORD)) * sizeof(DWORD);
+ DWORD dwSize = AlignUp(dataSize, sizeof(DWORD));

@@ -1149,12 +1139,9 @@ void MulticoreJitManager::SetProfileRoot(const WCHAR * pProfilePath)

#endif

if (g_SystemInfo.dwNumberOfProcessors >= 2)
Member

Removing this should get some scrutiny around performance. Even if removing it makes some scenarios better, probably there was a reason this was here. I don't think I'd be the right person to assess that part of the change though.
cc @mangod9 @kouvel @brianrob

Member

Agreed @noahfalk. @gbalykov, can you share more details about this change? My expectation is that if there is only one processor, then playing back a Multi-Core JIT profile is likely to impact start-up negatively.

Member Author

On arm devices, CPUs might be enabled dynamically (CPU hotplug). Thus, during coreclr_initialize there might be only one enabled CPU, yet more will come online when there is enough work to do.

Member

Thanks @gbalykov. My first thought is that if we're going to remove this check, we're going to want to limit the removal to Linux arm builds. It looks like per https://community.arm.com/developer/tools-software/oss-platforms/w/docs/529/cpuidle-hotplug, hotplug is only available on Linux.

My next thought is that I'm wanting to understand more about the hotplug behavior. Multi-Core JIT looks at the CPU count once when it gets invoked, and that's it. It sounds like we might need to modify Multi-Core JIT to be hotplug aware (if that's possible) so that we don't needlessly do aggressive assembly loading and jitting on the critical path if we want to support the hotplug scenario on arm. Is it possible to be hotplug aware?

Member

Another option may be to make it configurable for now via clrconfigvalues.h and keep the default as it is currently. On CPU-limited environments where there is actually only one processor available, the background thread could otherwise steal a fair bit of CPU time from foreground work.

Member Author

> Another option may be to make it configurable for now via clrconfigvalues.h

I've added this solution instead of removing the check. By default, 2 CPUs will still be required.
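A rough sketch of the configurable threshold described in this thread, under stated assumptions: the environment variable name `EXAMPLE_MCJ_MIN_CPUS` and all function names are made up for illustration, and the real change goes through clrconfigvalues.h rather than reading the environment directly.

```cpp
#include <cstdlib>

// Default matches the behavior described in the thread: at least 2 processors.
constexpr unsigned kDefaultMinCpus = 2;

// Parses an override value; falls back to the default on missing or invalid input.
inline unsigned ParseMinCpus(const char* text)
{
    if (text == nullptr || *text == '\0')
        return kDefaultMinCpus;
    char* end = nullptr;
    unsigned long value = std::strtoul(text, &end, 10);
    if (*end != '\0' || value == 0 || value > 1024)
        return kDefaultMinCpus;
    return static_cast<unsigned>(value);
}

// "EXAMPLE_MCJ_MIN_CPUS" is a hypothetical variable name used only in this sketch.
inline unsigned ReadMinCpusFromEnvironment()
{
    return ParseMinCpus(std::getenv("EXAMPLE_MCJ_MIN_CPUS"));
}

// Replaces the hard-coded "processors >= 2" test with a configurable threshold.
inline bool ShouldEnableMulticoreJit(unsigned processorCount, unsigned minCpus)
{
    return processorCount >= minCpus;
}
```

On a hotplug system that reports one CPU at startup, setting the threshold to 1 keeps multicore JIT enabled, while the default still protects genuinely single-CPU environments.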

ptr = nullptr;
}

static bool isMethod(unsigned data)
Member

Nit: The rest of this code uses PascalCasing for method names so I would align to that style

Suggested change
- static bool isMethod(unsigned data)
+ static bool IsMethod(unsigned data)

DWORD currentModuleBlockStart = GetTickCount();

// Only allow module blocking to occur a certain number of times.
MulticoreJitTrace(("ModuleRecord(%u) start module load",
Member

Suggested change
- MulticoreJitTrace(("ModuleRecord(%u) start module load",
+ MulticoreJitTrace(("ModuleDependency(%u) start module load",

}
}

return true;
MulticoreJitTrace(("ModuleRecord(%d) end module load, hr=%x",
Member

Suggested change
- MulticoreJitTrace(("ModuleRecord(%d) end module load, hr=%x",
+ MulticoreJitTrace(("ModuleDependency(%d) end module load, hr=%x",

// Find all subsequent methods and jit/load them reversed if reversed order is required
bool reversedOrder = true;

if (reversedOrder)
Member

Nit: if this is never false we may as well eliminate the if statement

Member Author

This actually would be part of the next PR, that's why initially I left it here even if it was dead code. Removed from this PR

m_stats.m_nWalkBack += (short) count;
m_stats.m_nFilteredMethods += (short) (i + 1);

rcdLen = nSize - curSize;
}
else
{
Member

This block is dead code I think?

@kouvel
Member

kouvel commented Apr 12, 2021

I'll take a look as well

Member

@kouvel kouvel left a comment

Generally looks good to me, some comments below. Thanks!

const int MULTICOREJITLIFE = 60 * 1000; // 60 seconds

const int MULTICOREJITBLOCKLIMIT = 10 * 1000; // 10 seconds
const unsigned MAX_MODULES = 0x10000; // maximum allowed number of modules (2^16 values)
Member

Is it necessary to support this many modules? A large array of this many elements is allocated so if it's not necessary maybe something like 0x400 or 0x1000 would be reasonable. It looks like MAX_MODULES - 1 is also used as a mask, could update to use a separate constant MODULE_MASK that is specific to the packing format.

Member Author

I agree that 0x1000 would be enough


const unsigned METHODINDEX_MASK = 0x0FFFFF; // 20-bit method index
const unsigned SIGNATURE_LENGTH_OFFSET = 16; // offset of signature length
const unsigned MAX_SIGNATURE_LENGTH = 0x10000; // maximum allowed signature length (2^16 values)
Member

In the usage it looks like the max allowed signature length is actually 0xffff. To make it more clear I suggest changing the value to 0xffff (and usage to match), and maybe rename to SIGNATURE_LENGTH_MASK since it is also used as a mask.


// These should be powers of 2 in order to implement fast check for rec type
Member

It should be fairly fast to extract the record type with a shift and to check the ID, so it doesn't seem to be worth making these powers of 2 just for that purpose (though I could have missed something).

Member Author

Yes, what I wanted was a simple bit-field check in RecorderInfo::IsMethod instead of a shift and an equality comparison. I agree that this is not critical at all, fixed
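For illustration, the shift-and-compare form discussed here might look like the sketch below. The bit offset and the record ID values are placeholders chosen for the example, not the runtime's actual constants; the point is only that the IDs no longer need to be powers of 2.

```cpp
#include <cstdint>

// Hypothetical layout: record type ID in the top 8 bits of the first DWORD.
constexpr unsigned kRecordTypeOffset = 24;

// Illustrative IDs only; note that they need not be powers of 2.
constexpr std::uint32_t kModuleRecordId = 2;
constexpr std::uint32_t kMethodRecordId = 4;
constexpr std::uint32_t kGenericMethodRecordId = 5;

// Extracts the record type with a single shift.
constexpr std::uint32_t RecordType(std::uint32_t data1)
{
    return data1 >> kRecordTypeOffset;
}

// Shift plus equality comparison, instead of testing a dedicated bit.
constexpr bool IsMethod(std::uint32_t data1)
{
    return RecordType(data1) == kMethodRecordId;
}
```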

BYTE * genericSignature;
unsigned data1;
unsigned data2;
BYTE * ptr;
Member

I found the uses of the fields data1, data2, and ptr difficult to read and understand (the uses are spread out too much). I suggest encapsulating the data fields by making them private and using constructors and instance methods to do the packing and unpacking. That would make usage clearer, and knowledge of the data formats would hopefully be limited to methods in the class. Using a union of structs as @noahfalk suggested would also be helpful.
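One possible shape for that encapsulation is sketched below. `PackedMethodInfo`, its masks, and the bit layout are assumptions made for the example; the actual RecorderInfo layout in the PR may differ.

```cpp
#include <cstdint>

// Hypothetical packing for illustration; the real field layout may differ.
class PackedMethodInfo
{
public:
    static constexpr std::uint32_t kModuleMask = 0xFFF;           // 12 bits: up to 4096 modules
    static constexpr std::uint32_t kFlagsMask = 0xFF0000;         // illustrative flag bits
    static constexpr std::uint32_t kSignatureLengthMask = 0xFFFF; // 16-bit signature length

    // Packing happens in exactly one place.
    PackedMethodInfo(std::uint32_t moduleIndex, std::uint32_t flags, std::uint32_t signatureLength)
        : m_data1((moduleIndex & kModuleMask) | (flags & kFlagsMask)),
          m_data2(signatureLength & kSignatureLengthMask)
    {
    }

    // Unpacking is hidden behind named accessors.
    std::uint32_t ModuleIndex() const { return m_data1 & kModuleMask; }
    std::uint32_t Flags() const { return m_data1 & kFlagsMask; }
    std::uint32_t SignatureLength() const { return m_data2 & kSignatureLengthMask; }

private:
    std::uint32_t m_data1; // callers never see the raw words
    std::uint32_t m_data2;
};
```

Callers then write `info.ModuleIndex()` rather than repeating mask expressions at each use site, so a change to the packing format touches only this class.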

@@ -359,12 +407,9 @@ class MulticoreJitRecorder
unsigned m_ModuleCount;
unsigned m_ModuleDepCount;

unsigned m_JitInfoArray[MAX_METHOD_ARRAY];
RecorderInfo * m_JitInfoArray;
Member

m_ModuleList could be allocated separately too for consistency, since that could also be a modestly sized array. My preference would be to keep that and m_JitInfoArray as inline arrays as before (MulticoreJitRecorder is new'ed anyway, and the behavior in some error conditions would be more consistent with before), though perhaps the inline arrays could be moved to be the last fields.

}
while (isMethod && count < MAX_WALKBACK);

_ASSERTE(count > 0);
Member

Looks like this could fail if count happened to be 0 and hr == COR_E_BADIMAGEFORMAT from above. Probably need a SUCCEEDED(hr) here and in the if expression below.

unsigned curflags = curdata1 & METHOD_FLAGS_MASK;

unsigned curdata2 = * (((const unsigned *) pCurBuf) + 1);
unsigned cursignatureLength = curdata2 & MAX_SIGNATURE_LENGTH;
Member

I think this should either be & (MAX_SIGNATURE_LENGTH - 1) (if its value is 0x10000), or & MAX_SIGNATURE_LENGTH (if its value is changed to 0xffff as I suggested in another comment). In any case it appears the length is not used, so it could just be removed.
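The difference between the two masks is easy to see with a concrete value. This sketch assumes MAX_SIGNATURE_LENGTH is 0x10000, as in the reviewed revision; the function names are illustrative.

```cpp
#include <cstdint>

constexpr std::uint32_t kMaxSignatureLength = 0x10000; // 2^16, as in the reviewed revision

// Masking with the power of two itself keeps only bit 16.
constexpr std::uint32_t WrongLength(std::uint32_t data2)
{
    return data2 & kMaxSignatureLength;
}

// Masking with (2^16 - 1) keeps the low 16 bits, i.e. the stored length.
constexpr std::uint32_t CorrectLength(std::uint32_t data2)
{
    return data2 & (kMaxSignatureLength - 1);
}

static_assert(WrongLength(0x002A) == 0, "a 42-byte signature reads back as length 0");
static_assert(CorrectLength(0x002A) == 0x2A, "low 16 bits preserved");
```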

Member Author

Thanks for noticing! I've fixed this typo, but forgot to update it here

hr = COR_E_BADIMAGEFORMAT;
}
unsigned moduleIndex = data1 & (MAX_MODULES - 1);
unsigned flags = data1 & METHOD_FLAGS_MASK;
Member

Looks like this is not used and could be removed

Member Author

This will be used in the next PR. Removed from this PR

unsigned moduleIndex = data1 & (MAX_MODULES - 1);
unsigned flags = data1 & METHOD_FLAGS_MASK;

unsigned signatureLength = data2 & MAX_SIGNATURE_LENGTH;
Member

Same comment as above regarding & MAX_SIGNATURE_LENGTH or & (MAX_SIGNATURE_LENGTH - 1), could be removed

@kouvel kouvel added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) and removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) labels Apr 19, 2021
@gbalykov
Member Author

gbalykov commented May 4, 2021

@noahfalk @kouvel Thanks for the reviews! I've updated the PR according to your notes.

I've also measured the mcj profile size in different modes, and here are the results. In all cases there was no difference in performance between the modes. With the current version of the PR, the profile size increases by just 26-31%, instead of the roughly 100% increase in the initial version of this PR.

1. x64 debug, console helloworld, no ni.dll:

| Mode | Profile size, bytes |
| --- | --- |
| before this PR | 5872 |
| initial PR | 11292 |
| initial PR with removed overall record size (597c7c9) | 9820 |
| initial PR with removed overall record size and tokens for non-generic methods (b5cb5ae) | 7412 |

2. tizen armel release, console helloworld, no ni.dll:

| Mode | Profile size, bytes |
| --- | --- |
| before this PR | 3944 |
| initial PR | 7716 |
| initial PR with removed overall record size (597c7c9) | 6720 |
| initial PR with removed overall record size and tokens for non-generic methods (b5cb5ae) | 5060 |

3. tizen armel release, crossgen2 for crossgen2.dll, no ni.dll:

./corerun ./crossgen2/crossgen2.dll -r:`pwd`/*.dll -r:`pwd`/crossgen2/*.dll -O --parallelism=1 -o:/tmp/1.ni.dll `pwd`/crossgen2/crossgen2.dll
| Mode | Profile size, bytes |
| --- | --- |
| before this PR | 54648 |
| initial PR | 113028 |
| initial PR with removed overall record size (597c7c9) | 95128 |
| initial PR with removed overall record size and tokens for non-generic methods (b5cb5ae) | 71648 |
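The 26-31% figure follows from the measurements above. A small helper (the name `GrowthPercent` is introduced here only for the calculation) reproduces it:

```cpp
#include <cmath>

// Relative growth of the profile in percent, rounded to the nearest integer.
inline long GrowthPercent(long before, long after)
{
    return std::lround(100.0 * static_cast<double>(after - before) / static_cast<double>(before));
}
```

Applied to the final version (b5cb5ae), this gives 26% for x64 helloworld (5872 to 7412), 28% for armel helloworld (3944 to 5060), and 31% for crossgen2 (54648 to 71648). For the initial version of the PR, the same calculation gives 92%, 96%, and 107% respectively.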

Current multicorejit implementation in master has multiple flaws with NDirect methods:
- exception might be thrown inside GetStubForInteropMethod at some point for NDirect method, which will kill background thread, thus, reducing effectiveness of multicorejit (for example, occurs when multicorejit is used with crossgen2)
- some NDirect methods can lead to asserts during load inside GetStubForInteropMethod (for example, EvpMdCtxDestroy (0x6000044 token) from System.Security.Cryptography.Algorithms.dll)
…igure minimum allowed number of cpus for MultiCoreJit.

On arm with cpu hotplug it should be set to 1.
Member

@kouvel kouvel left a comment

Mostly looks good to me, just one main comment and a couple of minor things. Thanks for sharing the data, that looks good to me as well.

{
for (LONG i = 0 ; i < m_GenericInfoCount; i++)
SigBuilder sigBuilder;
Member

This could be moved into the if block for generic methods

Member Author

Fixed

@@ -386,21 +386,25 @@ HRESULT MulticoreJitRecorder::WriteOutput(IStream * pStream)

HRESULT hr = S_OK;

// Preprocessing Generic Methods
// Preprocessing Methods
Member

The additional comments below are no longer valid, could be removed

Member Author

Fixed

}
else if (rcdTyp == MULTICOREJIT_GENERICMETHOD_RECORD_ID)
{
unsigned signatureLength = * (const unsigned short *) (((const unsigned *) pBuffer) + 1);
Member

It seems to be the intention in this part to check if there is enough space in the remaining buffer before decoding. From the loop condition we know that there is a DWORD worth of space but this is reading past that, so should check the remaining buffer length before dereferencing.
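A hedged sketch of the kind of check being asked for: verify that the remaining buffer covers the second DWORD before reading the signature length from it. `TryReadSignatureLength` is a hypothetical helper, and the assumption that the length occupies the first two bytes of the second DWORD is taken from the quoted line only.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads the 16-bit signature length stored at the start of the second DWORD.
// Returns false without reading if the buffer is too short to hold two DWORDs.
inline bool TryReadSignatureLength(const std::uint8_t* buffer, std::size_t remaining,
                                   std::uint16_t* length)
{
    constexpr std::size_t kHeaderSize = 2 * sizeof(std::uint32_t);
    if (buffer == nullptr || length == nullptr || remaining < kHeaderSize)
        return false;
    // memcpy avoids unaligned access and strict-aliasing problems.
    std::memcpy(length, buffer + sizeof(std::uint32_t), sizeof(*length));
    return true;
}
```

A caller that gets `false` can treat the profile as malformed (for example, by reporting a bad-format error) instead of reading past the end of the buffer.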

Member Author

Fixed

Member

@kouvel kouvel left a comment

LGTM, thanks!

@kouvel
Member

kouvel commented May 14, 2021

Failure looks unrelated, #52527

@kouvel kouvel merged commit 886a02c into dotnet:main May 14, 2021
@gbalykov
Member Author

Thanks!

@mangod9
Member

mangod9 commented May 14, 2021

Thanks for the contribution @gbalykov and @kouvel / @noahfalk for reviewing.

@karelz karelz added this to the 6.0.0 milestone May 20, 2021
kouvel added a commit to kouvel/runtime that referenced this pull request Jun 2, 2021
- When the recorder times out it doesn't actually stop profiling, but writes out the profile
- The app may later stop profiling, and then it tries to write the profile again
- PR dotnet#48326 fairly expected that the profile is only written once (some state is mutated)
- The non-timeout stop-profile path was also not stopping the timer
- Fix for dotnet#53014 in main
kouvel added a commit to kouvel/runtime that referenced this pull request Jun 2, 2021
- Port of dotnet#53573 to Preview 5
- When the recorder times out it doesn't actually stop profiling, but writes out the profile
- The app may later stop profiling, and then it tries to write the profile again
- PR dotnet#48326 fairly expected that the profile is only written once (some state is mutated)
- The non-timeout stop-profile path was also not stopping the timer
- Fixes dotnet#53014
kouvel added a commit that referenced this pull request Jun 2, 2021
* Fix assertion failure / crash in multi-core JIT

- When the recorder times out it doesn't actually stop profiling, but writes out the profile
- The app may later stop profiling, and then it tries to write the profile again
- PR #48326 fairly expected that the profile is only written once (some state is mutated)
- The non-timeout stop-profile path was also not stopping the timer
- Fix for #53014 in main
kouvel pushed a commit that referenced this pull request Jun 4, 2021
* Add background type preloading based on multicorejit

This is a second part of #48326 change, which enables handling of methods loaded from r2r images. Background thread of multicorejit now not only jits methods but also loads methods from R2R images. This allows to load types in background thread.

This is required as part of #45748 change (specifically, #45748 (comment)), goal of which is to enable background type preloading using multicorejit.
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2021