-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More efficient executable allocator #83632
More efficient executable allocator #83632
Conversation
… RW mapping created, as it is never accessed again, and caching it is evicting useful data from the cache
…able pages that aren't cached ever
…tHeaps - Should reduce the amount of contention on the ExecutableAllocator cache - Will improve the performance of identifying what type of stub is in use by avoiding the RangeList structure - Note, this will only apply to stubs which are used in somewhat larger applications
Wrote more code... testing more stuff. |
- Notably, arm32 is now only using 4K pages as before, as it can't generate the proper immediate as needed.
…f which stubs to use
src/coreclr/inc/loaderheap.h
Outdated
#if defined(TARGET_ARM64) && defined(TARGET_UNIX) | ||
return max(16384, GetOsPageSize()); | ||
#elif defined(TARGET_ARM) | ||
return 4096; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment on why ARM is special here?
I also think it is typically more effective to use a multiply expression for amount. For example 4 * 1024
and 16 * 1024
. The 4096
is rather common so fine, but 16384
isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARM is special as the instruction set does not let you have a simple immediate offset that is 16KB in length. This makes it impractical to use a 16KB offset as it causes measurable perf problems to add the extra instructions necessary to handle a 16KB offset.
@@ -4,6 +4,17 @@ | |||
#include "pedecoder.h" | |||
#include "executableallocator.h" | |||
|
|||
#ifdef ENABLE_MAPRW_STATISTICS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this can overflow. I would change them to unsigned
at the very least or perhaps uint64_t
which should always be sufficient.
if (envString != NULL) | ||
{ | ||
int customCacheSize = atoi(envString); | ||
if(customCacheSize != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if(customCacheSize != 0) | |
if (customCacheSize != 0) |
if (index == 0) | ||
return; | ||
|
||
BlockRW*& cachedMapping = m_cachedMapping[index - 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think having a reference in this way is clearer. Ideally we would have an iterator, but we don't have that right now. I think it would be more appropriate to remove the reference and update line 338 to be m_cachedMapping[index - 1] = NULL
.
{ | ||
for (size_t index = 0; index < EXECUTABLE_ALLOCATOR_CACHE_SIZE; index++) | ||
{ | ||
BlockRW*& cachedMapping = m_cachedMapping[index]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BlockRW*& cachedMapping = m_cachedMapping[index]; | |
BlockRW* cachedMapping = m_cachedMapping[index]; |
@@ -218,10 +235,44 @@ ExecutableAllocator::~ExecutableAllocator() | |||
} | |||
} | |||
|
|||
#ifdef ENABLE_MAPRW_STATISTICS | |||
void DumpMapRWStatistics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be folded into LOG_EXECUTABLE_ALLOCATOR_STATISTICS
? I'm not sure there is need for yet another logging output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I was just lazy and didn't need those.. but it should probably be joined up with them... especially as the scheme I wrote doesn't really work in all environments.
@@ -1648,7 +1649,7 @@ void UnlockedLoaderHeap::UnlockedBackoutMem(void *pMem, | |||
if (IsInterleaved()) | |||
{ | |||
// Clear the RW page | |||
memset((BYTE*)pMem + GetOsPageSize(), 0x00, dwSize); // Fill freed region with 0 | |||
memset((BYTE*)pMem + GetStubCodePageSize(), 0x00, dwSize); // Fill freed region with 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memset((BYTE*)pMem + GetStubCodePageSize(), 0x00, dwSize); // Fill freed region with 0 | |
memset((BYTE*)pMem + GetStubCodePageSize(), 0, dwSize); // Fill freed region with 0 |
@@ -5,7 +5,9 @@ | |||
#include "asmconstants.h" | |||
#include "asmmacros.h" | |||
|
|||
#define DATA_SLOT(stub, field) (stub##Code + PAGE_SIZE + stub##Data__##field) | |||
#define STUB_PAGE_SIZE 0x4000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define STUB_PAGE_SIZE 0x4000 | |
#define STUB_PAGE_SIZE 16384 |
I think we should be consistent. Please tell me we can use decimal here. If we can't I am going to be very sad.
@@ -24,7 +24,7 @@ LEAF_END_MARKED FixupPrecodeCode | |||
// NOTE: For LoongArch64 `CallCountingStubData__RemainingCallCountCell` must be zero !!! | |||
// Because the stub-identifying token is $t1 within the `OnCallCountThresholdReachedStub`. | |||
LEAF_ENTRY CallCountingStubCode | |||
pcaddi $t2, 0x1000 | |||
pcaddi $t2, 0x4000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a constant here? Searching for PAGE_SIZE
would help immensely for when to update things. Embedding this in here makes for troublesome updates and breaking people accidentally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this isn't something we are building, then adding a comment at the top stating the implied PAGE_SIZE
value is in the below code as 0x4000
or 16384
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't build this code, and frankly its the next best thing to impossible to predict if any given assembler will accept input that makes sense. I think this is likely to work, but it is a best guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AaronRobinsonMSFT @davidwrighton
This is not right for LA64.
I will revert the modification for LA64. #84960
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For LA64, the OS-PageSize is 16k, the pcaddi $t2, 0x1000
is right.
But the pcaddi $t2, 0x4000
is 64K offset.
DWORD cPagesPerHeap = cWastedPages / 2; | ||
DWORD cPagesRemainder = cWastedPages % 2; // We'll throw this at the cache entry heap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some clarifying comments on the value 2
would be helpful. Is it possible we can compute this based on the GetOsPageSize()
or is this simply a value we use for an optimization? We converted from 4k to 16k, factor of 4, but we changed this by a factor 3 so that is where I am coming from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about going from having 6 LoaderHeap structures as part of a VirtualCallStubManager to only having 2.
@@ -5,7 +5,8 @@ | |||
#include "unixasmmacros.inc" | |||
#include "asmconstants.h" | |||
|
|||
PAGE_SIZE = 4096 | |||
; PAGE_SIZE must match the behavior of GetStubCodePageSize() on this architecture/os | |||
PAGE_SIZE = 16384 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please name this STUB_PAGE_SIZE like in the thunktemplates.asm?
@@ -297,11 +297,11 @@ void (*CallCountingStub::CallCountingStubCode)(); | |||
void CallCountingStub::StaticInitialize() | |||
{ | |||
#if defined(TARGET_ARM64) && defined(TARGET_UNIX) | |||
int pageSize = GetOsPageSize(); | |||
int pageSize = GetStubCodePageSize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it would be good to modify the ENUM_PAGE_SIZES too to not to enumerate sizes smaller than the 16kB when they would never be used. And if such change was made, you'd also want to modify the thunktemplates.S for arm64 to generate less variants here:
.irp PAGE_SIZE, 4096, 8192, 16384, 32768, 65536 |
@@ -5,7 +5,7 @@ | |||
#include "unixasmmacros.inc" | |||
#include "asmconstants.h" | |||
|
|||
PAGE_SIZE = 4096 | |||
PAGE_SIZE = 16384 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've mentioned in my other comment, I believe we should keep all 32 bit architectures on the 4096 page size for the limited VM space reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that x86 doesn't benefit from keeping the 4096 page size, as it only runs on Windows which has 64KB allocation granularity for memory reservation in the ExecutableAllocator in any case, thus resulting in no change to the available VM space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we also have Linux x86 (not official, but used by the Samsung folks). As for windows x86, I was somehow under the impression that it was not using the 64kB granularity, but I can see I was wrong. However, larger page means potentially wasting more physical memory (if you allocate 16kB page and end up allocating just a fraction of it). Not sure if it is worth it, but I'll leave it up to you.
@@ -4,6 +4,13 @@ | |||
#include "pedecoder.h" | |||
#include "executableallocator.h" | |||
|
|||
#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this empty ifdef?
#endif | ||
|
||
#ifdef VARIABLE_SIZED_CACHEDMAPPING_SIZE | ||
static int ExecutableAllocator_CachedMappingSize = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please make this a static member of the ExecutableAllocator instead of this name prefixing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidwrighton it seems that you have added the ExecutableAllocator::g_cachedMappingSize, but you still kept the ExecutableAllocator_CachedMappingSize, it seems to me that the latter should be replaced everywhere by the former.
@@ -91,6 +108,12 @@ void ExecutableAllocator::DumpHolderUsage() | |||
fprintf(stderr, "Reserve count: %lld\n", g_reserveCount); | |||
fprintf(stderr, "Release count: %lld\n", g_releaseCount); | |||
|
|||
printf("g_MapRW_Calls: %lld\n", g_MapRW_Calls); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please change these to fprintf(stderr,
like the rest of the logging uses?
…ruction signal will go away in our testing
@@ -127,8 +127,12 @@ class ExecutableAllocator | |||
// If variable sized mappings enabled, make the cache physically big enough to cover all interesting sizes | |||
static int g_cachedMappingSize; | |||
BlockRW* m_cachedMapping[16] = { 0 }; | |||
#else | |||
#if defined(HOST_OSX) && defined(HOST_AMD64) | |||
BlockRW* m_cachedMapping[1] = { 0 }; // OSX Amd64 doesn't behave correctly with more than one cached mapping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
David, I can see some coreclr tests in the CI failing on x64 macOS with illegal instruction, so maybe the failure you are seeing is unrelated to your changes. This is from my recent PR:
/private/tmp/helix/working/A18108BE/w/B44209DD/e /private/tmp/helix/working/A18108BE/w/B44209DD/e
Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.Runtime.Tests (found 9140 of 9186 test cases)
Starting: System.Runtime.Tests (parallel test collections = on, max threads = 4)
./RunTests.sh: line 168: 76212 Illegal instruction: 4 "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Runtime.Tests.runtimeconfig.json --depsfile System.Runtime.Tests.deps.json xunit.console.dll System.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=AdditionalTimezoneChecks -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/A18108BE/w/B44209DD/e
----- end Wed Mar 29 16:33:56 EDT 2023 ----- exit code 132 ----------------------------------------------------------
exit code 132 means SIGILL Illegal Instruction. Core dumped. Likely codegen issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems likely. I've reverted my cache size tweak. @janvorli could you review this change and sign off? I'm looking to check this in today/Monday. If you think I need someone else to signoff on the type system stuff, please let me know.
…gal instruction signal will go away in our testing" This reverts commit 9838460.
@davidwrighton it looks good to me except for the #83632 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
the LA64's PageSize by dotnet#83632.
Improve the performance of the
ExecutableAllocator
via a series of changes which collectively reduce number of cache misses within theExecutableAllocator::MapRW
function and reduce the cost of such cache misses.Adjustments to improve caching of RW pages:
ExecutableAllocator::m_cachedMapping[X]
array. In this PR based on experimentation with various different sizes of cache, I have chosen a cache size of 3 which shows substantial improvement from the previous cache size of 1. This cache is somewhat less effective in heavily multithreaded applications, but still works substantially better than a cache of 1.LoaderHeap
objects associated with the VirtualStubManager to using theCodeFragmentHeap
. This has 2 benefits. The first of which is that stub type identification is now substantially faster for large applications, as it is able to use theRangeSectionMap
to compute the stub kind instead of walking a singly linked list AKA theRangeList
, and that for relatively small applications, the same page used to hold recently modified jitted code will also be used to hold recently allocated VSD stubs. This will reduce the number of different mappings thatMapRW
needs to process.LoaderHeap
structures was raised to a minimum of 16KB from the 4KB that was used. This reduces the count of mappings for this sort of stub by a factor of 4, and makes it reasonable to skip caching the RW mapping for the interleaved structures, which reduces the load on the RW mapping cache.Cache miss rates...
Improvement to the performance of cache misses
ExecutableAllocator
is organized as a singly linked list. In the change I made a very simple tweak to move the found region to the front of the list. This has resulted in substantial drop in the count of mappings walked to find the right one. Effectively, this treats the entire list of mappable regions as a MRU optimized list, which turns out to be a good assumption.