
More efficient executable allocator #83632

Merged

Conversation


@davidwrighton davidwrighton commented Mar 18, 2023

Improve the performance of the ExecutableAllocator via a series of changes that collectively reduce the number of cache misses within the ExecutableAllocator::MapRW function and reduce the cost of the cache misses that remain.

Adjustments to improve caching of RW pages:

  • Add a concept of caching with multiple entries. This cache is fairly easy to configure by adjusting the size of the ExecutableAllocator::m_cachedMapping[X] array. In this PR, based on experimentation with various cache sizes, I have chosen a cache size of 3, which shows substantial improvement over the previous cache size of 1. This cache is somewhat less effective in heavily multithreaded applications, but still works substantially better than a cache of 1.
  • The various VSD stubs are moved from a collection of LoaderHeap objects associated with the VirtualStubManager to the CodeFragmentHeap. This has two benefits. First, stub type identification is now substantially faster for large applications, as it can use the RangeSectionMap to compute the stub kind instead of walking a singly linked list (the RangeList). Second, for relatively small applications, the same page used to hold recently modified jitted code will also be used to hold recently allocated VSD stubs. This reduces the number of different mappings that MapRW needs to process.
  • The page size used for interleaved LoaderHeap structures was raised from the previous 4KB to a minimum of 16KB. This reduces the count of mappings for this sort of stub by a factor of 4, and makes it reasonable to skip caching the RW mapping for the interleaved structures, which reduces the load on the RW mapping cache.
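The multi-entry cache described above can be sketched roughly as follows. This is a minimal illustration only: `BlockRW` here is a simplified stand-in, and `MappingCache`, `Covers`, `Find`, and `Add` are hypothetical names, not the actual runtime implementation.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the allocator's RW-mapping record.
struct BlockRW { void* baseRX; size_t size; };

constexpr int EXECUTABLE_ALLOCATOR_CACHE_SIZE = 3; // size chosen experimentally in the PR

// Does the cached mapping cover [addr, addr + len)?
static bool Covers(const BlockRW* e, void* addr, size_t len)
{
    char* p    = (char*)addr;
    char* base = (char*)e->baseRX;
    return p >= base && p + len <= base + e->size;
}

struct MappingCache
{
    BlockRW* entries[EXECUTABLE_ALLOCATOR_CACHE_SIZE] = {};

    // On a hit, promote the entry to slot 0 so the array stays MRU-ordered;
    // the last slot is then the natural eviction victim.
    BlockRW* Find(void* addr, size_t len)
    {
        for (int i = 0; i < EXECUTABLE_ALLOCATOR_CACHE_SIZE; i++)
        {
            BlockRW* e = entries[i];
            if (e != nullptr && Covers(e, addr, len))
            {
                for (int j = i; j > 0; j--) entries[j] = entries[j - 1];
                entries[0] = e;
                return e;
            }
        }
        return nullptr; // caller would create a fresh mapping and Add() it
    }

    // Insert a new mapping at the front, evicting the least recently used one.
    void Add(BlockRW* block)
    {
        for (int j = EXECUTABLE_ALLOCATOR_CACHE_SIZE - 1; j > 0; j--)
            entries[j] = entries[j - 1];
        entries[0] = block;
    }
};
```

With three MRU-ordered entries, a workload that alternates between a recently jitted code page and a couple of stub pages can keep all of them cached at once, which is where the drop in miss counts shown below comes from.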

Cache miss rates:

Application               Original cache miss count   New cache miss count
Powershell                1967                        154
Crossgen2                 2172                        428
ASP.NET MVC template app  2169                        138

Improving the performance of cache misses

  • The set of mappable regions in the ExecutableAllocator is organized as a singly linked list. In this change I made a very simple tweak: move the found region to the front of the list. This results in a substantial drop in the number of mappings walked to find the right one. Effectively, it treats the entire list of mappable regions as an MRU-optimized list, which turns out to be a good assumption.
Application               Original average walk length   New walk length
Powershell                103.3                          3.3
Crossgen2                 74.5                           2.4
ASP.NET MVC template app  101.3                          2.4
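The move-to-front tweak on the region list can be sketched like this (`Region` and `FindRegion` are illustrative names for the purpose of this sketch, not the actual allocator types):

```cpp
#include <cstddef>

// Illustrative stand-in for a mappable region in the allocator's list.
struct Region
{
    char*   start;
    size_t  size;
    Region* next;
};

// Walk the singly linked list looking for the region containing addr.
// On a hit anywhere past the head, unlink the node and splice it to the
// front, so frequently used regions cluster near the head (MRU order).
Region* FindRegion(Region** head, char* addr)
{
    Region* prev = nullptr;
    for (Region* cur = *head; cur != nullptr; prev = cur, cur = cur->next)
    {
        if (addr >= cur->start && addr < cur->start + cur->size)
        {
            if (prev != nullptr)
            {
                prev->next = cur->next; // unlink from current position
                cur->next  = *head;     // splice to the front of the list
                *head      = cur;
            }
            return cur;
        }
    }
    return nullptr;
}
```

Because lookups are heavily skewed toward a few hot regions, this small change takes the average walk length from roughly 100 nodes down to 2–3, as the table above shows.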

… RW mapping created, as it is never accessed again, and caching it is evicting useful data from the cache
…tHeaps

- Should reduce the amount of contention on the ExecutableAllocator cache
- Will improve the performance of identifying what type of stub is in use by avoiding the RangeList structure
  - Note, this will only apply to stubs which are used in somewhat larger applications
@davidwrighton
Member Author

Wrote more code... testing more stuff.

#if defined(TARGET_ARM64) && defined(TARGET_UNIX)
return max(16384, GetOsPageSize());
#elif defined(TARGET_ARM)
return 4096;
Member

Can you add a comment on why ARM is special here?

I also think it is typically more effective to use a multiply expression for the amount, for example 4 * 1024 and 16 * 1024. The 4096 is rather common so it's fine, but 16384 isn't.

Member Author

ARM is special because the instruction set does not allow a simple immediate offset as large as 16KB. That makes a 16KB offset impractical: the extra instructions needed to handle it cause measurable perf problems.

@@ -4,6 +4,17 @@
#include "pedecoder.h"
#include "executableallocator.h"

#ifdef ENABLE_MAPRW_STATISTICS
Member

I assume this can overflow. I would change them to unsigned at the very least or perhaps uint64_t which should always be sufficient.

if (envString != NULL)
{
int customCacheSize = atoi(envString);
if(customCacheSize != 0)
Member

Suggested change
if(customCacheSize != 0)
if (customCacheSize != 0)

if (index == 0)
return;

BlockRW*& cachedMapping = m_cachedMapping[index - 1];
Member

I don't think having a reference in this way is clearer. Ideally we would have an iterator, but we don't have that right now. I think it would be more appropriate to remove the reference and update line 338 to be m_cachedMapping[index - 1] = NULL.

{
for (size_t index = 0; index < EXECUTABLE_ALLOCATOR_CACHE_SIZE; index++)
{
BlockRW*& cachedMapping = m_cachedMapping[index];
Member

Suggested change
BlockRW*& cachedMapping = m_cachedMapping[index];
BlockRW* cachedMapping = m_cachedMapping[index];

@@ -218,10 +235,44 @@ ExecutableAllocator::~ExecutableAllocator()
}
}

#ifdef ENABLE_MAPRW_STATISTICS
void DumpMapRWStatistics()
Member

Could this just be folded into LOG_EXECUTABLE_ALLOCATOR_STATISTICS? I'm not sure there is need for yet another logging output.

Member Author

Yeah, I was just lazy and didn't need those... but it should probably be joined up with them, especially as the scheme I wrote doesn't really work in all environments.

@@ -1648,7 +1649,7 @@ void UnlockedLoaderHeap::UnlockedBackoutMem(void *pMem,
if (IsInterleaved())
{
// Clear the RW page
memset((BYTE*)pMem + GetOsPageSize(), 0x00, dwSize); // Fill freed region with 0
memset((BYTE*)pMem + GetStubCodePageSize(), 0x00, dwSize); // Fill freed region with 0
Member

Suggested change
memset((BYTE*)pMem + GetStubCodePageSize(), 0x00, dwSize); // Fill freed region with 0
memset((BYTE*)pMem + GetStubCodePageSize(), 0, dwSize); // Fill freed region with 0

@@ -5,7 +5,9 @@
#include "asmconstants.h"
#include "asmmacros.h"

#define DATA_SLOT(stub, field) (stub##Code + PAGE_SIZE + stub##Data__##field)
#define STUB_PAGE_SIZE 0x4000
Member

Suggested change
#define STUB_PAGE_SIZE 0x4000
#define STUB_PAGE_SIZE 16384

I think we should be consistent. Please tell me we can use decimal here. If we can't I am going to be very sad.

@@ -24,7 +24,7 @@ LEAF_END_MARKED FixupPrecodeCode
// NOTE: For LoongArch64 `CallCountingStubData__RemainingCallCountCell` must be zero !!!
// Because the stub-identifying token is $t1 within the `OnCallCountThresholdReachedStub`.
LEAF_ENTRY CallCountingStubCode
pcaddi $t2, 0x1000
pcaddi $t2, 0x4000
Member

Can we use a constant here? Searching for PAGE_SIZE would help immensely for when to update things. Embedding this in here makes for troublesome updates and breaking people accidentally.

Member

If this isn't something we are building, then add a comment at the top stating that the implied PAGE_SIZE value in the code below is 0x4000 (16384).

Member Author

We don't build this code, and frankly it's next to impossible to predict whether any given assembler will accept input that makes sense. I think this is likely to work, but it is a best guess.

Contributor
@shushanhf shushanhf Apr 17, 2023

@AaronRobinsonMSFT @davidwrighton
This is not right for LA64.
I will revert the modification for LA64. #84960

Contributor
@shushanhf shushanhf Apr 17, 2023

For LA64, the OS page size is 16K, so the pcaddi $t2, 0x1000 is right.
But the pcaddi $t2, 0x4000 is a 64K offset.

Comment on lines +567 to +568
DWORD cPagesPerHeap = cWastedPages / 2;
DWORD cPagesRemainder = cWastedPages % 2; // We'll throw this at the cache entry heap
Member

Some clarifying comments on the value 2 would be helpful. Is it possible we can compute this based on GetOsPageSize(), or is this simply a value we use for an optimization? We converted from 4k to 16k, a factor of 4, but we changed this by a factor of 3, so that is where I am coming from.

Member Author

This is about going from having 6 LoaderHeap structures as part of a VirtualCallStubManager to only having 2.

@davidwrighton davidwrighton marked this pull request as ready for review March 24, 2023 00:59
src/coreclr/utilcode/loaderheap.cpp (resolved)
src/coreclr/utilcode/loaderheap.cpp (resolved)
@@ -5,7 +5,8 @@
#include "unixasmmacros.inc"
#include "asmconstants.h"

PAGE_SIZE = 4096
; PAGE_SIZE must match the behavior of GetStubCodePageSize() on this architecture/os
PAGE_SIZE = 16384
Member

Can you please name this STUB_PAGE_SIZE like in the thunktemplates.asm?

@@ -297,11 +297,11 @@ void (*CallCountingStub::CallCountingStubCode)();
void CallCountingStub::StaticInitialize()
{
#if defined(TARGET_ARM64) && defined(TARGET_UNIX)
int pageSize = GetOsPageSize();
int pageSize = GetStubCodePageSize();
Member

It seems it would be good to modify ENUM_PAGE_SIZES too, so it does not enumerate sizes smaller than 16kB when they would never be used. And if such a change was made, you'd also want to modify thunktemplates.S for arm64 to generate fewer variants here:

.irp PAGE_SIZE, 4096, 8192, 16384, 32768, 65536

@@ -5,7 +5,7 @@
#include "unixasmmacros.inc"
#include "asmconstants.h"

PAGE_SIZE = 4096
PAGE_SIZE = 16384
Member

As I've mentioned in my other comment, I believe we should keep all 32 bit architectures on the 4096 page size for the limited VM space reasons.

Member Author

My understanding was that x86 doesn't benefit from keeping the 4096 page size, as it only runs on Windows, which has 64KB allocation granularity for memory reservation in the ExecutableAllocator in any case, thus resulting in no change to the available VM space.

Member

But we also have Linux x86 (not official, but used by the Samsung folks). As for Windows x86, I was somehow under the impression that it was not using the 64kB granularity, but I can see I was wrong. However, a larger page means potentially wasting more physical memory (if you allocate a 16kB page and end up using just a fraction of it). Not sure if it is worth it, but I'll leave it up to you.

@@ -4,6 +4,13 @@
#include "pedecoder.h"
#include "executableallocator.h"

#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
Member

Remove this empty ifdef?

#endif

#ifdef VARIABLE_SIZED_CACHEDMAPPING_SIZE
static int ExecutableAllocator_CachedMappingSize = 1;
Member

Can you please make this a static member of the ExecutableAllocator instead of this name prefixing?

Member

@davidwrighton it seems that you have added the ExecutableAllocator::g_cachedMappingSize, but you still kept the ExecutableAllocator_CachedMappingSize, it seems to me that the latter should be replaced everywhere by the former.

@@ -91,6 +108,12 @@ void ExecutableAllocator::DumpHolderUsage()
fprintf(stderr, "Reserve count: %lld\n", g_reserveCount);
fprintf(stderr, "Release count: %lld\n", g_releaseCount);

printf("g_MapRW_Calls: %lld\n", g_MapRW_Calls);
Member

Can you please change these to fprintf(stderr, like the rest of the logging uses?

src/coreclr/utilcode/executableallocator.cpp (resolved)
@@ -127,8 +127,12 @@ class ExecutableAllocator
// If variable sized mappings enabled, make the cache physically big enough to cover all interesting sizes
static int g_cachedMappingSize;
BlockRW* m_cachedMapping[16] = { 0 };
#else
#if defined(HOST_OSX) && defined(HOST_AMD64)
BlockRW* m_cachedMapping[1] = { 0 }; // OSX Amd64 doesn't behave correctly with more than one cached mapping.
Member

David, I can see some coreclr tests in the CI failing on x64 macOS with illegal instruction, so maybe the failure you are seeing is unrelated to your changes. This is from my recent PR:

/private/tmp/helix/working/A18108BE/w/B44209DD/e /private/tmp/helix/working/A18108BE/w/B44209DD/e
  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 9140 of 9186 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on, max threads = 4)
./RunTests.sh: line 168: 76212 Illegal instruction: 4  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Runtime.Tests.runtimeconfig.json --depsfile System.Runtime.Tests.deps.json xunit.console.dll System.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=AdditionalTimezoneChecks -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/A18108BE/w/B44209DD/e
----- end Wed Mar 29 16:33:56 EDT 2023 ----- exit code 132 ----------------------------------------------------------
exit code 132 means SIGILL Illegal Instruction. Core dumped. Likely codegen issue.

Member

@janvorli This looks like #84010 that was fixed 2 days ago.

Member Author

Seems likely. I've reverted my cache size tweak. @janvorli could you review this change and sign off? I'm looking to check this in today/Monday. If you think I need someone else to signoff on the type system stuff, please let me know.

…gal instruction signal will go away in our testing"

This reverts commit 9838460.
@janvorli
Member

janvorli commented Apr 5, 2023

@davidwrighton it looks good to me except for the #83632 (comment)

Member
@janvorli janvorli left a comment

LGTM, thank you!

@janvorli janvorli merged commit 11a0671 into dotnet:main Apr 13, 2023
shushanhf added a commit to shushanhf/runtime that referenced this pull request Apr 18, 2023
janvorli pushed a commit that referenced this pull request Apr 20, 2023
…3632. (#84960)

* [LoongArch64] revert the modification about
the LA64's PageSize by #83632.

* keep the comment by CR.
@ghost ghost locked as resolved and limited conversation to collaborators May 18, 2023