
More efficient executable allocator #83632

Merged

Conversation


@davidwrighton davidwrighton commented Mar 18, 2023

Improve the performance of the ExecutableAllocator via a series of changes that collectively reduce the number of cache misses within the ExecutableAllocator::MapRW function and reduce the cost of the cache misses that remain.

Adjustments to improve caching of RW pages:

  • Add a concept of caching with multiple entries. This cache is fairly easy to configure by adjusting the size of the ExecutableAllocator::m_cachedMapping[X] array. In this PR, based on experimentation with various cache sizes, I have chosen a cache size of 3, which shows substantial improvement over the previous cache size of 1. This cache is somewhat less effective in heavily multithreaded applications, but still works substantially better than a cache of 1.
  • The various VSD stubs are moved from a collection of LoaderHeap objects associated with the VirtualStubManager to the CodeFragmentHeap. This has two benefits. First, stub type identification is now substantially faster for large applications, as it can use the RangeSectionMap to compute the stub kind instead of walking a singly linked list (the RangeList). Second, for relatively small applications, the same page used to hold recently modified jitted code will also be used to hold recently allocated VSD stubs. This reduces the number of different mappings that MapRW needs to process.
  • The page size used for interleaved LoaderHeap structures was raised from the previous 4KB to a minimum of 16KB. This reduces the count of mappings for this sort of stub by a factor of 4, and makes it reasonable to skip caching the RW mapping for the interleaved structures, which reduces the load on the RW mapping cache.
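The multi-entry cache described above can be sketched roughly as follows. This is a minimal illustration only: `BlockRW` here is a simplified stand-in, and `MappingCache`, `Covers`, `Find`, and `Add` are hypothetical names, not the actual runtime implementation.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the allocator's RW-mapping record.
struct BlockRW { void* baseRX; size_t size; };

constexpr int EXECUTABLE_ALLOCATOR_CACHE_SIZE = 3; // size chosen experimentally in the PR

// Does the cached mapping cover [addr, addr + len)?
static bool Covers(const BlockRW* e, void* addr, size_t len)
{
    char* p    = (char*)addr;
    char* base = (char*)e->baseRX;
    return p >= base && p + len <= base + e->size;
}

struct MappingCache
{
    BlockRW* entries[EXECUTABLE_ALLOCATOR_CACHE_SIZE] = {};

    // On a hit, promote the entry to slot 0 so the array stays MRU-ordered;
    // the last slot is then the natural eviction victim.
    BlockRW* Find(void* addr, size_t len)
    {
        for (int i = 0; i < EXECUTABLE_ALLOCATOR_CACHE_SIZE; i++)
        {
            BlockRW* e = entries[i];
            if (e != nullptr && Covers(e, addr, len))
            {
                for (int j = i; j > 0; j--) entries[j] = entries[j - 1];
                entries[0] = e;
                return e;
            }
        }
        return nullptr; // caller would create a fresh mapping and Add() it
    }

    // Insert a new mapping at the front, evicting the least recently used one.
    void Add(BlockRW* block)
    {
        for (int j = EXECUTABLE_ALLOCATOR_CACHE_SIZE - 1; j > 0; j--)
            entries[j] = entries[j - 1];
        entries[0] = block;
    }
};
```

With three MRU-ordered entries, a workload that alternates between a recently jitted code page and a couple of stub pages can keep all of them cached at once, which is where the drop in miss counts shown below comes from.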

Cache miss rates:

Application               Original cache miss count   New cache miss count
Powershell                1967                        154
Crossgen2                 2172                        428
ASP.NET MVC template app  2169                        138

Improving the performance of cache misses

  • The set of mappable regions in the ExecutableAllocator is organized as a singly linked list. In this change I made a very simple tweak: move the found region to the front of the list. This results in a substantial drop in the number of mappings walked to find the right one. Effectively, it treats the entire list of mappable regions as an MRU-optimized list, which turns out to be a good assumption.
Application               Original average walk length   New walk length
Powershell                103.3                          3.3
Crossgen2                 74.5                           2.4
ASP.NET MVC template app  101.3                          2.4
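The move-to-front tweak on the region list can be sketched like this (`Region` and `FindRegion` are illustrative names for the purpose of this sketch, not the actual allocator types):

```cpp
#include <cstddef>

// Illustrative stand-in for a mappable region in the allocator's list.
struct Region
{
    char*   start;
    size_t  size;
    Region* next;
};

// Walk the singly linked list looking for the region containing addr.
// On a hit anywhere past the head, unlink the node and splice it to the
// front, so frequently used regions cluster near the head (MRU order).
Region* FindRegion(Region** head, char* addr)
{
    Region* prev = nullptr;
    for (Region* cur = *head; cur != nullptr; prev = cur, cur = cur->next)
    {
        if (addr >= cur->start && addr < cur->start + cur->size)
        {
            if (prev != nullptr)
            {
                prev->next = cur->next; // unlink from current position
                cur->next  = *head;     // splice to the front of the list
                *head      = cur;
            }
            return cur;
        }
    }
    return nullptr;
}
```

Because lookups are heavily skewed toward a few hot regions, this small change takes the average walk length from roughly 100 nodes down to 2–3, as the table above shows.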

… RW mapping created, as it is never accessed again, and caching it is evicting useful data from the cache
…tHeaps

- Should reduce the amount of contention on the ExecutableAllocator cache
- Will improve the performance of identifying what type of stub is in use by avoiding the RangeList structure
  - Note, this will only apply to stubs which are used in somewhat larger applications
@davidwrighton
Member Author

Wrote more code... testing more stuff.

#if defined(TARGET_ARM64) && defined(TARGET_UNIX)
return max(16384, GetOsPageSize());
#elif defined(TARGET_ARM)
return 4096;
Member

Can you add a comment on why ARM is special here?

I also think it is typically more effective to use a multiply expression for the amount, for example 4 * 1024 and 16 * 1024. The 4096 is rather common so it's fine, but 16384 isn't.

Member Author

ARM is special because the instruction set does not allow a simple immediate offset as large as 16KB. That makes a 16KB offset impractical: the extra instructions needed to handle it cause measurable perf problems.

@@ -4,6 +4,17 @@
#include "pedecoder.h"
#include "executableallocator.h"

#ifdef ENABLE_MAPRW_STATISTICS
Member

I assume this can overflow. I would change them to unsigned at the very least or perhaps uint64_t which should always be sufficient.

if (envString != NULL)
{
int customCacheSize = atoi(envString);
if(customCacheSize != 0)
Member

Suggested change
if(customCacheSize != 0)
if (customCacheSize != 0)

if (index == 0)
return;

BlockRW*& cachedMapping = m_cachedMapping[index - 1];
Member

I don't think having a reference in this way is clearer. Ideally we would have an iterator, but we don't have that right now. I think it would be more appropriate to remove the reference and update line 338 to be m_cachedMapping[index - 1] = NULL.

{
for (size_t index = 0; index < EXECUTABLE_ALLOCATOR_CACHE_SIZE; index++)
{
BlockRW*& cachedMapping = m_cachedMapping[index];
Member

Suggested change
BlockRW*& cachedMapping = m_cachedMapping[index];
BlockRW* cachedMapping = m_cachedMapping[index];

@@ -218,10 +235,44 @@ ExecutableAllocator::~ExecutableAllocator()
}
}

#ifdef ENABLE_MAPRW_STATISTICS
void DumpMapRWStatistics()
Member

Could this just be folded into LOG_EXECUTABLE_ALLOCATOR_STATISTICS? I'm not sure there is need for yet another logging output.

Member Author

Yeah, I was just lazy and didn't need those... but it should probably be joined up with them, especially as the scheme I wrote doesn't really work in all environments.

@@ -1648,7 +1649,7 @@ void UnlockedLoaderHeap::UnlockedBackoutMem(void *pMem,
if (IsInterleaved())
{
// Clear the RW page
memset((BYTE*)pMem + GetOsPageSize(), 0x00, dwSize); // Fill freed region with 0
memset((BYTE*)pMem + GetStubCodePageSize(), 0x00, dwSize); // Fill freed region with 0
Member

Suggested change
memset((BYTE*)pMem + GetStubCodePageSize(), 0x00, dwSize); // Fill freed region with 0
memset((BYTE*)pMem + GetStubCodePageSize(), 0, dwSize); // Fill freed region with 0

@@ -5,7 +5,9 @@
#include "asmconstants.h"
#include "asmmacros.h"

#define DATA_SLOT(stub, field) (stub##Code + PAGE_SIZE + stub##Data__##field)
#define STUB_PAGE_SIZE 0x4000
Member

Suggested change
#define STUB_PAGE_SIZE 0x4000
#define STUB_PAGE_SIZE 16384

I think we should be consistent. Please tell me we can use decimal here. If we can't I am going to be very sad.

@@ -24,7 +24,7 @@ LEAF_END_MARKED FixupPrecodeCode
// NOTE: For LoongArch64 `CallCountingStubData__RemainingCallCountCell` must be zero !!!
// Because the stub-identifying token is $t1 within the `OnCallCountThresholdReachedStub`.
LEAF_ENTRY CallCountingStubCode
pcaddi $t2, 0x1000
pcaddi $t2, 0x4000
Member

Can we use a constant here? Searching for PAGE_SIZE would help immensely for when to update things. Embedding this in here makes for troublesome updates and breaking people accidentally.

Member

If this isn't something we are building, then add a comment at the top stating that the implied PAGE_SIZE value in the code below is 0x4000 (16384).

Member Author

We don't build this code, and frankly it's next to impossible to predict whether any given assembler will accept input that makes sense. I think this is likely to work, but it is a best guess.

Contributor
@shushanhf shushanhf Apr 17, 2023

@AaronRobinsonMSFT @davidwrighton
This is not right for LA64.
I will revert the modification for LA64. #84960

Contributor
@shushanhf shushanhf Apr 17, 2023

For LA64, the OS page size is 16K, so the pcaddi $t2, 0x1000 is right.
But the pcaddi $t2, 0x4000 is a 64K offset.

Comment on lines +567 to +568
DWORD cPagesPerHeap = cWastedPages / 2;
DWORD cPagesRemainder = cWastedPages % 2; // We'll throw this at the cache entry heap
Member

Some clarifying comments on the value 2 would be helpful. Is it possible we can compute this based on GetOsPageSize(), or is this simply a value we use for an optimization? We converted from 4k to 16k, a factor of 4, but we changed this by a factor of 3, so that is where I am coming from.

Member Author

This is about going from having 6 LoaderHeap structures as part of a VirtualCallStubManager to only having 2.

@davidwrighton davidwrighton marked this pull request as ready for review March 24, 2023 00:59
src/coreclr/utilcode/loaderheap.cpp (resolved)
src/coreclr/utilcode/loaderheap.cpp (resolved)
@@ -5,7 +5,8 @@
#include "unixasmmacros.inc"
#include "asmconstants.h"

PAGE_SIZE = 4096
; PAGE_SIZE must match the behavior of GetStubCodePageSize() on this architecture/os
PAGE_SIZE = 16384
Member

Can you please name this STUB_PAGE_SIZE like in the thunktemplates.asm?

@@ -297,11 +297,11 @@ void (*CallCountingStub::CallCountingStubCode)();
void CallCountingStub::StaticInitialize()
{
#if defined(TARGET_ARM64) && defined(TARGET_UNIX)
int pageSize = GetOsPageSize();
int pageSize = GetStubCodePageSize();
Member

It seems it would be good to modify ENUM_PAGE_SIZES too, so it does not enumerate sizes smaller than 16kB when they would never be used. And if such a change was made, you'd also want to modify thunktemplates.S for arm64 to generate fewer variants here:

.irp PAGE_SIZE, 4096, 8192, 16384, 32768, 65536

@@ -5,7 +5,7 @@
#include "unixasmmacros.inc"
#include "asmconstants.h"

PAGE_SIZE = 4096
PAGE_SIZE = 16384
Member

As I've mentioned in my other comment, I believe we should keep all 32 bit architectures on the 4096 page size for the limited VM space reasons.

Member Author

My understanding was that x86 doesn't benefit from keeping the 4096 page size, as it only runs on Windows, which has 64KB allocation granularity for memory reservation in the ExecutableAllocator in any case, thus resulting in no change to the available VM space.

Member

But we also have Linux x86 (not official, but used by the Samsung folks). As for Windows x86, I was somehow under the impression that it was not using the 64kB granularity, but I can see I was wrong. However, a larger page means potentially wasting more physical memory (if you allocate a 16kB page and end up using just a fraction of it). Not sure if it is worth it, but I'll leave it up to you.

@@ -4,6 +4,13 @@
#include "pedecoder.h"
#include "executableallocator.h"

#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
Member

Remove this empty ifdef?

#endif

#ifdef VARIABLE_SIZED_CACHEDMAPPING_SIZE
static int ExecutableAllocator_CachedMappingSize = 1;
Member

Can you please make this a static member of the ExecutableAllocator instead of this name prefixing?

Member

@davidwrighton it seems that you have added the ExecutableAllocator::g_cachedMappingSize, but you still kept the ExecutableAllocator_CachedMappingSize, it seems to me that the latter should be replaced everywhere by the former.

@@ -91,6 +108,12 @@ void ExecutableAllocator::DumpHolderUsage()
fprintf(stderr, "Reserve count: %lld\n", g_reserveCount);
fprintf(stderr, "Release count: %lld\n", g_releaseCount);

printf("g_MapRW_Calls: %lld\n", g_MapRW_Calls);
Member

Can you please change these to fprintf(stderr, like the rest of the logging uses?

src/coreclr/utilcode/executableallocator.cpp (resolved)
@@ -127,8 +127,12 @@ class ExecutableAllocator
// If variable sized mappings enabled, make the cache physically big enough to cover all interesting sizes
static int g_cachedMappingSize;
BlockRW* m_cachedMapping[16] = { 0 };
#else
#if defined(HOST_OSX) && defined(HOST_AMD64)
BlockRW* m_cachedMapping[1] = { 0 }; // OSX Amd64 doesn't behave correctly with more than one cached mapping.
Member

David, I can see some coreclr tests in the CI failing on x64 macOS with illegal instruction, so maybe the failure you are seeing is unrelated to your changes. This is from my recent PR:

/private/tmp/helix/working/A18108BE/w/B44209DD/e /private/tmp/helix/working/A18108BE/w/B44209DD/e
  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 9140 of 9186 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on, max threads = 4)
./RunTests.sh: line 168: 76212 Illegal instruction: 4  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Runtime.Tests.runtimeconfig.json --depsfile System.Runtime.Tests.deps.json xunit.console.dll System.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=AdditionalTimezoneChecks -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/A18108BE/w/B44209DD/e
----- end Wed Mar 29 16:33:56 EDT 2023 ----- exit code 132 ----------------------------------------------------------
exit code 132 means SIGILL Illegal Instruction. Core dumped. Likely codegen issue.

Member

@janvorli This looks like #84010 that was fixed 2 days ago.

Member Author

Seems likely. I've reverted my cache size tweak. @janvorli could you review this change and sign off? I'm looking to check this in today/Monday. If you think I need someone else to signoff on the type system stuff, please let me know.

…gal instruction signal will go away in our testing"

This reverts commit 9838460.
@janvorli
Member

janvorli commented Apr 5, 2023

@davidwrighton it looks good to me except for the #83632 (comment)

Member
@janvorli janvorli left a comment

LGTM, thank you!

@janvorli janvorli merged commit 11a0671 into dotnet:main Apr 13, 2023
shushanhf added a commit to shushanhf/runtime that referenced this pull request Apr 18, 2023
janvorli pushed a commit that referenced this pull request Apr 20, 2023
…3632. (#84960)

* [LoongArch64] revert the modification about
the LA64's PageSize by #83632.

* keep the comment by CR.
@ghost ghost locked as resolved and limited conversation to collaborators May 18, 2023