Replace Delegate MethodInfo cache with the MethodDesc #99200

Open · MichalPetryka wants to merge 30 commits into main

Conversation

@MichalPetryka (Contributor) commented Mar 3, 2024

First attempt at making delegate GC fields immutable in CoreCLR so that they can be allocated on the NonGC heap.

I've checked it locally with a simple app under corerun, using a delegate from an unloadable ALC, and it did not crash, assert, or unload the ALC from under the delegate. However, I couldn't find any runtime tests that verify delegates from unloadable ALCs work, so CI coverage might be missing.

One small concern is that this might make delegate equality checks slower, since they rely on comparing the methods in the final "slow path" check, which AFAIR is always hit when comparing distinct delegate instances.
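
For illustration, a minimal standalone example of the behavior in question (not code from this PR): two separately created delegates over the same method are never reference-equal, so Equals ends up comparing the underlying methods.

using System;

class DelegateEqualityDemo
{
    static void M() { }

    static void Main()
    {
        Action a = M;
        Action b = M;

        // Distinct delegate instances, so a reference check is not enough...
        Console.WriteLine(ReferenceEquals(a, b)); // False

        // ...and equality falls through to comparing targets and methods.
        Console.WriteLine(a == b);                // True
    }
}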

Contributes to #85014.

cc @jkotas

@MichalPetryka (Contributor, Author) commented Mar 4, 2024

I'm not sure what's up with the failures here; tests that fail on CI pass on my machine.
EDIT: I was testing with R2R disabled locally, since VS kept complaining about being unable to load PDBs for it.

src/coreclr/vm/object.cpp: review comment (outdated, resolved)
MichalPetryka and others added 3 commits March 4, 2024 16:40
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
@MichalPetryka marked this pull request as ready for review March 4, 2024 23:28
@AndyAyersMS (Member) commented:

/azp run runtime-coreclr gcstress0x3-gcstress0xc

Azure Pipelines successfully started running 1 pipeline(s).

@jkotas (Member) commented Mar 5, 2024

Could you please collect some perf numbers to give us an idea about the improvements and regressions in the affected areas? We may want to do some optimizations to mitigate the regressions.

@jkotas (Member) commented Mar 5, 2024

> One small concern is that this might make delegate equality checks slower, since they rely on comparing the methods in the final "slow path" check, which AFAIR is always hit when comparing distinct delegate instances.

The existing code tries to compare MethodInfos as a cheap fast path. Most delegates do not have a cached MethodInfo, so this fast path is hit rarely - but it is very cheap, so it is still worth it.

This cheap fast path is not cheap anymore with this change. It may be best to delete the fast path that is trying to compare the MethodInfos and potentially optimize Delegate_InternalEqualMethodHandles instead.
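
For readers following along, the comparison order being described looks roughly like the sketch below. This is not the actual CoreCLR source; the helper methods are hypothetical stand-ins for the internals mentioned here.

using System;
using System.Reflection;

// Simplified sketch of the method-comparison order, assuming the earlier parts
// of Equals (target/methodPtr checks) have already run.
static class DelegateMethodComparisonSketch
{
    public static bool MethodsEqual(Delegate x, Delegate y)
    {
        // Cheap fast path: only usable when both delegates already have a
        // materialized MethodInfo (e.g. because .Method was called on them).
        MethodInfo? mx = TryGetCachedMethodInfo(x);
        MethodInfo? my = TryGetCachedMethodInfo(y);
        if (mx is not null && my is not null)
            return mx.Equals(my);

        // Slow path: compare the underlying runtime method identities
        // (Delegate_InternalEqualMethodHandles in the VM).
        return EqualMethodHandles(x, y);
    }

    // Hypothetical stand-in for reading the delegate's cached MethodInfo field.
    private static MethodInfo? TryGetCachedMethodInfo(Delegate d) => null;

    // Hypothetical stand-in for the QCall into the VM.
    private static bool EqualMethodHandles(Delegate x, Delegate y)
        => x.Method == y.Method;
}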

@MichalPetryka (Contributor, Author) commented Mar 5, 2024

> Could you please collect some perf numbers to give us an idea about the improvements and regressions in the affected areas? We may want to do some optimizations to mitigate the regressions.

I think the things that need benchmarking here are the equality checks and maybe the GC impact of collectible delegates being stored in the CWT; the rest shouldn't be performance-sensitive enough to matter. I'm not sure what the proper way to benchmark the latter would be.

> This cheap fast path is not cheap anymore with this change. It may be best to delete the fast path that is trying to compare the MethodInfos and potentially optimize Delegate_InternalEqualMethodHandles instead.

I am going to benchmark the impact of the equality change tomorrow; if the impact isn't big, potential optimizations can be done later.

@jkotas (Member) commented Mar 5, 2024

> I'm not sure what the proper way to benchmark the latter would be.

Write a small program that loads an assembly as collectible and calls a method in it. The method in the collectible assembly can create delegates in a loop. (If you would like to do it under BenchmarkDotNet, that works too, but it is probably more work.)
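
A rough sketch of such a program, assuming a collectible payload assembly that exposes a static Payload.CreateDelegates(int) method (the path, type, and method names here are placeholders):

using System;
using System.Diagnostics;
using System.Reflection;
using System.Runtime.Loader;

class CollectibleDelegateBench
{
    static void Main()
    {
        // Load the payload assembly into a collectible context.
        var alc = new AssemblyLoadContext("bench", isCollectible: true);
        Assembly asm = alc.LoadFromAssemblyPath(@"C:\path\to\CollectiblePayload.dll");

        MethodInfo createDelegates =
            asm.GetType("Payload")!.GetMethod("CreateDelegates")!;

        // Time delegate creation inside the collectible context.
        var sw = Stopwatch.StartNew();
        createDelegates.Invoke(null, new object[] { 1_000_000 });
        sw.Stop();
        Console.WriteLine($"Created delegates in {sw.ElapsedMilliseconds} ms");

        alc.Unload();
    }
}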

@MichalPetryka (Contributor, Author) commented Oct 18, 2024

> We should be talking numbers: micro-benchmark numbers for the affected methods, and also how often the affected methods are called in the real world.

I've benchmarked the APIs; I had to massage the code a bit to make the JIT happy, though.
Equals ("Cached" means that .Method was called beforehand):


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JVFCER : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XPKMUE : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                   | Job        | Toolchain | Mean      | Error     | StdDev    | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--------------------------|------------|-----------|-----------|-----------|-----------|-------|---------|-----------|-----------|-------------|
| StaticUncachedUncached   | Job-JVFCER | PR        | 4.316 ns  | 0.1073 ns | 0.2788 ns | 0.14  | 0.01    | 1,161 B   | -         | NA          |
| StaticUncachedUncached   | Job-XPKMUE | Main      | 31.299 ns | 0.3014 ns | 0.2672 ns | 1.00  | 0.01    | 1,230 B   | -         | NA          |
| StaticCachedCached       | Job-JVFCER | PR        | 3.979 ns  | 0.1002 ns | 0.2221 ns | 0.74  | 0.04    | 1,161 B   | -         | NA          |
| StaticCachedCached       | Job-XPKMUE | Main      | 5.379 ns  | 0.1277 ns | 0.1255 ns | 1.00  | 0.03    | 1,233 B   | -         | NA          |
| InstanceUncachedUncached | Job-JVFCER | PR        | 4.892 ns  | 0.1166 ns | 0.1296 ns | 0.14  | 0.00    | 1,151 B   | -         | NA          |
| InstanceUncachedUncached | Job-XPKMUE | Main      | 36.004 ns | 0.7288 ns | 0.7798 ns | 1.00  | 0.03    | 1,212 B   | -         | NA          |
| InstanceCachedCached     | Job-JVFCER | PR        | 5.057 ns  | 0.1318 ns | 0.3759 ns | 0.92  | 0.07    | 1,151 B   | -         | NA          |
| InstanceCachedCached     | Job-XPKMUE | Main      | 5.522 ns  | 0.1119 ns | 0.1569 ns | 1.00  | 0.04    | 1,225 B   | -         | NA          |
| StaticUncachedCached     | Job-JVFCER | PR        | 3.628 ns  | 0.0808 ns | 0.0716 ns | 0.12  | 0.00    | 1,161 B   | -         | NA          |
| StaticUncachedCached     | Job-XPKMUE | Main      | 31.057 ns | 0.3800 ns | 0.3554 ns | 1.00  | 0.02    | 1,230 B   | -         | NA          |
| StaticCachedUncached     | Job-JVFCER | PR        | 3.569 ns  | 0.0495 ns | 0.0463 ns | 0.11  | 0.00    | 1,161 B   | -         | NA          |
| StaticCachedUncached     | Job-XPKMUE | Main      | 31.897 ns | 0.1509 ns | 0.1338 ns | 1.00  | 0.01    | 1,254 B   | -         | NA          |
| InstanceUncachedCached   | Job-JVFCER | PR        | 4.276 ns  | 0.1043 ns | 0.1958 ns | 0.13  | 0.01    | 1,151 B   | -         | NA          |
| InstanceUncachedCached   | Job-XPKMUE | Main      | 32.738 ns | 0.3142 ns | 0.2624 ns | 1.00  | 0.01    | 1,212 B   | -         | NA          |
| InstanceCachedUncached   | Job-JVFCER | PR        | 4.654 ns  | 0.1103 ns | 0.2490 ns | 0.14  | 0.01    | 1,151 B   | -         | NA          |
| InstanceCachedUncached   | Job-XPKMUE | Main      | 34.013 ns | 0.6977 ns | 0.9315 ns | 1.00  | 0.04    | 1,237 B   | -         | NA          |

GetHashCode:


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JVFCER : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XPKMUE : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method   | Job        | Toolchain | Mean     | Error     | StdDev    | Ratio | Code Size | Allocated | Alloc Ratio |
|----------|------------|-----------|----------|-----------|-----------|-------|-----------|-----------|-------------|
| Static   | Job-JVFCER | PR        | 2.198 ns | 0.0043 ns | 0.0041 ns | 0.50  | 672 B     | -         | NA          |
| Static   | Job-XPKMUE | Main      | 4.372 ns | 0.0058 ns | 0.0051 ns | 1.00  | 860 B     | -         | NA          |
| Instance | Job-JVFCER | PR        | 3.534 ns | 0.0140 ns | 0.0131 ns | 0.59  | 808 B     | -         | NA          |
| Instance | Job-XPKMUE | Main      | 5.997 ns | 0.0078 ns | 0.0073 ns | 1.00  | 883 B     | -         | NA          |

Method:


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-JVFCER : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XPKMUE : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method   | Job        | Toolchain | Mean     | Error     | StdDev    | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|----------|------------|-----------|----------|-----------|-----------|-------|---------|-----------|-----------|-------------|
| Static   | Job-JVFCER | PR        | 9.305 ns | 0.0154 ns | 0.0144 ns | 5.55  | 0.02    | 353 B     | -         | NA          |
| Static   | Job-XPKMUE | Main      | 1.677 ns | 0.0078 ns | 0.0069 ns | 1.00  | 0.01    | 1,556 B   | -         | NA          |
| Instance | Job-JVFCER | PR        | 9.317 ns | 0.0189 ns | 0.0177 ns | 5.54  | 0.03    | 353 B     | -         | NA          |
| Instance | Job-XPKMUE | Main      | 1.682 ns | 0.0098 ns | 0.0091 ns | 1.00  | 0.01    | 1,556 B   | -         | NA          |
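
For context, a BenchmarkDotNet harness for these measurements might look roughly like the sketch below; the class and method names are made up, and this is not the code that produced the numbers above.

using System;
using System.Reflection;
using BenchmarkDotNet.Attributes;

public class DelegateBench
{
    private static void StaticTarget() { }
    private void InstanceTarget() { }

    private Action _staticA = StaticTarget;
    private Action _staticB = StaticTarget;
    private Action _instanceA = null!;
    private Action _instanceB = null!;

    [GlobalSetup]
    public void Setup()
    {
        _instanceA = InstanceTarget;
        _instanceB = InstanceTarget;
        // The "Cached" variants would additionally touch .Method here so the
        // MethodInfo is materialized before measurement.
        _ = _staticA.Method;
    }

    [Benchmark]
    public bool StaticEquals() => _staticA.Equals(_staticB);

    [Benchmark]
    public bool InstanceEquals() => _instanceA.Equals(_instanceB);

    [Benchmark]
    public int StaticGetHashCode() => _staticA.GetHashCode();

    [Benchmark]
    public MethodInfo StaticMethod() => _staticA.Method;
}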

@MichalPetryka (Contributor, Author) commented:

I'm not sure if the System.Text.Json failures here are related; I saw them before on Windows x86 Debug and now they're on OSX x64 Debug, and I couldn't find an existing issue for them either.

@MichalPetryka (Contributor, Author) commented:

For context, my only motivation here is unblocking #85014: I already have non-PGO delegate inlining mostly working locally, but it currently only handles static readonly delegates, and it needs this change to work for lambdas and for #108606 and #108579 to work better with delegates in general. (I also don't have the NativeAOT VM implementation for my new JIT-EE API finished yet.)

@jkotas (Member) left a comment:

Nit: The PR title is not correct - this change is not making the delegates immutable anymore.

@@ -1725,8 +1729,10 @@ extern "C" void QCALLTYPE Delegate_Construct(QCall::ObjectHandleOnStack _this, Q
if (COMDelegate::NeedsWrapperDelegate(pMeth))
refThis = COMDelegate::CreateWrapperDelegate(refThis, pMeth);

refThis->SetMethodDesc(pMethOrig);
Review comment (Member):

I would leave it out from this PR. It is hard to see whether these places use the right MethodDesc in all corner cases.

Also, it is very incomplete, so it won't enable the JIT to do anything interesting with the MethodDesc either.

@MichalPetryka changed the title from "Make delegates immutable" to "Replace Delegate MethodInfo cache with the MethodDesc" on Oct 22, 2024
@jkotas (Member) commented Oct 23, 2024

@EgorBot -intel

using System;
using System.Diagnostics;
using System.Reflection.Emit;
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark]
    public void DummyDynamicMethod()
    {
        DynamicMethod dm = new DynamicMethod("Dummy", typeof(void), null);

        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ret);

        Action a = dm.CreateDelegate<Action>();
        a();
    }
}

@jkotas (Member) commented Oct 24, 2024

I am hesitant to merge this after looking into the perf impact some more. This change makes reflection and dynamic methods measurably more expensive, in ways that are going to show up; the EgorBot job that I have just submitted is an example of such a regression. I would like to have something to show for these regressions. We can look at this as part of a larger change that has improvements too.

@MichalPetryka (Contributor, Author) commented Oct 24, 2024

> I am hesitant to merge this after looking into the perf impact some more. This change makes reflection and dynamic methods measurably more expensive, in ways that are going to show up; the EgorBot job that I have just submitted is an example of such a regression. I would like to have something to show for these regressions. We can look at this as part of a larger change that has improvements too.

I assume the regression with DynamicMethods comes from StoreDynamicMethod now adding entries to the CWT, which adds GC pressure through the DependentHandles created for them. This could be a benchmark-only issue, since I'm not sure how often users create millions of DynamicMethods at once the way BDN does here.
One workaround I can think of would be overloading _invocationList even more and storing the DynamicMethod there, though that would make the IsLogicallyNull checks more expensive. I'm not sure whether not storing it in StoreDynamicMethod at all and lazily fetching it instead would work.
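
To illustrate the association pattern being discussed, here is a hypothetical sketch (the class and its members are made-up names, not the runtime's internals): keying DynamicMethods off delegate instances in a ConditionalWeakTable creates one dependent handle per live delegate, which the GC has to track.

using System;
using System.Reflection.Emit;
using System.Runtime.CompilerServices;

// Hypothetical illustration of the CWT association pattern: each Add creates
// a dependent handle that keeps the DynamicMethod alive as long as the
// delegate is alive, and that the GC must scan.
static class DynamicMethodTracker
{
    private static readonly ConditionalWeakTable<Delegate, DynamicMethod> s_table = new();

    public static void Track(Delegate d, DynamicMethod dm) => s_table.Add(d, dm);

    public static bool TryGet(Delegate d, out DynamicMethod? dm)
        => s_table.TryGetValue(d, out dm);
}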

As for the reflection part, I'd assume fetching .Method going from 2 ns to 10 ns wouldn't really matter in the big picture; pretty much any operation using the MethodInfo should dwarf that cost.

@MichalPetryka (Contributor, Author) commented:

Another thing to note: as far as I can see, NativeAOT doesn't cache the MethodInfo at all and fetches it on every call. In that case, I'd assume the reflection perf regression is even less of an issue, since it's already much worse there.

@jkotas (Member) commented Oct 24, 2024

> the regression with DynamicMethods comes from StoreDynamicMethod now adding entries to the CWT

I have seen a large part of the regressions coming from indirect effects of a large CWT. It makes the GC do more work and makes GC pause times worse.

> NativeAOT

Yes, the native AOT reflection implementation has a number of perf issues.

@MichalPetryka (Contributor, Author) commented:

@EgorBot -intel

using System;
using System.Diagnostics;
using System.Reflection.Emit;
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark]
    public void DummyDynamicMethod()
    {
        DynamicMethod dm = new DynamicMethod("Dummy", typeof(void), null);

        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ret);

        Action a = dm.CreateDelegate<Action>();
        a();
    }
}

@MichalPetryka (Contributor, Author) commented Oct 24, 2024

The overhead seems to be gone after moving the DynamicMethod to _invocationList.

@MichalPetryka (Contributor, Author) commented:

> I am hesitant to merge this after looking into the perf impact some more. This change makes reflection and dynamic methods measurably more expensive, in ways that are going to show up; the EgorBot job that I have just submitted is an example of such a regression. I would like to have something to show for these regressions. We can look at this as part of a larger change that has improvements too.

Since the DynamicMethod overhead is gone now, I think only the reflection overhead is left. I personally can't think of any way to avoid it while still making it possible to place delegates on FOH (other than putting the MethodInfos on FOH, but that seems even harder to do). Do you have any ideas for solving that which don't require rewriting delegates as a whole?
If there are no alternatives that remove these regressions, I'd prefer to justify them by the benefits of placing delegates on FOH and making lambdas better inlineable, especially in AOT scenarios, rather than give up.

@jkotas (Member) commented Oct 28, 2024

> I'd prefer to justify them by the benefits of placing delegates on FOH and making lambdas better inlineable

I understand why a change like this is necessary for placing delegates on FOH, but I do not think it helps with inlineability.

@MichalPetryka (Contributor, Author) commented:

> > I'd prefer to justify them by the benefits of placing delegates on FOH and making lambdas better inlineable
>
> I understand why a change like this is necessary for placing delegates on FOH, but I do not think it helps with inlineability.

Ah sorry, I meant that replacing the current field-caching scheme used by Roslyn with #85014 would help with that. So this indirectly helps by unlocking it, because we need to put them on FOH for that.

@jkotas (Member) commented Oct 28, 2024

> So this indirectly helps by unlocking it, because we need to put them on FOH for that.

I do not think it is strictly required to put the delegates on FOH. For throughput, the difference is one indirection when loading the delegate instance, which is not going to make a significant difference. In any case, the scheme from #85014 would need a path that does not put the delegates on FOH to handle collectible assemblies.

@MichalPetryka (Contributor, Author) commented:

> In any case, the scheme from #85014 would need a path that does not put the delegates on FOH to handle collectible assemblies.

Couldn't we allocate a separate FOH segment for each unloadable context?

@jkotas (Member) commented Oct 28, 2024

I do not see how it would work. Delegates for lambdas in collectible assemblies have to be tracked by the GC. They must keep the collectible assembly alive.

@MichalPetryka (Contributor, Author) commented:

> I do not see how it would work. Delegates for lambdas in collectible assemblies have to be tracked by the GC. They must keep the collectible assembly alive.

Ah right, I forgot about that. As far as I've seen, strings currently deal with collectible assemblies by allocating a pinned handle for every string. Would we want to do something similar, or would the delegates just live on the normal heap in that case?

@jkotas (Member) commented Oct 28, 2024

> a pinned handle for every string.

Nit: It is a pinned heap handle. It is not a regular pinned handle.

> Would we want to do something similar

Probably.

@MichalPetryka (Contributor, Author) commented:

@jkotas I was thinking: could we get rid of _invocationCount with this and move its contents into _methodPtrAux and _methodDesc now? We could rely on the fact that we won't see valid pointers below ushort.MaxValue and use that to distinguish whether a field holds a pointer or some other value.
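
To spell out the value-range trick being proposed, an illustrative sketch (the helper is a made-up name, not runtime code):

using System;

static class TaggedFieldSketch
{
    // Illustrative only: interpret small values stored in a pointer-sized
    // field as counts and larger values as real pointers. As noted in the
    // reply below, this assumption does not hold in general, since valid
    // pointers can be as low as 4 KB on supported platforms.
    public static bool LooksLikeCount(nint fieldValue)
        => (nuint)fieldValue < ushort.MaxValue;
}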

@jkotas (Member) commented Oct 29, 2024

> We could rely on the fact that we won't see valid pointers below ushort.MaxValue and use that to distinguish whether a field holds a pointer or some other value.

Pointers can be as low as 4 KB on the supported platforms. I agree that there may be ways to make the delegate smaller on CoreCLR, but I do not think this specific trick would work well.

@MichalPetryka (Contributor, Author) commented:

> > We could rely on the fact that we won't see valid pointers below ushort.MaxValue and use that to distinguish whether a field holds a pointer or some other value.
>
> Pointers can be as low as 4 KB on the supported platforms. I agree that there may be ways to make the delegate smaller on CoreCLR, but I do not think this specific trick would work well.

Ah right, unmanaged delegates already use the Aux field; that makes this problematic.
