Fix race condition between GCInfo and Rundown #70609

Merged
merged 1 commit into from
Jun 30, 2022

davmason
Member

Fixes #69375

  • Adds a check to see if the GCInfo is published yet before trying to emit a rundown event
  • Publishes GCInfo under the code heap crst so we can only see an uninitialized GCInfo or a complete one
  • Moves the freeing of the temporary code heaps in LCG methods to after we call FreeCodeMemory so we don't race on deletion
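
The first two bullets can be illustrated with a minimal sketch. It assumes a plain `std::mutex` in place of the code heap crst, and every name in it (`CodeHeaderSketch`, `PublishGCInfo`, `TryGetGCInfoSize`) is a hypothetical stand-in, not the runtime's actual types. The point is only that a reader holding the same lock sees either a null pointer or a fully built GCInfo, never a half-written one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a code header; not the runtime's CodeHeader.
struct CodeHeaderSketch {
    const std::vector<std::uint8_t>* gcInfo = nullptr; // null until published
};

// Plays the role of the code heap crst in this sketch.
static std::mutex g_codeHeapLock;

// Writer side: the GC info is built completely first, then the pointer
// is stored while holding the lock.
void PublishGCInfo(CodeHeaderSketch& header,
                   const std::vector<std::uint8_t>* completed) {
    assert(completed != nullptr);
    std::lock_guard<std::mutex> hold(g_codeHeapLock);
    header.gcInfo = completed;
}

// Reader side (rundown): returns false when nothing is published yet,
// in which case the caller skips the method instead of decoding garbage.
bool TryGetGCInfoSize(const CodeHeaderSketch& header, std::size_t* size) {
    std::lock_guard<std::mutex> hold(g_codeHeapLock);
    if (header.gcInfo == nullptr) {
        return false;
    }
    *size = header.gcInfo->size();
    return true;
}
```

A rundown loop built on this would call `TryGetGCInfoSize` for each method and emit an event only when it returns true.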

@davmason davmason requested a review from a team June 11, 2022 09:19
@davmason davmason self-assigned this Jun 11, 2022
@davmason davmason added this to the 7.0.0 milestone Jun 11, 2022
@jkotas
Member

jkotas commented Jun 11, 2022

This makes the JIT/EE contract complicated. Instead of teaching the JIT the intricacies of publishing the code and related artifacts, which is a VM implementation detail, it would be better to publish everything in the right order once the JIT returns here:

jitInfo.WriteCode(jitMgr);

We are doing some of the publishing in this place already.

@davmason
Member Author

@jkotas Looking at the prior art, it seems like EEJitManager is a good place to put this sort of logic. Is that right?

Then CEEJitInfo::WriteCode can call EEJitManager::PublishGCInfo, which can take the code heap lock. In rundown we can then check and either get a fully published GCInfo, or find it uninitialized and skip it.

@jkotas
Member

jkotas commented Jun 15, 2022

> @jkotas Looking at the prior art, it seems like EEJitManager is a good place to put this sort of logic. Is that right?

Yep.

@davmason
Member Author

Before this change, the test app below would hit asserts on a checked build within a minute. With the change, I have run it for an hour or so on Windows x64 with no issues. On x86 it runs out of memory after ~25k dynamic methods, but works fine up until then.

I am going to run it against Linux arm32 to make sure, but I have no reason to suspect it will be any different there.


using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Reflection;
using System.Reflection.Emit;

Console.WriteLine("Hello, World!");

long numDynamicMethods = 0;
List<Thread> threads = new List<Thread>();
int numThreads = 100;
for (int i = 0; i < numThreads; i++)
{
    Thread t = new Thread(MakeDynamicMethods);
    t.Start();
    threads.Add(t);
}

Thread gcTriggerThread = new Thread(() =>
    {
        while (true)
        {
            Thread.Sleep(100);
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
    });

gcTriggerThread.Start();
threads.Add(gcTriggerThread);

while (true)
{
    Console.WriteLine($"New EventPipe session, dynamic methods={numDynamicMethods}");

    int processId = Process.GetCurrentProcess().Id;
    DiagnosticsClient client = new DiagnosticsClient(processId);

    int numEvents = 0;
    List<EventPipeProvider> providers = new List<EventPipeProvider>()
    {
        new EventPipeProvider("Microsoft-Windows-DotNETRuntime", EventLevel.Verbose),
        new EventPipeProvider("Microsoft-Windows-DotNETRuntimeRundown", EventLevel.Verbose),
        new EventPipeProvider("Microsoft-DotNETCore-SampleProfiler", EventLevel.Verbose),
    };
    using (EventPipeSession session = client.StartEventPipeSession(providers, /* requestRunDown */ true))
    { 
        EventPipeEventSource source = new EventPipeEventSource(session.EventStream);
        source.Dynamic.All += (TraceEvent traceEvent) =>
        {
            ++numEvents;
        };

        Thread processingThread = new Thread(new ThreadStart(() =>
        {
            source.Process();
            Console.WriteLine($"Saw {numEvents} events.");
        }));
        processingThread.Start();

        Thread.Sleep(100);

        // Give the dynamic method threads time to run, then stop the session.
        // Rundown was requested, so stopping races it against dynamic method creation.
        session.Stop();

        processingThread.Join();
    }
}

void MakeDynamicMethods(object? obj)
{
    Random random = new Random();
    while (true)
    {
        AssemblyName name = new AssemblyName(GetRandomName());
        AssemblyBuilder dynamicAssembly = AssemblyBuilder.DefineDynamicAssembly(name, AssemblyBuilderAccess.RunAndCollect);
        ModuleBuilder dynamicModule = dynamicAssembly.DefineDynamicModule(GetRandomName());

        Type[] methodArgs = { typeof(int) };

        DynamicMethod squareIt = new DynamicMethod(
            "SquareIt",
            typeof(long),
            methodArgs,
            dynamicModule);

        ILGenerator il = squareIt.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Conv_I8);
        il.Emit(OpCodes.Dup);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Ret);

        OneParameter<long, int> invokeSquareIt =
            (OneParameter<long, int>)
            squareIt.CreateDelegate(typeof(OneParameter<long, int>));

        invokeSquareIt(random.Next());

        Interlocked.Increment(ref numDynamicMethods);
    }
}

static string GetRandomName()
{
    return Guid.NewGuid().ToString();
}

delegate long SquareItInvoker(int input);

delegate TReturn OneParameter<TReturn, TParameter0>
    (TParameter0 p0);

// it is not in use yet.
#ifdef TARGET_X86
hdrInfo gcInfo;
DecodeGCHdrInfo(codeInfo.GetGCInfoToken(),
Member

Should this rather check for NULL GetGCInfoToken? I do not think we should be attempting to decode the GC info that has not been published.

Member

Or maybe this change is not needed with the new approach?

Member Author

I spent a couple of hours today trying to convince myself whether this check is needed, and the more I look at it, the more convinced I am that the race condition is different from the original hypothesis.

The EECodeHeapIterator uses MethodSectionIterator to go through all active jitted methods, and it will only return a method if the appropriate index in HeapList::pHdrMap is set.

But we only set the index in pHdrMap in EEJitManager::NibbleMapSet, which is called from CEEJitInfo::WriteCode, after the GCInfo is generated.

Either I'm missing something, or the real issue is a combination of the freeing happening in the wrong order on all arches, and then pointer tearing on arm architectures because the publishing for codeHeaders happens outside the lock:

memcpy(codeWriterHolder.GetRW(), m_CodeHeaderRW, m_codeWriteBufferSize);

If I run my repro app with the fix to move freeing the code header to after freeing the code data, I no longer hit the assert on x64, which suggests but does not confirm my hypothesis
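
The visibility argument in this comment can be modeled with a small single-threaded sketch. It is an assumption-laden toy: `MethodSlotSketch`, `FinishJit`, and `EnumerateVisible` are hypothetical names, two booleans stand in for the GCInfo and the `pHdrMap` entry, and it ignores memory ordering between threads entirely. It only shows why an iterator gated on the map entry should never return a method whose GCInfo is missing, as long as the map entry is set last.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-method state; two flags stand in for the real structures.
struct MethodSlotSketch {
    bool hasGCInfo = false;   // set when the GC info has been generated
    bool inHeaderMap = false; // set last, like the pHdrMap entry
};

// Mirrors the order described above: GC info first, map entry afterwards.
void FinishJit(MethodSlotSketch& slot) {
    slot.hasGCInfo = true;
    slot.inHeaderMap = true;
}

// Returns the indices of methods an iterator gated on the map would yield.
std::vector<std::size_t> EnumerateVisible(
        const std::vector<MethodSlotSketch>& slots) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (!slots[i].inHeaderMap) {
            continue; // not published yet, so the iterator never sees it
        }
        // With the ordering in FinishJit, visible implies GC info exists.
        assert(slots[i].hasGCInfo);
        visible.push_back(i);
    }
    return visible;
}
```

A method that is still mid-JIT (GC info generated, map entry not yet set) is skipped, which is why the missing-GCInfo explanation for the assert looked doubtful.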

Member Author

Hmm, the memcpy I point out there also happens before NibbleMapSet, so I think I am missing something

@@ -10970,6 +10970,12 @@ void CEEJitInfo::WriteCode(EEJitManager * jitMgr)
UnwindInfoTable::PublishUnwindInfoForMethod(m_moduleBase, m_CodeHeader->GetUnwindInfo(0), m_totalUnwindInfos);
#endif // defined(TARGET_AMD64)

{
ExecutableWriterHolder<BYTE *> gcInfoWriterHolder(m_CodeHeader->GetGCInfoAddr(), sizeof(void *));
Member

ExecutableWriterHolder is not a cheap operation. WriteCodeBytes has one already. Can we refactor such that we have just one ExecutableWriterHolder for both operations?
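
As a rough illustration of the refactoring being asked for, the sketch below uses one holder for both writes. `WriterHolderSketch` and `HeaderSketch` are hypothetical; the real ExecutableWriterHolder maps a writable view of executable memory, whereas this stand-in only counts how many holders are created so the cost is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical RAII stand-in that counts how many "mappings" get created.
class WriterHolderSketch {
public:
    static int mappingsCreated;

    WriterHolderSketch(unsigned char* target, std::size_t size)
        : m_rw(target), m_size(size) {
        ++mappingsCreated; // stands in for the expensive mapping step
    }

    unsigned char* GetRW() const { return m_rw; }
    std::size_t Size() const { return m_size; }

private:
    unsigned char* m_rw;
    std::size_t m_size;
};

int WriterHolderSketch::mappingsCreated = 0;

// Hypothetical header that carries both the GC info pointer and the code.
struct HeaderSketch {
    const void* gcInfo;
    unsigned char code[16];
};

// One holder spans the whole header, so both writes share a single mapping.
void WriteCodeAndGCInfo(HeaderSketch& header, const unsigned char* code,
                        std::size_t codeSize, const void* gcInfo) {
    assert(codeSize <= sizeof(header.code));
    WriterHolderSketch holder(reinterpret_cast<unsigned char*>(&header),
                              sizeof(header));
    HeaderSketch* rw = reinterpret_cast<HeaderSketch*>(holder.GetRW());
    std::memcpy(rw->code, code, codeSize);
    rw->gcInfo = gcInfo;
}
```

Creating a second holder just for the pointer-sized GC info field would double the count for no benefit, which is the concern raised in the review.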

@@ -3215,21 +3215,22 @@ BYTE* EEJitManager::allocGCInfo(CodeHeader* pCodeHeader, DWORD blockSize, size_t
} CONTRACTL_END;

MethodDesc* pMD = pCodeHeader->GetMethodDesc();
Member

You can just pass in the MethodDesc instead of the whole CodeHeader. This method does not need the CodeHeader anymore.

block = m_jitManager->allocGCInfo(m_CodeHeaderRW,(DWORD)size, &m_GCinfo_len);
if (!block)
m_pGCInfo = m_jitManager->allocGCInfo(m_CodeHeaderRW,(DWORD)size, &m_GCinfo_len);
if (!m_pGCInfo)
Member

This should not be needed. allocGCInfo throws on OOM.
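
The review point is about allocation contracts in general: when an allocator reports failure by throwing, a null check on its result is dead code. A minimal sketch, assuming a hypothetical fixed-budget arena (`GCInfoArenaSketch`) that throws `std::bad_alloc` the way the comment says allocGCInfo throws on OOM:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical allocator with a fixed budget. It throws on exhaustion
// instead of returning null.
class GCInfoArenaSketch {
public:
    explicit GCInfoArenaSketch(std::size_t budget)
        : m_storage(budget), m_used(0) {}

    unsigned char* Alloc(std::size_t size) {
        assert(size > 0);
        if (size > m_storage.size() - m_used) {
            throw std::bad_alloc();
        }
        unsigned char* block = m_storage.data() + m_used;
        m_used += size;
        return block;
    }

private:
    std::vector<unsigned char> m_storage;
    std::size_t m_used;
};

// Caller side: there is no null check, because failure arrives as an exception.
bool TryAllocGCInfo(GCInfoArenaSketch& arena, std::size_t size) {
    try {
        unsigned char* block = arena.Alloc(size);
        (void)block; // a real caller would write the GC info here
        return true;
    } catch (const std::bad_alloc&) {
        return false;
    }
}
```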

@tommcdon
Member

@davmason is this PR still active or should we move to draft mode?

@davmason
Member Author

> @davmason is this PR still active or should we move to draft mode?

Still active, I just didn't finish it before I took vacation.

@davmason
Member Author

@jkotas - I've tested the heck out of it and have convinced myself the only change needed is to free the dynamic code heaps after freeing the code data. I can run my test program for hours without a crash with just that change
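
The ordering that ended up being the whole fix can be sketched as follows. This is a single-threaded toy under stated assumptions: `TempHeapSketch` and `DestroyDynamicMethod` are hypothetical names, and the sketch shows only the teardown order (code data first, temporary heap last), not the concurrent rundown that made the wrong order crash.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical temporary heap that backs one dynamic method's code data.
struct TempHeapSketch {
    int liveBlocks = 1; // one block of code data still lives in this heap
};

// Tears a dynamic method down and returns the order of the steps taken.
std::vector<std::string> DestroyDynamicMethod(
        std::unique_ptr<TempHeapSketch> heap) {
    std::vector<std::string> order;
    assert(heap != nullptr);

    // Step 1: free the code data while its backing heap still exists.
    --heap->liveBlocks;
    order.push_back("code data");

    // Step 2: only now delete the heap, since nothing inside it is live.
    assert(heap->liveBlocks == 0);
    heap.reset();
    order.push_back("temporary heap");

    return order;
}
```

Reversing the two steps would mean the first one touches a heap that has already been deleted, which is the kind of use-after-free the reordering avoids.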

Member

@jkotas jkotas left a comment

Thanks

@jkotas
Member

jkotas commented Jun 30, 2022

The test failure is #70450

@jkotas jkotas merged commit 02b840c into dotnet:main Jun 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 30, 2022
Successfully merging this pull request may close these issues.

tracing/eventpipe/eventsvalidation tests failing with AF: codeLength > 0
3 participants