Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Fix Lock's usage of NativeRuntimeEventSource.Log to account for recursive accesses during its own class construction #94873

Merged
merged 2 commits into from
Jan 18, 2024

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Nov 16, 2023

NativeRuntimeEventSource.Log can be null if it's being constructed in the same thread earlier in the stack. Added null checks.

Fixes #94728

@kouvel kouvel added this to the 9.0.0 milestone Nov 16, 2023
@kouvel kouvel self-assigned this Nov 16, 2023
@ghost
Copy link

ghost commented Nov 16, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

NativeRuntimeEventSource.Log can be null if it's being constructed in the same thread earlier in the stack. Added null checks.

Fixes #94728

Author: kouvel
Assignees: kouvel
Labels:

area-NativeAOT-coreclr

Milestone: 9.0.0

@kouvel kouvel requested a review from VSadov November 16, 2023 20:23
@MichalStrehovsky
Copy link
Member

With this and #94241, what is our confidence that Lock can still safely be used within the implementation of the native AOT runtime? Do we need to consider migrating these low level usages to LowLevelLock?

In #94515 I used LowLevelLock not because I hit some issue, but I just couldn't reason if it's safe to use Lock in the implementation of the C# typeof (or object.GetType).

@VSadov
Copy link
Member

VSadov commented Nov 20, 2023

With this and #94241, what is our confidence that Lock can still safely be used within the implementation of the native AOT runtime? Do we need to consider migrating these low level usages to LowLevelLock?

When we were stabilizing the old Lock the same question was asked. I think the same reasoning still applies -
NativeAOT runtime is written in C#, uses off-the shelf components and calls everchanging set of various things. It would be impractical to require that nothing of that uses locks or must use a special kind of lock. On the other hand, the Lock is a very simple mechanism, with a simple purpose and calls a very limited set of APIs. It is definitely possible to ensure that those APIs do not use Lock and thus guarantee that there is no reentrancy.

Old lock did not have this problem, ... eventually. The implementation in #93879 does not seem to be affected. This is a fixable issue.

In #94515 I used LowLevelLock not because I hit some issue, but I just couldn't reason if it's safe to use Lock in the implementation of the C# typeof (or object.GetType).

The only legit use of LowLevelLock in NativeAOT is in implementation of Wait subsystem. Since Lock uses that, we can't use Lock there.
In most other places, including typeof, it should be fine to use Lock.

@VSadov
Copy link
Member

VSadov commented Nov 20, 2023

@kouvel - is there a good understanding of the repro? It worries me a bit that the bug reproduces exclusively on linux-arm64. That was also the case with the previous bug.
Is linux-arm64 the only configuration that enables logging, by any chance?

Relying on linux-arm64 for stress might not be a good place to be. If there was a way to force this to repro on other platforms, especially on win-x64, it would be very useful. I wonder what makes linux-arm64 special here?

@MichalStrehovsky
Copy link
Member

The only legit use of LowLevelLock in NativeAOT is in implementation of Wait subsystem. Since Lock uses that, we can't use Lock there.
In most other places, including typeof, it should be fine to use Lock.

Thanks, I sent out a PR to use that instead: #94991.

Is linux-arm64 the only configuration that enables logging, by any chance?

It shouldn't be linux-arm64 specific, but we definitely enable EventSource for fewer tests. EventSource is disabled by default in the shipping configuration (some workloads like ASP.NET will enable it back). The tests mirror that. Only tests that test event source enable it back (search for <EventSourceSupport>true</EventSourceSupport> to see the tests that do).

@kouvel
Copy link
Member Author

kouvel commented Nov 20, 2023

@kouvel - is there a good understanding of the repro? It worries me a bit that the bug reproduces exclusively on linux-arm64. That was also the case with the previous bug.

It's a timing issue that can lead to the stack trace in #94728 or something like it. It's not specific to a platform or architecture, probably not easy to repro without inducing timings.

With this and #94241, what is our confidence that Lock can still safely be used within the implementation of the native AOT runtime?

There appears to be a circular dependency where Lock does some lazy initialization that involves class construction, which in turn uses Lock for synchronization. That makes the system a bit brittle to changes. For instance, new issues were uncovered when raising events through NativeRuntimeEventSource. Maybe one way to break that circular dependency would be for runtime uses of Lock to pass in a boolean that bypasses lazy initialization and uses defaults, along with not raising events. I'll see if we see more of this type of issue or cases that are not easily fixable from the Lock side, maybe something like that can be done.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

RuntimeEventSource is eagerly initialized on module load because it has [ModuleInitializer]. That indirectly initializes NativeRuntimeEventSource. By the time we start running some meaningful code, NativeRuntimeEventSource should be initialized.

Somehow on arm64 we start seeing contentions from sockets before module initialization is complete. Not sure if that was expected by whoever have put [ModuleInitializer] there.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

I think the most reliable fix for this would be to stop calling static initialization from the Lock entirely, and make something else do the initialization instead.
For that we need:

  1. make static initialization optional for the Lock functionality. It may result in losing a few events when timing is tight, but it does not seem to be a big deal or avoidable.
    (currently suggested change is half way there already - it allows losing events while the Log is not ready)
  2. make something else to perform initialization of statics. There are a few options here:
    -- Use [ModuleInitializer]
    -- Use finalizer of some dummy object
    like in https://github.com/dotnet/runtime/blob/1c4bd555ecca1ed49cd7a259f23c35e6d83665ee/src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Lock.NativeAot.cs#L130-L146

In either case there is a chance to have a small window of operating with uninitialized statics, but that is ok as long as the initialization is optional.

The advantage of [ModuleInitializer] is being more explicit and upfront, but there could be concern that module init would run at program startup and the NativeRuntimeEventSource initialization could be a bit heavy (there appears to be some reflection use, type loading, attribute parsing,..).
I am not sure about the cost of reading/parsing the config settings.

The Finalizer approach allows to wait until we are sure that we have a program that may have contentions to report.

@MichalStrehovsky
Copy link
Member

Somehow on arm64 we start seeing contentions from sockets before module initialization is complete. Not sure if that was expected by whoever have put [ModuleInitializer] there.

Maybe an ordering issue with this:

[ModuleInitializer]
internal static void InitializeQuic()
{
// This will load Quic (if supported) to avoid interference with RemoteExecutor
// See https://github.com/dotnet/runtime/pull/75424 for more details
// IsSupported currently does not unload lttng. If it does in the future,
// we may need to call some real Quic API here to get everything loaded properly
_ = OperatingSystem.IsLinux() && QuicConnection.IsSupported;
}

The code looks to be Linux-only so that would explain we don't see it on Windows. Cc @LakshanF. IIRC RuntimeEventSource initialization it was made a ModuleInitializer because running it as part of CoreLib initialization was too early and we don't have any other initialization phase before Main. Maybe we need one.

@jkotas
Copy link
Member

jkotas commented Nov 21, 2023

We do topological sort of ModuleInitializers. CoreLib module initializer should run first and the Quic module initializer should run second. Both CoreLib module initializer and the Quic module initializer start new threads. The problem seems to be triggered when the background work done by the Quic module initializer happens faster than the background work done by the Quic module initializer.

I think the root cause of the problem is use of managed Lock by the native AOT ClassConstructorRunner. The equivalent code in regular CoreCLR uses low-level lock and it is why it avoids the dependency cycles. Should we fix this issue by switching ClassConstructorRunner to use the low-level lock?

@jkotas
Copy link
Member

jkotas commented Nov 21, 2023

Second related problem is that the managed locks use high-level wait that can be overridden using synchronization context. Try running this:

// #define NEWLOCK

using System;
using System.Threading;

#if NEWLOCK
Lock mylock = new Lock();
#else
object mylock = new object();
#endif

#if NEWLOCK
using (mylock.EnterScope())
#else
lock (mylock)
#endif
{
    Thread t = new Thread(Work);
    t.Start();
    t.Join();
}

void Work()
{
    SynchronizationContext.SetSynchronizationContext(new MySynchronizationContext());

#if NEWLOCK
    using (mylock.EnterScope())
#else
    lock (mylock)
#endif
    {
        Console.WriteLine("Hello world");
    }
}

class MySynchronizationContext : SynchronizationContext
{
    public MySynchronizationContext()
    {
        SetWaitNotificationRequired();
    }

    public override int Wait(IntPtr[] waitHandles, bool waitAll, int millisecondsTimeout)
    {
        Console.WriteLine("MySynchronizationContext.Wait called at");
        Console.WriteLine(Environment.StackTrace);
        return base.Wait(waitHandles, waitAll, millisecondsTimeout);
    }
}

It makes the managed locks fairly high-level service that is not appropriate for low-level services like class constructor runner. Using the managed locks in low-level services like class constructor runner or frozen object allocator is a farm for re-entrancy crashes.

I think the rule of thumb should be both regular CoreCLR and native AOT should use same type of lock in equivalent services. If CoreCLR uses low-level lock for something, native AOT should do the same.

@kouvel
Copy link
Member Author

kouvel commented Nov 21, 2023

Second related problem is that the managed locks use high-level wait that can be overridden using synchronization context.

Given that it wouldn't be feasible to use Lock in low-level cases, Lock could be simplified a bit by removing the lazy initialization and maybe other things.

Should I make the relevant changes to ClassConstructorRunner also?

@jkotas
Copy link
Member

jkotas commented Nov 21, 2023

Should I make the relevant changes to ClassConstructorRunner

I think so. And delete/simplify code that has been working around the Lock use in ClassConstructorRunner. Thank you!

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

We do topological sort of ModuleInitializers. CoreLib module initializer should run first and the Quic module initializer should run second. Both CoreLib module initializer and the Quic module initializer start new threads. The problem seems to be triggered when the background work done by the Quic module initializer happens faster than the background work done by the Quic module initializer.

Is that a problem or something expected?
If they both start threads and complete in random order, why bother ordering them? Is it the order of their start or completion that matters?
(I do not have a lot of opinion on this, but it kind of still feel suspicious that it affects only linux-arm64)

@jkotas
Copy link
Member

jkotas commented Nov 21, 2023

The problem seems to be triggered when the background work done by the Quic module initializer happens faster than the background work done by the Quic module initializer.

Is that a problem or something expected?

It is expected.

The problem is that two threads running in parallel cause contention in constructor runner that leads to the crash.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

Second related problem is that the managed locks use high-level wait that can be overridden using synchronization context. Try running this:

It is a bit surprising why we decided to do this to locks in the first place (perhaps something with COM/STA as usual), and keep maintaining. In normal use Lock has 2 levels of optimizations to avoid Waiting, and lock{} has 3 levels. Outside of deadlocks the feature would be very nondeterministic.
It might be late to argue that this feature is as useful as Thread.Interrupt and would not be very missed in the same way.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

It makes the managed locks fairly high-level service that is not appropriate for low-level services like class constructor runner.

The argument seems equally applicable to any feature "X". - If I use feature X in SynchronizationContext.Wait, then I may manufacture recursions when X uses locks.

It makes the managed locks fairly high-level service that is not appropriate for low-level services like class constructor runner. Using the managed locks in low-level services like class constructor runner or frozen object allocator is a farm for re-entrancy crashes.

How far will this go? What about type unifiers, sync table, conditional weak table, weak reference implementation, COM stuff, other places that use locks? Are they all broken because someone might do weird things with SynchronizationContext?

I think it is doable and is a lot cheaper to just keep lock{}/Lock compatible with the runtime, except for a set of functionality that Lock itself uses (and that set can be very small).
That would make for a simpler world where every feature (runtime or not) could be concerned with its own needs - needs locking, use lock.

@jkotas
Copy link
Member

jkotas commented Nov 21, 2023

How far will this go?

Yes, it needs to stop somewhere. The split between ordinary locks and managed locks is very arbitrary today. My point is that the best way to fix reentrancy issues specific to native AOT is by using same type of lock between runtimes in the current state of the world.

I think it is doable and is a lot cheaper to just keep lock{}/Lock compatible with the runtime except for a set of functionality that Lock itself uses (and that set can be very small).

I would be open to the idea of making the new Lock use primitive waits. It would require investigation of what SynchronizationContext overrides are used for today and whether they would get broken by it.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

I would be open to the idea of making the new Lock use primitive waits. It would require investigation of what SynchronizationContext overrides are used for today and whether they would get broken by it.

Right. I was going to literally suggest "I'd go as far as use uninterruptible Wait in Lock", but decided to search for anything that could be possibly broken. (other than tests).

So far I found nothing interesting. SynchronizationContext is used, but mostly for Post/Send/Complete purposes.

Noone seems to be overriding Wait in any nontrivial way. I found only 2 cases both just forward to nonalertable WaitForMultipleObjectsEx, ironically - to suppress the message pumping behavior of standard wait.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

Yes, it needs to stop somewhere. The split between ordinary locks and managed locks is very arbitrary today. My point is that the best way to fix reentrancy issues specific to native AOT is by using same type of lock between runtimes in the current state of the world.

I think the root cause for reentrancy is that Lock initializes EventSource as a part of its regular operations and that runs arbitrary code and that causes reentrancies. If static initialization happens on a side - in a module initializer or in a finalizer, the problem will disappear.

I do not think that CoreCLR is a good guidance on what lock to use. The reasons in CoreCLR could be different, often just because of different native/managed split, dances around COOP/PREEMPT or because of some appdomain/sql stuff that may not be applicable anymore.
In NativeAOT most locks can be the same lock as regular code would use. I did not see a convincing reason other than the regression above, and that has a simpler solution than making specialized locks.

The static initialization is actually a good example.

LowLevelLock lacks functionality for that. It also has fairly poor performance. That was never an issue as LowLevelLock was basically a "backstop" lock for cases that happen rarely or nearly always resulted in a blocking wait. Perf improvements would not "move the needle".
For static initializers we would need a recursive lock that also supports timeouts and preferably has the best performance that we can have. We may end up with basically a copy of Lock, just one that uses uninterruptible Wait and does not do EventSource diagnostics.

It would be a roundabout solution to something that could be solved inside the Lock itself.

Alternatively, Lock could have an internal ctor/flag to not initialize EventSource, but I think plugging into the diagnostics would be nice even for system locks.

@jkotas
Copy link
Member

jkotas commented Nov 21, 2023

For static initializers we would need a recursive lock that also supports timeouts

I do not think we need that for static initializers. The static initializers are scheduled by "dead-lock aware lock". The regular (low-level) lock is needed to protect the dead-lock aware lock state. There should not be any recursion. The low-level lock for this purpose does not need to be fine-tuned, given other overheads involved in running the static constructors.

We have the low-level lock exposed by the System.Native PAL. I think it would be hard to get rid of all uses of it. As long as we have the low-level lock exposed by the System.Native PAL, it should be fine to use it in S.P.CoreLib as needed.

@VSadov
Copy link
Member

VSadov commented Nov 21, 2023

There should not be any recursion. The low-level lock for this purpose does not need to be fine-tuned, given other overheads involved in running the static constructors.

Given the fact that we run into contentions, and often enough to bring up the reentrancy issue, indicates that this lock can be a hot spot and thus a potential bottleneck if lock does not handle contentions well.

As long as we have the low-level lock exposed by the System.Native PAL, it should be fine to use it in S.P.CoreLib as needed.

It is ok to use it. It works. It is just generally there are no reasons to use it vs. regular lock{}/Lock.
Perhaps people get some sense of extra safety - "oh oh, this looks like low level code, I probably should use LowLevelLock".

There is an issue of "what if someone does things with SynchronizationContext.Wait", but to shield from that most locks would need to be LowLevelLock. That is probably why noone "does things with SynchronizationContext.Wait" as that would be dangerous.

@kouvel
Copy link
Member Author

kouvel commented Nov 21, 2023

What about locks taken on UI threads? I imagine they would still need to use a message-pumping wait for correctness and that could run user code and potentially lead to reentrancy. Should locks used in the runtime (in low-level situations) also use message-pumping waits?

@kouvel
Copy link
Member Author

kouvel commented Jan 4, 2024

@kouvel Do you have a plan in mind for how to proceed on this one?

Yes. I'm more in favor of reusing Lock for this purpose with a wait that does not allow reentrance. LowLevelLock (or some variant of it) would certainly be an option, but it wasn't meant for this kind of purpose in its original or current state (which are the same). There might also be more uses for non-reentrant locks in other cases. LowLevelLock was created for specific purposes that are likely still valid. I'd like to see LowLevelLock go away eventually though, ideally Lock should be able to handle those cases just as well.

@MichalStrehovsky
Copy link
Member

MichalStrehovsky commented Jan 16, 2024

@kouvel do you have any timelines for this? It is a heavy hitter in the CI (I personally see this failing every day and CI stats say we hit it 15 times a day on average) and we're about to snap .NET 9 Preview 1.

…sive accesses during its own class construction

- `NativeRuntimeEventSource.Log` can be null if it's being constructed in the same thread earlier in the stack, added null checks

Fixes dotnet#94728
@kouvel
Copy link
Member Author

kouvel commented Jan 16, 2024

@kouvel do you have any timelines for this? It is a heavy hitter in the CI (I personally see this failing every day and CI stats say we hit it 15 times a day on average) and we're about to snap .NET 9 Preview 1.

The proposed change in this PR would I believe fix the CI issue. The discussion in this PR has evolved into resolving a larger issue, which I'm currently looking into, hope to have something out soon. After thinking about it more, I may continue to employ the same technique for the CI issue, as using a finalizer doesn't guarantee timely initialization and I'd prefer to avoid going down the wait path before initialization is complete to avoid other potential reentrancy issues (and to have a way to fix a reentrancy issue on that path if one is introduced), though I'm still evaluating it.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@jkotas jkotas closed this Jan 17, 2024
@jkotas jkotas reopened this Jan 17, 2024
@jkotas
Copy link
Member

jkotas commented Jan 17, 2024

resolving a larger issue, which I'm currently looking into, hope to have something out soon.

Could you please open an issue for this if there is not one already?

@kouvel
Copy link
Member Author

kouvel commented Jan 17, 2024

Rebased, and added a commit with one more null check in the wait path used by Lock. Opened #97105 to track making some of the waits unalertable.

@kouvel
Copy link
Member Author

kouvel commented Jan 18, 2024

Looks like all remaining CI failures are known issues.

@kouvel kouvel merged commit ee9426a into dotnet:main Jan 18, 2024
176 of 178 checks passed
@kouvel kouvel deleted the LockFix branch January 18, 2024 03:09
kouvel added a commit to kouvel/runtime that referenced this pull request Jan 19, 2024
- Added an internal constructor that enables the lock to use non-alertable waits
- Non-alertable waits are not forwarded to `SynchronizationContext` wait overrides, are non-message-pumping waits, and are not interruptible
- Updated most of the uses of `Lock` in NativeAOT to use non-alertable waits
- Also simplified the fix in dotnet#94873 to avoid having to do the relevant null checks in various places along the wait path, by limiting the scope of the null checks to the initialization phase

Fixes dotnet#97105
tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
…nt for recursive accesses during its own class construction (dotnet#94873)

* Fix Lock's usage of NativeRuntimeEventSource.Log to account for recursive accesses during its own class construction

- `NativeRuntimeEventSource.Log` can be null if it's being constructed in the same thread earlier in the stack, added null checks

Fixes dotnet#94728
kouvel added a commit that referenced this pull request Jan 24, 2024
…7227)

* Add an internal mode to `Lock` to have it use trivial (non-alertable) waits

- Added an internal constructor that enables the lock to use trivial waits
- Trivial waits are non-alertable waits that are not forwarded to `SynchronizationContext` wait overrides, are non-message-pumping waits, and are not interruptible
- Updated most of the uses of `Lock` in NativeAOT to use trivial waits
- Also simplified the fix in #94873 to avoid having to do the relevant null checks in various places along the wait path, by limiting the scope of the null checks to the initialization phase

Fixes #97105
@lewing
Copy link
Member

lewing commented Jan 25, 2024

Rebased, and added a commit with one more null check in the wait path used by Lock. Opened #97105 to track making some of the waits unalertable.

Unfortunately, the Build analysis can't match the logs once azdo cancels the build. We're tracking that issue in #97044 but it won't get automatically flagged

@github-actions github-actions bot locked and limited conversation to collaborators Feb 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure - System.NullReferenceException in System.Threading.Lock.TryInitializeStatics
7 participants