Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TaskRegistry CPUWatson issues due to false optimistic concurrency #8861

Closed

Conversation

JanKrivanek
Copy link
Member

Fixes ADO#1801351 and ADO#1801341

Context

Various TaskRegistry methods were reported to have infinite loops caused by corrupted dictionary state.

Changes Made

Replacing the flagged Dictionaries and as well other ones on similar execution paths with ConcurrentDictionaries, plus where appropriate guarded access to inner structures as well (where reads are expected only after all writes - simple lock was added)

rainersigwald
rainersigwald previously approved these changes Jun 8, 2023
src/Build/Instance/TaskRegistry.cs Outdated Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Show resolved Hide resolved
@rainersigwald rainersigwald dismissed their stale review June 8, 2023 19:37

It looked good but didn't pass tests :)

@JanKrivanek
Copy link
Member Author

Aaah - there is a hidden trap - we're relying on implicit ordering of the data in the Dictionary fields (e.g. here https://github.com/dotnet/msbuild/pull/8861/files#diff-9233ff64c785ef48b792931b3c5f0d709e741671c2c0bb2b7afb6c09aaaaeb81R626 ordering influences which task will be returned), which by itself feel very wrong, but it can (and does) break with replacing the backing field with ConcurrentDictionary, which has different internal ordering.

The proper solution is to order the enumerations explicitly (probably by the order of registrations) - however since the previous implementation was lacking the ordering, this can lead to altered behavior perceived as regressions.

@rainersigwald - thoughts on this? Can we introdce explicit ordering by order of registrations?

@JanKrivanek
Copy link
Member Author

JanKrivanek commented Jun 14, 2023

Aaah - there is a hidden trap - we're relying on implicit ordering of the data in the Dictionary fields (e.g. here https://github.com/dotnet/msbuild/pull/8861/files#diff-9233ff64c785ef48b792931b3c5f0d709e741671c2c0bb2b7afb6c09aaaaeb81R626 ordering influences which task will be returned), which by itself feel very wrong, but it can (and does) break with replacing the backing field with ConcurrentDictionary, which has different internal ordering.

The proper solution is to order the enumerations explicitly (probably by the order of registrations) - however since the previous implementation was lacking the ordering, this can lead to altered behavior perceived as regressions.

@rainersigwald Rainer Sigwald FTE - thoughts on this? Can we introdce explicit ordering by order of registrations?

Explicitly addressed here: https://github.com/dotnet/msbuild/pull/8861/files#diff-9233ff64c785ef48b792931b3c5f0d709e741671c2c0bb2b7afb6c09aaaaeb81R593 and here: https://github.com/dotnet/msbuild/pull/8861/files#diff-9233ff64c785ef48b792931b3c5f0d709e741671c2c0bb2b7afb6c09aaaaeb81R604

While this is very likely intended behavior (LIFO of the multiple matching tasks) - it may cause unexpected results in some cases - e.g.: this https://github.com/dotnet/msbuild/pull/8861/files#diff-0c7ff4eddab39e683e61e6f11011eac73dae76d2574999184b3c0e74f9c2fa10R44 was needed to fix the failing test (as wrong GenerateResource task was selected)

Although not sure about this comment: https://github.com/dotnet/msbuild/blob/main/src/Build.UnitTests/BackEnd/TaskRegistry_Tests.cs#LL836C33-L836C52

Do we want the conflicting UsingTasks to be LIFO or FIFO? I believe it's the former..

@JanKrivanek JanKrivanek force-pushed the perfbug/taskregistry-concurrent branch from c2b1734 to 3d36b54 Compare June 15, 2023 08:41
@ladipro
Copy link
Member

ladipro commented Jun 15, 2023

I was just writing a comment in a file and it disappeared from the diff. I'll paste it here because it may still be relevant.

I was specifically looking at the way GenerateResource tasks are registered.

So multiple tasks are registered under the same name, but with different runtimes. You're saying that previously we relied on implicit ordering and it worked "by accident". How does one select the task to run in this scenario? Is there a way to say "I want to invoke the CLR2 version of this task"?

  • If yes, then all call sites should do it (there's likely a bug somewhere else).
  • If not, then is it useful at all to have multiple UsingTasks with the same TaskName that would have their conditions evaluate to true? When we know we always want the last one to win anyways?

I'm still flabbergasted by this, apologies for the stupid questions.

@ladipro
Copy link
Member

ladipro commented Jun 15, 2023

@JanKrivanek and I have synced offline and here's what we found:

  • There is a way to invoke a task with a specific runtime and/or architecture using the MSBuildRuntime and MSBuildArchitecture attributes.
  • The GenerateResource task in common targets is indeed invoked with these attributes specified and the issue with undefined behavior of TaskRegistry, related to non-deterministic order of dictionary enumeration, may not be a problem in practice. It was uncovered with a unit test.
  • Arguably it is still better to avoid this undefined behavior and a introduce a well-defined task-matching order. (My opinion: if someone relies on the current behavior, is is OK to break them.)
  • Since .NET Core does not support cross-runtime or cross-architecture task execution, it is very likely that TaskRegistry could be made much simpler in the .NET build of MSBuild. This would be at the expense of ifdef's or code duplication, and as-of-now-unmeasured perf benefits. It is better left as a future exercise, out of scope of this PR.

src/Build/Instance/TaskRegistry.cs Outdated Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Outdated Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Show resolved Hide resolved
src/Build/Instance/TaskRegistry.cs Show resolved Hide resolved
@JanKrivanek JanKrivanek force-pushed the perfbug/taskregistry-concurrent branch from d442f94 to e12bec2 Compare June 16, 2023 11:44
@JanKrivanek JanKrivanek requested a review from ladipro June 16, 2023 11:59
Copy link
Member

@rokonec rokonec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM.

However we are fixing symptoms and not the root cause.
TaskRegistry was never supposed to be thread safety and whole object graph which is referencing it is not supposed to be thread safety as well.

Something is violating our non-thread-safe usage requirements. This shall not be fixed, IMO, by making some subset of data structures thread safe, as it has performance and complexity consequences, but by fixing concurrency usage of our non-thread-safe data structures. That being said, it is hard to guess if guilty code is in MSBuild or VS.

I have made experimental code changes to detect wrong concurrency usages, but I was not able to repro it locally in patched VS INT for Orchard SLN open and build scenario.

We can still decide to fix it this way, but this will hide other possible concurrency bugs caused by same root cause.

I am blocking this PR so we, core msbuild team @rainersigwald @ladipro @JanKrivanek @AR-May, can decide how to approach it.

/// <summary>
/// Cache of tasks already found using exact matching,
/// keyed by the task identity requested.
/// </summary>
private Dictionary<RegisteredTaskIdentity, RegisteredTaskRecord> _cachedTaskRecordsWithExactMatch;
private readonly Lazy<ConcurrentDictionary<RegisteredTaskIdentity, RegisteredTaskRecord>> _cachedTaskRecordsWithExactMatch = new(() => new(RegisteredTaskIdentity.RegisteredTaskIdentityComparer.Exact));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on discusion with @rokonec - those fields might probably be fine to just initialize greedily (low chance of them staying noninitialized after usage of TaskRegistry)

@JanKrivanek
Copy link
Member Author

Changes LGTM.

However we are fixing symptoms and not the root cause. TaskRegistry was never supposed to be thread safety and whole object graph which is referencing it is not supposed to be thread safety as well.

Something is violating our non-thread-safe usage requirements. This shall not be fixed, IMO, by making some subset of data structures thread safe, as it has performance and complexity consequences, but by fixing concurrency usage of our non-thread-safe data structures. That being said, it is hard to guess if guilty code is in MSBuild or VS.

I have made experimental code changes to detect wrong concurrency usages, but I was not able to repro it locally in patched VS INT for Orchard SLN open and build scenario.

We can still decide to fix it this way, but this will hide other possible concurrency bugs caused by same root cause.

I am blocking this PR so we, core msbuild team @rainersigwald @ladipro @JanKrivanek @AR-May, can decide how to approach it.

tl;dr: There is a possibility of race between public API build and evaluation calls. And I agree there might be other ways to fix this.


More details: The TaskRegistry is shared via the Data type which is member of ProjectImpl - which is part of Project. The Data structure can be shared between the evaluation and build (sample usage during build: https://github.com/dotnet/msbuild/blob/main/src/Build/Definition/Project.cs#L3341).

Build is more-or-less guarded not to be concurrent (StartBuilderThread and DedicatedThreadsTaskScheduler are both subject to races - but it's very edgy, so can be considered fine). However evaluation is not being guarded via same exclusion as build requests. Both are exposed via public API in Project.
So we either would need to make Evaluation and Build fully exclusive or make the Data class not to be shared between the two, or make the members of Data class thread safe. First two seem possibly too impactful. Last one can be done by steps (individual members of Data class) - this PR is step for TaskRegistry

@rokonec
Copy link
Member

rokonec commented Jun 22, 2023

Each evaluation creates new TasklRegistry for purpose of that evaluation. So evaluations can't clash with build.

I have found two suspicious places which are indeed wrong, and might be the root cause of observed issues:

this.Toolset = data.Toolset; // UNDONE: This isn't immutable, should be cloned or made immutable; it currently has a pointer to project collection

this.TaskRegistry =
that.TaskRegistry; // UNDONE: This isn't immutable, should be cloned or made immutable; it currently has a pointer to project collection

This needs to be fixed. We have to somehow create deep clone of TaskRegistry. We can't share task registry between clones of ProjectInstance as source ProjectInstance might be affected by cloned ProjectInstance TaskRegistry mutation or data corruption.

@JanKrivanek
Copy link
Member Author

Discussed this offline with @rokonec - a new PR will be created (we need better isolation guarantees way up the stack in ProjectInstance)

@JanKrivanek
Copy link
Member Author

superseded by: #8973

@JanKrivanek JanKrivanek deleted the perfbug/taskregistry-concurrent branch July 14, 2023 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants