Fix named mutex workaround for .NET Mono runtime #57003

uweigand · 2021-10-07T10:56:42Z

Enable named mutex workaround when running on .NET using Mono runtime
Re-implement file locking logic to ensure intra- and inter-process
mutual exclusion, and no longer leak lock files
Use mutex abstraction throughout test suite
Fix async race in ServerFailsWithLongTempPathUnix test case
Fixes Named mutex bug on .NET using Mono runtime causing compile server OOM #57002

tmds · 2021-10-07T15:35:30Z

src/Compilers/Shared/BuildServerConnection.cs

-                    {
-                        mutex?.Dispose();
-                    }
+                    return ServerFileMutexPair.WasOpen(mutexName);


This used to lock the -alive file and if we were unable to lock it, we'd assume the server was holding a lock.
Now it checks the file exists. So if the server has died before it deleted the file, this will no longer return false.

You're right, this seems to have been an incorrect optimization. But I think we can just remove it and go back to trying to the lock the mutex as before. I'll verify.

Ah, turns out this wasn't just an optimization. The problem with the current implementation is that it actually takes the lock -even if just for a short period of time-. If you have the client spin in a loop on WasServerMutexOpen to wait until the server has started, this may actually prevent the server from starting, because it may just hit one of those short periods where the client has taken the lock, and then think another server is already running.

The named mutex implementation doesn't have that problem since you can use TryOpenExisting without actually taking the lock.

That was the actual reason why I wanted to change the Mono fallback of WasOpen to use something that doesn't actually take the lock. If we cannot use file existence as that check, I'll need to think about this a bit more ...

This WasOpen is also subject to races, even with named mutexes. By the time a connection is attempted WasOpen may have changed from true to false.

I think it would be better to remove WasOpen and perform an actual connect.
If it fails with ECONNREFUSED/ENOENT the server is not running, and a new server must be started.
Within some timeout (TimeOutMsNewProcess) we should try to connect to the new server (and allow getting some ECONNREFUSED/ENOENT until it has started).

Well, yes, the WasServerMutexOpen routine is subject to certain races by design, but it looks like the callers expect and handle those. While I guess this could all be designed differently, at this point I'm really only looking to make the existing callers work with the file-based implementation just as they do with the mutex-based implementation. This requires to avoid introducing those other types of races which the callers do not currently expect.

src/Compilers/Core/Portable/InternalUtilities/PlatformInformation.cs

jaredpar · 2021-10-07T15:36:16Z

src/Compilers/Server/VBCSCompilerTests/CompilerServerTests.cs

            await ApplyEnvironmentVariables(
                new[] { new KeyValuePair<string, string>("TMPDIR", newTempDir.Path) },
-                async () =>
+                async () => await Task.Run(async () =>


Can you elaborate on why this change is needed here?

I described the background in the linked issue. But in brief, ApplyEnvironmentVariables sets TMPDIR, runs the lambda, and then resets TMPDIR to its prior value. But it doesn't do any await on the function, and it seems that this means there's a race where sometimes TMPDIR is reset in the middle of executing the lambda - I've seen cases where the server was started with the new TMPDIR and then the client ran with the old TMPDIR. Since in the file lock model TMPDIR becomes part of the lock file name, server and client were then using two different lock files, causing failures.

jaredpar · 2021-10-07T15:39:19Z

src/Compilers/Shared/BuildServerConnection.cs

-            //  'delete if no one has the file open' operation atomically on *nix. This is a leak.
+            if (IsLocked)
+            {
+                File.Delete(FilePath);


Consider the case where the compiler server process is killed. That will leave the "lock" file on disk. Hence all future invocations of the server will believe there is an active lock which there is in fact not. How does the code account for this case?

So this should actually work correctly with the TryLock, because it tries to open the file using FileShare.None. This will fail if the server that has the file open is still running, but it will succeed if the server has crashed even if the file is still there (terminating the process holding the lock frees it).

As Tom points out above, my new WasOpen implementation is indeed incorrect in that case, so I'll go back to just attempting a TryLock there as it used to.

tmds · 2021-10-07T15:52:54Z

src/Compilers/Shared/BuildServerConnection.cs

-            //  'delete if no one has the file open' operation atomically on *nix. This is a leak.
+            if (IsLocked)
+            {
+                File.Delete(FilePath);


This is subject to a subtle race.

Another thread may open the file before we delete it and successfully lock it after we've disposed the FileStream.
So that thread assumes it holds the mutex.

Now, if another threads tries to own the mutex, it will also succeed because it will create a new file.

NuGet suffers from the same issue and that is why it doesn't clean up its lock files in /tmp.

.NET 7 will handle this race when combining FileMode.OpenOrCreate, FileShare.None, FileOptions.DeleteOnClose (dotnet/runtime#55327).

I see, that's unfortunate. I guess we'll have to go back to leaking files. Unlocking would then simply dispose the FileStream, which releases the lock.

uweigand · 2021-10-08T12:20:09Z

Given the various issues pointed out, I've gone back and redesigned the whole file locking approach. This new attempt should address everything raised earlier.

The basic idea of the new approach is to use two files in a slightly different manner: a "lock" file and a "guard" file.

The lock file represents the status of the lock, it is present and opened exclusively (FileShare.None) when the lock is held, it is
deleted when the lock is not held, and it may be present but not locked exclusively if the lock is abandoned (process holding it died).

The guard file is used only internally to the lock algorithm, and only ever held for very brief periods of time. Its purpose is to
protect modifications to the status of the lock file where those cannot be otherwise made atomically.

This allows in particular for the core of the WasServerMutexOpen logic to be implemented by atomically (trying to) lock the lock
file and then unlock it again, all while holding the guard file. This prevents any other process from seeing a spurious "locked"
status at some instance in between.

The other benefit of the guard file is that allows the lock file to be deleted on unlock while avoiding the FileShare.None creation
race pointed out by Tom. Because we create the lock file only while holding the guard, and we delete the lock file only while holding an exclusive lock to it, at the end of a successful FileStream constructor with FileShare.None, we can only be in one of these two stable situations: either the file still exists on disk and we hold an exclusive lock to it, or else the file was deleted and does not exist on disk. We can use a simple File.Exists check to safely distinguish the two cases and close the race.

This prevents leaking of lock files. We still cannot delete the guard file due to the same race. But because the guard file is only held internally, and for bounded (short) periods of time, we can actually use a single guard file for all locks. So we leak only a single file, which I'd consider acceptable (in particular given we already leak the directory).

As a comment on the implementation: since the guard and lock files are now used somewhat asymmetrically, I've given up the abstraction of a ServerFileMutexPair holding two FileMutex instances, but instead just have a single ServerFileMutex that manages both guard and lock file internally.

The new implementation still fixes the original problem I saw during source-build, and still reliably passes the full VBCSCompiler.UnitTests suite.

tmds · 2021-10-08T12:53:21Z

src/Compilers/Shared/BuildServerConnection.cs

+                // with file deletion because we hold the lock.)
+                if (!File.Exists(FilePath))
+                {
+                    stream.Dispose();


The file was deleted because someone gave up their lock. We should try the open again here, and we're expected to be able to lock the file.

Yes, the caller (TryLock) will retry.

I see. The lock could be obtained, but you wait for another attempt to actually obtain it.
I'd take it immediately.

Maybe add a comment about it in the !Exists path.

I mean, I'm handling it the same as any other failure to acquire the lock (IOException) - both return false to the caller, which then may (or may not, depending on which caller it is) decide to retry. This is only a rare corner case, not sure it makes sense to add extra program logic to handle this differently. I can add another comment, of course.

IOException means the file cannot be locked. !Exists means the lock should be obtainable. I was surprised it causes TryLock to return false. A comment would definitely be helpful.

I guess this may just be semantics, but to me they both mean basically the same thing: "in the recent past, someone else owned the lock, and therefore we failed to acquire it on this try". Anyway, I'll add the comment.

src/Compilers/Shared/BuildServerConnection.cs

jaredpar · 2021-10-08T18:04:38Z

src/Compilers/Shared/BuildServerConnection.cs

+                try
+                {
+                    // Attempt to acquire lock while holding guard.
+                    using var guard = LockGuard();


Think we should be passing down and respecting the timeoutMs parameter in the LockGuard method. There are cases where that method could infinitely loop while this method is meant to have an explicit timeout.

The easiest, but admittedly, pathological case where this could happen is just a user opening guard for edit on the machine with FileShare.None outside this process.

I can see why we would want to avoid infinite loops. However, I think using the original timeout may be the wrong approach, in particular if the timeout is zero. Any failure to acquire the guard is likely spurious (in particular as the same guard file is shared across multiple locks), so we should not give up on the first try. What about just limiting the number of retries in LockGuard by some fixed constant (e.g. 100) instead?

That sounds reasonable to me.

src/Compilers/Shared/BuildServerConnection.cs

jaredpar · 2021-10-08T18:48:37Z

src/Compilers/Shared/BuildServerConnection.cs

+                    return false;
+                }
+            }
+            catch (IOException)


This should just catch Exception. There are many exceptions that IO can throw that don't derive from IOException and those would cause the server to fail to start up entirely with the current catch structure. As your discussion above indicated the method isn't concerned with "why" this couldn't be acquired just if it could or could not be acquired.

This was intended to match behavior of the current code. My understanding is that we want to distiguish between failures where it makes sense to retry (i.e. someone else owns the lock and we wait until they give it up), and failures that are permanent so it makes no sense to retry (e.g. the /tmp directory is not writable in a container).

Are there any failure classes of the first type that are not handled by the IOException check?

Code using FileShare.None for locking in NuGet.Client also handles UnauthorizedAccessException: https://github.com/tmds/NuGet.Client/blob/0645de18160e94b80401d449d99a36ced8ece5a4/src/NuGet.Core/NuGet.Common/ConcurrencyUtilities.cs#L48-L69. I think this could be handling some Windows specific behavior.

The constructor is documented as throwing a variety of exceptions including IOException, SecurityException, ArgumentException (less interesting), and UnauthorizedAccessException.

In general though predicting the set of exceptions a method in .NET can throw often not worthwhile. It's easy to miss them (both in docs and in reading through code). What I do is often ask myself the following question:

Imagine I get a bug report in a few months that said "turns out this API also throws WhackyException".

If my response to that is to change the code to catch WhackyException then it's better to just have the code catch Exception right off the bat.

For IO in particular it's generally less interesting why it failed as it is to know whether or not it failed. That is why we generally prefer Exception here.

OK, now updated to catch all exceptions here.

jaredpar · 2021-10-11T17:26:24Z

src/Compilers/Shared/BuildServerConnection.cs

+                return;
+            IsDisposed = true;
+            if (Stream is not null)
+                UnlockFile();


The UnlockFile method is doing IO that could potential throw. Need to catch that here so that exceptions don't leak out from Dispose methods.

Ah, right. Added a try/catch now.

jaredpar

Modulo dealing with the exception in Dispose

jaredpar · 2021-10-11T17:38:51Z

@RikkiGibson, @chsienki PTAL

* Enable named mutex workaround when running on .NET using Mono runtime * Re-implement file locking logic to ensure intra- and inter-process mutual exclusion, and no longer leak lock files * Use mutex abstraction throughout test suite * Fix async race in ServerFailsWithLongTempPathUnix test case * Fixes dotnet#57002

uweigand · 2021-10-15T16:36:45Z

Just checking if there is any update on this ...

I understand it's already very late for this, but it would be great if this could still make it into .NET 6, because without the patch the compiler is not really usable on our Mono-based linux-s390x platform (large compile jobs will cause OOM aborts due to the large number of compile servers being spawned).

jaredpar · 2021-10-15T16:52:26Z

I just poked the reviewers to get a second review here (we require two reviews to product changes).

This is unlikely to make it into .NET 6.0.100 series. It's too late in the cycle and the risk factor in the change is too high. This will be in the 6.0.200 SDK though. A preview of that should be shipping essentially at the same time as .NET 6.0.100 RTM time. That should unblock the developer scenarios (this isn't a runtime issues so won't impact product). You can also hand patch the compiler it the 6.0.100 SDK using our NuPkg as a short term workaround.

uweigand · 2021-10-15T16:54:28Z

I just poked the reviewers to get a second review here (we require two reviews to product changes).

This is unlikely to make it into .NET 6.0.100 series. It's too late in the cycle and the risk factor in the change is too high. This will be in the 6.0.200 SDK though. A preview of that should be shipping essentially at the same time as .NET 6.0.100 RTM time. That should unblock the developer scenarios (this isn't a runtime issues so won't impact product). You can also hand patch the compiler it the 6.0.100 SDK using our NuPkg as a short term workaround.

That sounds very reasonable, thanks!

jaredpar · 2021-10-20T15:33:01Z

Thanks for the contribution!

uweigand requested a review from a team as a code owner October 7, 2021 10:56

ghost added Community The pull request was submitted by a contributor who is not a Microsoft employee. Area-Compilers and removed Community The pull request was submitted by a contributor who is not a Microsoft employee. labels Oct 7, 2021

uweigand force-pushed the mono-namedmutex branch from c45e40e to 549922a Compare October 7, 2021 13:37

tmds reviewed Oct 7, 2021

View reviewed changes

jaredpar reviewed Oct 7, 2021

View reviewed changes

tmds reviewed Oct 7, 2021

View reviewed changes

uweigand force-pushed the mono-namedmutex branch from 549922a to 7569565 Compare October 8, 2021 12:17

tmds reviewed Oct 8, 2021

View reviewed changes

jaredpar reviewed Oct 8, 2021

View reviewed changes

uweigand force-pushed the mono-namedmutex branch 2 times, most recently from d5544a2 to 9f43d42 Compare October 11, 2021 16:54

jaredpar reviewed Oct 11, 2021

View reviewed changes

jaredpar approved these changes Oct 11, 2021

View reviewed changes

uweigand force-pushed the mono-namedmutex branch from 9f43d42 to 995e358 Compare October 11, 2021 17:48

RikkiGibson approved these changes Oct 15, 2021

View reviewed changes

jaredpar merged commit d6eeb64 into dotnet:main Oct 20, 2021

ghost added this to the Next milestone Oct 20, 2021

RikkiGibson modified the milestones: Next, 17.1.P1 Oct 25, 2021

ayakael mentioned this pull request Oct 30, 2022

Set minimum stack size using PT_GNU_STACK header for Musl dotnet/runtime#72920

Closed

Fix named mutex workaround for .NET Mono runtime #57003

Fix named mutex workaround for .NET Mono runtime #57003

Uh oh!

Conversation

uweigand commented Oct 7, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uweigand Oct 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uweigand commented Oct 8, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jaredpar left a comment

Choose a reason for hiding this comment

Uh oh!

jaredpar commented Oct 11, 2021

Uh oh!

uweigand commented Oct 15, 2021

Uh oh!

uweigand Oct 7, 2021 •

edited

Loading