Skip to content

Conversation

@uweigand
Copy link
Contributor

@uweigand uweigand commented Oct 7, 2021

@uweigand uweigand requested a review from a team as a code owner October 7, 2021 10:56
@ghost ghost added Community The pull request was submitted by a contributor who is not a Microsoft employee. Area-Compilers and removed Community The pull request was submitted by a contributor who is not a Microsoft employee. labels Oct 7, 2021
{
mutex?.Dispose();
}
return ServerFileMutexPair.WasOpen(mutexName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to lock the -alive file and if we were unable to lock it, we'd assume the server was holding a lock.
Now it checks the file exists. So if the server has died before it deleted the file, this will no longer return false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this seems to have been an incorrect optimization. But I think we can just remove it and go back to trying to the lock the mutex as before. I'll verify.

Copy link
Contributor Author

@uweigand uweigand Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, turns out this wasn't just an optimization. The problem with the current implementation is that it actually takes the lock -even if just for a short period of time-. If you have the client spin in a loop on WasServerMutexOpen to wait until the server has started, this may actually prevent the server from starting, because it may just hit one of those short periods where the client has taken the lock, and then think another server is already running.

The named mutex implementation doesn't have that problem since you can use TryOpenExisting without actually taking the lock.

That was the actual reason why I wanted to change the Mono fallback of WasOpen to use something that doesn't actually take the lock. If we cannot use file existence as that check, I'll need to think about this a bit more ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This WasOpen is also subject to races, even with named mutexes. By the time a connection is attempted WasOpen may have changed from true to false.

I think it would be better to remove WasOpen and perform an actual connect.
If it fails with ECONNREFUSED/ENOENT the server is not running, and a new server must be started.
Within some timeout (TimeOutMsNewProcess) we should try to connect to the new server (and allow getting some ECONNREFUSED/ENOENT until it has started).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes, the WasServerMutexOpen routine is subject to certain races by design, but it looks like the callers expect and handle those. While I guess this could all be designed differently, at this point I'm really only looking to make the existing callers work with the file-based implementation just as they do with the mutex-based implementation. This requires to avoid introducing those other types of races which the callers do not currently expect.

await ApplyEnvironmentVariables(
new[] { new KeyValuePair<string, string>("TMPDIR", newTempDir.Path) },
async () =>
async () => await Task.Run(async () =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why this change is needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I described the background in the linked issue. But in brief, ApplyEnvironmentVariables sets TMPDIR, runs the lambda, and then resets TMPDIR to its prior value. But it doesn't do any await on the function, and it seems that this means there's a race where sometimes TMPDIR is reset in the middle of executing the lambda - I've seen cases where the server was started with the new TMPDIR and then the client ran with the old TMPDIR. Since in the file lock model TMPDIR becomes part of the lock file name, server and client were then using two different lock files, causing failures.

// 'delete if no one has the file open' operation atomically on *nix. This is a leak.
if (IsLocked)
{
File.Delete(FilePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the case where the compiler server process is killed. That will leave the "lock" file on disk. Hence all future invocations of the server will believe there is an active lock which there is in fact not. How does the code account for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this should actually work correctly with the TryLock, because it tries to open the file using FileShare.None. This will fail if the server that has the file open is still running, but it will succeed if the server has crashed even if the file is still there (terminating the process holding the lock frees it).

As Tom points out above, my new WasOpen implementation is indeed incorrect in that case, so I'll go back to just attempting a TryLock there as it used to.

// 'delete if no one has the file open' operation atomically on *nix. This is a leak.
if (IsLocked)
{
File.Delete(FilePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is subject to a subtle race.

Another thread may open the file before we delete it and successfully lock it after we've disposed the FileStream.
So that thread assumes it holds the mutex.

Now, if another threads tries to own the mutex, it will also succeed because it will create a new file.

NuGet suffers from the same issue and that is why it doesn't clean up its lock files in /tmp.

.NET 7 will handle this race when combining FileMode.OpenOrCreate, FileShare.None, FileOptions.DeleteOnClose (dotnet/runtime#55327).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that's unfortunate. I guess we'll have to go back to leaking files. Unlocking would then simply dispose the FileStream, which releases the lock.

@uweigand
Copy link
Contributor Author

uweigand commented Oct 8, 2021

Given the various issues pointed out, I've gone back and redesigned the whole file locking approach. This new attempt should address everything raised earlier.

The basic idea of the new approach is to use two files in a slightly different manner: a "lock" file and a "guard" file.

The lock file represents the status of the lock, it is present and opened exclusively (FileShare.None) when the lock is held, it is
deleted when the lock is not held, and it may be present but not locked exclusively if the lock is abandoned (process holding it died).

The guard file is used only internally to the lock algorithm, and only ever held for very brief periods of time. Its purpose is to
protect modifications to the status of the lock file where those cannot be otherwise made atomically.

This allows in particular for the core of the WasServerMutexOpen logic to be implemented by atomically (trying to) lock the lock
file and then unlock it again, all while holding the guard file. This prevents any other process from seeing a spurious "locked"
status at some instance in between.

The other benefit of the guard file is that allows the lock file to be deleted on unlock while avoiding the FileShare.None creation
race pointed out by Tom. Because we create the lock file only while holding the guard, and we delete the lock file only while holding an exclusive lock to it, at the end of a successful FileStream constructor with FileShare.None, we can only be in one of these two stable situations: either the file still exists on disk and we hold an exclusive lock to it, or else the file was deleted and does not exist on disk. We can use a simple File.Exists check to safely distinguish the two cases and close the race.

This prevents leaking of lock files. We still cannot delete the guard file due to the same race. But because the guard file is only held internally, and for bounded (short) periods of time, we can actually use a single guard file for all locks. So we leak only a single file, which I'd consider acceptable (in particular given we already leak the directory).

As a comment on the implementation: since the guard and lock files are now used somewhat asymmetrically, I've given up the abstraction of a ServerFileMutexPair holding two FileMutex instances, but instead just have a single ServerFileMutex that manages both guard and lock file internally.

The new implementation still fixes the original problem I saw during source-build, and still reliably passes the full VBCSCompiler.UnitTests suite.

// with file deletion because we hold the lock.)
if (!File.Exists(FilePath))
{
stream.Dispose();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file was deleted because someone gave up their lock. We should try the open again here, and we're expected to be able to lock the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the caller (TryLock) will retry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The lock could be obtained, but you wait for another attempt to actually obtain it.
I'd take it immediately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment about it in the !Exists path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I'm handling it the same as any other failure to acquire the lock (IOException) - both return false to the caller, which then may (or may not, depending on which caller it is) decide to retry. This is only a rare corner case, not sure it makes sense to add extra program logic to handle this differently. I can add another comment, of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException means the file cannot be locked. !Exists means the lock should be obtainable. I was surprised it causes TryLock to return false. A comment would definitely be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this may just be semantics, but to me they both mean basically the same thing: "in the recent past, someone else owned the lock, and therefore we failed to acquire it on this try". Anyway, I'll add the comment.

try
{
// Attempt to acquire lock while holding guard.
using var guard = LockGuard();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we should be passing down and respecting the timeoutMs parameter in the LockGuard method. There are cases where that method could infinitely loop while this method is meant to have an explicit timeout.

The easiest, but admittedly, pathological case where this could happen is just a user opening guard for edit on the machine with FileShare.None outside this process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why we would want to avoid infinite loops. However, I think using the original timeout may be the wrong approach, in particular if the timeout is zero. Any failure to acquire the guard is likely spurious (in particular as the same guard file is shared across multiple locks), so we should not give up on the first try. What about just limiting the number of retries in LockGuard by some fixed constant (e.g. 100) instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me.

return false;
}
}
catch (IOException)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just catch Exception. There are many exceptions that IO can throw that don't derive from IOException and those would cause the server to fail to start up entirely with the current catch structure. As your discussion above indicated the method isn't concerned with "why" this couldn't be acquired just if it could or could not be acquired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intended to match behavior of the current code. My understanding is that we want to distiguish between failures where it makes sense to retry (i.e. someone else owns the lock and we wait until they give it up), and failures that are permanent so it makes no sense to retry (e.g. the /tmp directory is not writable in a container).

Are there any failure classes of the first type that are not handled by the IOException check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code using FileShare.None for locking in NuGet.Client also handles UnauthorizedAccessException: https://github.com/tmds/NuGet.Client/blob/0645de18160e94b80401d449d99a36ced8ece5a4/src/NuGet.Core/NuGet.Common/ConcurrencyUtilities.cs#L48-L69. I think this could be handling some Windows specific behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor is documented as throwing a variety of exceptions including IOException, SecurityException, ArgumentException (less interesting), and UnauthorizedAccessException.

In general though predicting the set of exceptions a method in .NET can throw often not worthwhile. It's easy to miss them (both in docs and in reading through code). What I do is often ask myself the following question:

Imagine I get a bug report in a few months that said "turns out this API also throws WhackyException".

If my response to that is to change the code to catch WhackyException then it's better to just have the code catch Exception right off the bat.

For IO in particular it's generally less interesting why it failed as it is to know whether or not it failed. That is why we generally prefer Exception here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now updated to catch all exceptions here.

@uweigand uweigand force-pushed the mono-namedmutex branch 2 times, most recently from d5544a2 to 9f43d42 Compare October 11, 2021 16:54
return;
IsDisposed = true;
if (Stream is not null)
UnlockFile();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UnlockFile method is doing IO that could potential throw. Need to catch that here so that exceptions don't leak out from Dispose methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Added a try/catch now.

Copy link
Member

@jaredpar jaredpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo dealing with the exception in Dispose

@jaredpar
Copy link
Member

@RikkiGibson, @chsienki PTAL

* Enable named mutex workaround when running on .NET using Mono runtime

* Re-implement file locking logic to ensure intra- and inter-process
  mutual exclusion, and no longer leak lock files

* Use mutex abstraction throughout test suite

* Fix async race in ServerFailsWithLongTempPathUnix test case

* Fixes dotnet#57002
@uweigand
Copy link
Contributor Author

Just checking if there is any update on this ...

I understand it's already very late for this, but it would be great if this could still make it into .NET 6, because without the patch the compiler is not really usable on our Mono-based linux-s390x platform (large compile jobs will cause OOM aborts due to the large number of compile servers being spawned).

@jaredpar
Copy link
Member

I just poked the reviewers to get a second review here (we require two reviews to product changes).

This is unlikely to make it into .NET 6.0.100 series. It's too late in the cycle and the risk factor in the change is too high. This will be in the 6.0.200 SDK though. A preview of that should be shipping essentially at the same time as .NET 6.0.100 RTM time. That should unblock the developer scenarios (this isn't a runtime issues so won't impact product). You can also hand patch the compiler it the 6.0.100 SDK using our NuPkg as a short term workaround.

@uweigand
Copy link
Contributor Author

I just poked the reviewers to get a second review here (we require two reviews to product changes).

This is unlikely to make it into .NET 6.0.100 series. It's too late in the cycle and the risk factor in the change is too high. This will be in the 6.0.200 SDK though. A preview of that should be shipping essentially at the same time as .NET 6.0.100 RTM time. That should unblock the developer scenarios (this isn't a runtime issues so won't impact product). You can also hand patch the compiler it the 6.0.100 SDK using our NuPkg as a short term workaround.

That sounds very reasonable, thanks!

@jaredpar jaredpar merged commit d6eeb64 into dotnet:main Oct 20, 2021
@ghost ghost added this to the Next milestone Oct 20, 2021
@jaredpar
Copy link
Member

Thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Named mutex bug on .NET using Mono runtime causing compile server OOM

4 participants