Another deadlock - or at least a very slow lock #107
Yes, this is actually a known problem (though it didn't have a GH issue before). In fact, this is exactly the reason why AsyncEx v5 is still in prerelease. One of my commercial clients first made me aware of it, and while they're waiting for a fix they just cranked up the min-threads on the thread pool as a workaround. This problem comes up most often when the thread pool is saturated, typically by many threads synchronously blocking on the lock at once.
What's actually happening is that the first lock goes through just fine (synchronously). Then the thread pool is exhausted with a bunch of threads synchronously blocked on that lock. Then when the first thread unlocks, it releases the next waiter, but requires a thread pool thread to do so. Thus, it must wait until the thread pool gets around to running that "release" code. Meanwhile, the thread pool is injecting one new thread every 2 seconds or so, in no rush. So we end up with a lock convoy with huge wait times due to the thread pool injection rate.

This is due to the core change of v4 -> v5: all continuations are asynchronous in the core of all the primitives.

For a solution, I'm trying out variations of a custom "completer" thread - something very similar to Ayende's blog here.
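For illustration, here's a minimal sketch of that scenario (hypothetical code, not the client's; it assumes Nito.AsyncEx v5 semantics as described above and a default-sized thread pool): many pool threads block synchronously on one AsyncLock, and every hand-off then waits on the pool's thread-injection rate.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Nito.AsyncEx;

class ConvoySketch
{
    static readonly AsyncLock Mutex = new AsyncLock();

    static void Main()
    {
        // Flood the thread pool with work items that block synchronously on the lock.
        var tasks = Enumerable.Range(0, 100).Select(_ => Task.Run(() =>
        {
            using (Mutex.Lock())        // synchronous acquire: blocks a pool thread
            {
                Task.Delay(50).Wait();  // hold the lock briefly
            }
        })).ToArray();

        // In v5, each unlock queues the next waiter's release to the thread pool.
        // With all pool threads blocked above, each hand-off waits for the pool to
        // inject a new thread (roughly one every couple of seconds) => huge total wait times.
        Task.WaitAll(tasks);
        Console.WriteLine("Done");
    }
}
```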
Hi @StephenCleary - thanks very much for the comprehensive answer. Ayende's solution looks pretty smart. Having a thread set aside to "keep things going" whilst also letting the ThreadPool race to handle the continuation might add a tiny bit of overhead (negligible, probably) but avoids this, which is a huge plus.

Your mention of min threads does remind me that in our web app I've bumped up min threads as it helped with something similar a while ago. Maybe I've just run into it again? Anyway, v3.x (with the one change discussed about a year ago in #57) is working fine for me at the moment.

I only ran into this, I think, because in our web app I have some leases on some image & blob objects which are acquired asynchronously. There's also a background process which releases them from memory if they haven't been actively used in a while. That's handled using …

Thanks again for the great library & support!
Basically these starvation issues can only ever happen when using the AsyncEx primitives both synchronously and asynchronously, isn't that right?
No; the problem can be caused by either synchronous or asynchronous APIs when the thread pool is saturated. At the core, both sync and async lock APIs require a thread pool thread in order to become unlocked, and that's what causes the issue.
Thanks, it's clear now.
Don't you think that spawning your own thread to solve this issue would be both overkill and complex? Making continuations synchronous again, so we don't have this issue, seems way nicer and easier to debug.
Making the continuations synchronous has its own problems. For a given TaskCompletionSource, its continuations can either be forced-asynchronous (used by v5) or unspecified (used by v4). Synchronization primitives in v4 avoid this thread pool exhaustion issue by using synchronous continuations. However, since end-user code is also expressed as a continuation, we have to force those continuations to be asynchronous or else we'd end up with a callback-under-lock scenario, which very easily results in deadlock. So in v4, all continuations are actually semi-asynchronous: they're synchronous enough to avoid the thread pool exhaustion but asynchronous enough to avoid the deadlock. As you can imagine, this complicates the code considerably.

I do much prefer the approach of forcing asynchrony for end-user continuations, just from a code perspective (simpler, easier to prove correct, easier to maintain). Assuming I keep this, there are two options for avoiding the thread pool exhaustion issue: introducing a separate thread, and separating end-user continuations from internal continuations (requiring separate TCSs). I've played with both and am definitely leaning towards the thread solution.
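To make the two TCS modes concrete, here's a small stand-alone illustration (not AsyncEx code) of default vs forced-asynchronous continuations:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class TcsContinuationModes
{
    static void Main()
    {
        // Default TCS: continuations may run synchronously on the completing
        // thread -- i.e. inside whatever lock the completer currently holds.
        var syncTcs = new TaskCompletionSource<object>();
        syncTcs.Task.ContinueWith(
            _ => Console.WriteLine($"sync continuation on thread {Thread.CurrentThread.ManagedThreadId}"),
            TaskContinuationOptions.ExecuteSynchronously);

        // Forced-asynchronous TCS (the v5 approach): continuations are always
        // queued to the thread pool, so they never run under the completer's lock,
        // but they do require a free pool thread to make progress.
        var asyncTcs = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
        asyncTcs.Task.ContinueWith(
            _ => Console.WriteLine($"async continuation on thread {Thread.CurrentThread.ManagedThreadId}"));

        Console.WriteLine($"completing both on thread {Thread.CurrentThread.ManagedThreadId}");
        syncTcs.SetResult(null);
        asyncTcs.SetResult(null);
        Console.ReadLine(); // give the queued continuation time to run
    }
}
```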
Anytime I've used …
The beauty of asynchronous programming is that "there is no thread". I learned it from Stephen, both from his book and articles. If it's easier to maintain I understand, of course, but I just find it a bit "sad" that we have to rely on spawning a manual Thread, which isn't controlled by the ThreadPool. It still feels complicated.
Possibly complicated, but it's entirely abstracted away by the library, to the point where 99% of library consumers won't even know it's there. Making continuations asynchronous avoids the "sometimes sync, sometimes async" behaviour that gave jQuery promises a bad name. I appreciate the environment is different - since JS is completely single-threaded - but the consistency argument still applies (see http://blog.izs.me/post/59142742143/designing-apis-for-asynchrony and https://medium.com/@bluepnume/intentionally-unleashing-zalgo-with-promises-ab3f63ead2fd for the JS side of the fence).
Thanks Ian, these are very good arguments; I agree most people will just consume the library without looking deep into the internals. One remark though: even today when you await in C#, it may complete synchronously or asynchronously as well.
Just to clarify, the synchronous vs asynchronous continuations that I'm talking about only matter when there is a continuation, so from the end-user's perspective this only applies when the await is asynchronous. When I'm talking about the continuation being synchronous or asynchronous, that's from the perspective of the TCS-completing thread.

To use a lock example, …

So there are still plenty of situations with AsyncEx where end-user code will …

This actually comes down to how …

It's my personal opinion that JavaScript's promise callback and …
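To illustrate the first point above with a lock (a sketch only; it just shows the uncontended fast path, where the awaitable is already complete and no continuation exists at all):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Nito.AsyncEx;

class FastPathSketch
{
    static readonly AsyncLock Mutex = new AsyncLock();

    static async Task DemoAsync()
    {
        // Uncontended: LockAsync() returns an already-completed awaitable, so this
        // await never suspends, no continuation is scheduled, and execution simply
        // carries on here on the same thread.
        using (await Mutex.LockAsync())
        {
            Console.WriteLine($"lock acquired synchronously on thread {Thread.CurrentThread.ManagedThreadId}");
        }

        // Contended: the await suspends, and the synchronous-vs-asynchronous
        // question applies to the *unlocking* thread that completes the waiter.
    }

    static void Main() => DemoAsync().GetAwaiter().GetResult();
}
```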
Reading the VS release notes for 15.4.1 (out as of about 40 minutes ago) I came across this Roslyn thread. It's not quite the same issue, but it reminded me of this discussion: a blocking synchronous resource - disk - coupled with lots of async calls piling up, then the thread pool starving (or growing very slowly). Their problem is they also gobbled up RAM at the same time. The solution they seem to be discussing (and have released) is to have a single thread to which the other tasks post their requests for work.

Sorry if this is off-topic - I just found it interesting as another manifestation of a problem in the same ballpark as this thread's issue.
That is very interesting. Particularly since the thread pool has special dedicated I/O threads, so the limited thread injection rate shouldn't affect that kind of scenario. But in reality, a lot of the time the BCL code running on the I/O thread will post work to a worker thread, so it "jumps" over to the more limited side before it executes end-user code.
Why not just create a SemaphoreSlim(1, 1) and use WaitAsync() on it? That has the effect of locking.
@waynebrantley Yes, that's a fine workaround while this bug is still in AsyncEx.
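For anyone who wants that workaround spelled out, a minimal wrapper over SemaphoreSlim(1, 1) could look like this (names are hypothetical; just a sketch, not a library API):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal async-lock substitute built on SemaphoreSlim(1, 1).
public sealed class SemaphoreLock
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);

    public async Task<IDisposable> LockAsync()
    {
        await _semaphore.WaitAsync().ConfigureAwait(false);
        return new Releaser(_semaphore);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly SemaphoreSlim _semaphore;
        private int _disposed;
        public Releaser(SemaphoreSlim semaphore) => _semaphore = semaphore;

        public void Dispose()
        {
            // Release exactly once, even if Dispose is called twice.
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
                _semaphore.Release();
        }
    }
}

// Usage:
//   using (await _lock.LockAsync()) { /* critical section */ }
```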
Would a wrapped SemaphoreSlim(1, 1) cause similar issues under high load?
@sjb-sjb No; it does not depend on a thread pool thread being free in order to release.
I wonder what the rationale is for using the queuing approach rather than the older SemaphoreSlim method.
Historically, …

Under the hood, …
Having this guarantee is an important feature of …
@StephenCleary Which min-thread setting in the thread pool should be enough to alleviate the problem without causing others?
@jesuissur Bumping the min-threads is one option, and some v5 users are using that as a workaround for now.
@StephenCleary Sorry, I think my question was not clear enough. Which min-threads value would be a good starting point to try the bumping workaround?
@jesuissur That depends on your application type and server. I'd recommend reading the current minimum thread count first and then increasing it from there.
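For reference, the "read the current minimum, then bump it" workaround is just this (the factor of 2 is a placeholder, not a recommendation):

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Read the current minimums before changing anything.
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        Console.WriteLine($"current min: worker={workerMin}, io={ioMin}");

        // Bump only the worker minimum; keep the I/O minimum as-is.
        // The right value is application- and server-specific.
        ThreadPool.SetMinThreads(workerMin * 2, ioMin);
    }
}
```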
Can the thread pool exhaustion be detected upfront? So if you start off something that needs one additional thread, could you know that it's going to be a problem and do something about it while a thread is still awake?
@ygoe: I've never tried. There's …
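Purely as an illustration (this is a guess at one way to do it, not something AsyncEx or the BCL provides directly): you can probe the pool by queueing a trivial work item and timing how long it takes to start; a consistently long dispatch latency suggests the pool is saturated.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class ThreadPoolProbe
{
    // Measures how long the thread pool takes to start a trivial work item.
    // A consistently large value suggests all worker threads are busy.
    public static async Task<TimeSpan> MeasureDispatchLatencyAsync()
    {
        var sw = Stopwatch.StartNew();
        await Task.Run(() => { });    // completes once a pool thread picks up the item
        return sw.Elapsed;
    }

    static async Task Main()
    {
        var latency = await MeasureDispatchLatencyAsync();
        Console.WriteLine($"dispatch latency: {latency.TotalMilliseconds} ms");
        if (latency > TimeSpan.FromMilliseconds(500))   // illustrative threshold
            Console.WriteLine("Thread pool looks starved.");
    }
}
```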
I'm sorry, I don't know it myself; I just tried to throw in an idea to get this resolved. My own library is now tied to being a pre-release because of this dependency.
I've been working on a few different experiments related to this issue. During my travels for OSCON, I thought it would be useful to write out the background and context in which I'm thinking about this problem. Doing this was a great exercise; it allowed me to step back and identify each of the issues at hand (and then choose a new approach). This comment is mostly a brain-dump for now, but some of it will be reworked into a CONTRIBUTING doc at some point.

Central Postulates:
AsyncEx Code

Most AsyncEx coordination primitive code is straightforward and doesn't even use async/await. However:
So, a good cross-section for experimentation purposes is:
When I run experiments like this, I focus on these four types. If it works for them, it should work anywhere.

AsyncEx v4

In v4, all TCS's used synchronous continuations, just like "normal" async code. v4 avoided invoking end-user callbacks under lock by separating the TCS completion code from internal logic. This necessitated a bit of a "dance" between the wait queue and the coordination primitives, with the completion code represented by a disposable. This worked well enough, but has always had one theoretical issue: stack overflow is a possibility. The TPL does have heuristics in place to prevent this, but they don't always work. AFAIK, this was never actually observed in production.

In v5, one of the main changes was to use asynchronous continuations for TCS's. This simplified the wait queue code (and coordination primitive code) by removing the "dance" completely. Unfortunately, this has caused a dependency on the thread pool even for internal operations. And this issue is not theoretical; it is very rare but has certainly been observed.

Possible Solutions

Solutions must maintain the Central Postulates above, in addition to these considerations:
The solutions that I've considered (some of these have experimental code in different branches):
Conclusion

After writing all these out, I believe the best solution going forward is the last one. My next step is to experiment with that solution on the experimental cross-section described above.
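To make the v4-style "dance" concrete, here's a heavily simplified sketch (not the actual AsyncEx code) of the general idea: waiters are completed only after the primitive's internal lock has been released, so even synchronous continuations never run under it.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Simplified sketch of separating TCS completion from internal state changes.
public sealed class SketchAsyncAutoResetEvent
{
    private readonly object _gate = new object();
    private readonly Queue<TaskCompletionSource<object>> _waiters = new Queue<TaskCompletionSource<object>>();
    private bool _signaled;

    public Task WaitAsync()
    {
        lock (_gate)
        {
            if (_signaled)
            {
                _signaled = false;
                return Task.CompletedTask;
            }
            var tcs = new TaskCompletionSource<object>();
            _waiters.Enqueue(tcs);
            return tcs.Task;
        }
    }

    public void Set()
    {
        TaskCompletionSource<object> toComplete = null;
        lock (_gate)
        {
            if (_waiters.Count > 0)
                toComplete = _waiters.Dequeue();
            else
                _signaled = true;
        }
        // Completion happens outside the lock; continuations may run synchronously
        // on this thread, but never while _gate is held.
        toComplete?.TrySetResult(null);
    }
}
```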
I'm very sorry to sound demanding, but this project appears dead to me. This library does a really great job and is really helpful in my project. I've read a lot of helpful and profound comments and answers from you in several places, which is why I consider your work valuable. I decided to use the v5 branch of AsyncEx to be more future-proof, and I think there was also another reason to use it. But this ongoing delay of a release is starting to get in my way now.

I'm about to release a library myself and it makes use of some of the AsyncEx functions. NuGet forces me to label my package as a pre-release because it has a pre-release dependency. So I'm considering removing that package dependency and incorporating all referenced code into my own code base. This isn't something I like to do, but it seems the only way out of this release situation. A regular fork of the library would be more difficult because I'd need a new name and do all the packaging. And even though forks might immediately help others, too, they usually hurt the community in the long run.

So if you could find some time to resolve this last issue, or maybe at least note it as a known issue in a stable release, I'd really appreciate that. There have been some suggestions in this discussion that sound promising. I know I'm not very helpful in the matter; I've tried before, but this just exceeds my experience.
@ygoe Yes, I have considered releasing a stable version with this as a known bug. The perfectionist in me doesn't like that, but commercial software does it all the time.

I've done a ton of work in different branches, approaching the problem in different ways, but hitting blocks each time. I did finally find a permanent solution, but haven't had time to implement it (the final solution also requires careful documentation). At the same time, I'm aware of the problems with keeping a library in prerelease for so long.

My current plan is to try to finish this issue once and for all over MVP summit (one time of the year when I have a lot more free time than normal). If I don't finish it then, I promise I'll release it as stable with a known issue.
@StephenCleary Thank you for the update. I'm looking forward to the end of March this year. :-)
Concerning the use of SemaphoreSlim: here is the code to reproduce this issue.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
namespace Example
{
public class Program
{
static readonly SemaphoreSlim s = new SemaphoreSlim(1, 1);
static void Main()
{
Task.Run(() =>
{
Console.WriteLine("Starting tasks...");
var sw = new Stopwatch();
sw.Start();
var tasks = RunTasks(onlySync: true, onlyAsync: false);
Console.WriteLine("Sync tasks started. Waiting for all to complete...");
Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("All sync tasks done! Time elapsed: " + sw.Elapsed.TotalSeconds + " seconds");
sw.Reset();
sw.Start();
tasks = RunTasks(onlySync: false, onlyAsync: true);
Console.WriteLine("Async tasks started. Waiting for all to complete...");
Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("All async tasks done! Time elapsed: " + sw.Elapsed.TotalSeconds + " seconds");
sw.Reset();
sw.Start();
tasks = RunTasks(onlySync: false, onlyAsync: false);
Console.WriteLine("Mixed tasks started. Waiting for all to complete...");
Task.WaitAll(tasks);
sw.Stop();
Console.WriteLine("All mixed tasks done! Time elapsed: " + sw.Elapsed.TotalSeconds + " seconds");
}).Wait();
Console.ReadLine();
}
static Task[] RunTasks(bool onlySync, bool onlyAsync)
{
bool mixed = !onlySync && !onlyAsync;
var tasks = new Task[30];
for (int i = 0; i < 30; i++)
{
Task task;
int i1;
if (mixed)
{
if (i % 2 == 0)
{
i1 = i;
task = Task.Run(() => DoWithSemaphore(i1));
}
else
{
i1 = i;
task = Task.Run(async () => await DoWithSemaphoreAsync(i1));
}
}
else if (onlySync)
{
i1 = i;
task = Task.Run(() => DoWithSemaphore(i1));
}
else
{
i1 = i;
task = Task.Run(async () => await DoWithSemaphoreAsync(i1));
}
tasks[i] = task;
}
return tasks;
}
static void DoWithSemaphore(int i)
{
s.Wait();
try
{
// mutate shared state atomically
Thread.Sleep(100);
}
finally
{
s.Release();
}
Console.WriteLine("Done: " + i);
}
static async Task DoWithSemaphoreAsync(int i)
{
await s.WaitAsync();
try
{
// mutate shared state atomically
await Task.Delay(100);
}
finally
{
s.Release();
}
Console.WriteLine("Done (async): " + i);
}
}
}
```
When you take a look at the source code of SemaphoreSlim, … As soon as async waiters are in the game (see line 358 of the SemaphoreSlim source), … On the other side, if we only have sync waiters (and no async ones), a pulse-wait approach is used, which does not need a thread pool. And if we only have async waiters, the problem does not come into play, since we do not block on a task; instead we await it, and when awaited, the underlying thread goes back to the thread pool.

So, I don't think that using SemaphoreSlim is affected by this issue.

@StephenCleary You mentioned that you have a final solution in mind. I'm looking forward to it and am a little bit curious what kind of solution it is.
@StephenCleary How about this approach? Is it safe?
Hey @StephenCleary, I just ran into this problem in production. Until a v5 fix is in place, do you recommend downgrading to v4 or increasing the thread pool minimum thread count? Also, any idea as to when the issue might be resolved in v5?
@tmendez3 Whether or not increasing your thread pool minimum will help depends on your system.

I use AsyncLock mainly as part of a cache-aside implementation to keep multiple consumers from going to remote caches or the system-of-record for the same cache key, and I call through that code 500M times a day. It doesn't matter what I set minimum threads to; when using v5 it will sporadically (between a few times a week and a few times a day) use them all up, and as many more as it can create, before the process gets recycled for being unresponsive. If that starts happening on more than one of my VMs at a time, it's a quick and slippery slope to cascading overload failure. On v3, everything is peachy; I've been running this system for years with 99.995% reliability. I haven't really given v4 a shot, so I cannot comment on that.

I also target .NET Framework v4.7.2 - it sounds like maybe Core does not exhibit this issue - and of course I'm going through updating dependencies as part of an effort to target Core (no small task on an active 15-year-old code base serving nearly a billion requests a month).

I do have two key prefixes in my system that can end up calling AsyncLock.Lock() and AsyncLock.LockAsync() from different places on the same AsyncLock object (every unique cache key gets its own AsyncLock), and I haven't yet tested whether removing that overlap clears up our issues with v5 or if they persist.
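For context, a per-key cache-aside lock like the one described might look roughly like this (purely illustrative, not @qstarin's code; the loader delegate is a hypothetical stand-in for the system-of-record call):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Nito.AsyncEx;

public class CacheAside<T>
{
    private readonly ConcurrentDictionary<string, T> _cache = new ConcurrentDictionary<string, T>();
    private readonly ConcurrentDictionary<string, AsyncLock> _locks = new ConcurrentDictionary<string, AsyncLock>();

    // Only one caller per key goes to the system of record; the rest wait on that key's lock.
    public async Task<T> GetAsync(string key, Func<string, Task<T>> loadFromSystemOfRecordAsync)
    {
        if (_cache.TryGetValue(key, out var cached))
            return cached;

        var keyLock = _locks.GetOrAdd(key, _ => new AsyncLock());
        using (await keyLock.LockAsync())
        {
            if (_cache.TryGetValue(key, out cached))   // double-check after acquiring the lock
                return cached;

            var value = await loadFromSystemOfRecordAsync(key);
            _cache[key] = value;
            return value;
        }
    }
}
```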
Thanks for that use case @qstarin.

I did de-prioritize this for quite a while due to concerns that fixing this wouldn't actually fix the underlying issue. Specifically, if your code is exhausting the thread pool, sure, that would cause a deadlock (technically a slow lock) with … However, I do want to fix this issue.

There are a couple of paths forward. One is the "fully correct" solution outlined above; this would require several consecutive days of work and would considerably complicate the implementation. Unfortunately, my opportunity for "several consecutive days of work" on a hobby project isn't great. Historically, I've been able to do the "hard" work on AsyncEx at the Microsoft MVP conference. But 2020's MVP conference was virtual so I lost that opportunity, and next year's conference is also virtual.

The other solution is to essentially take the AsyncEx 4 solution. This would only require a weekend or so, not an entire week. Technically, this solution is incorrect because it assumes … Well, there are no obviously correct answers. :)

But I think that in the interest of resolving this issue for AsyncEx v6 and above (the fix would require a major version rev), I'm going to take a weekend (sometime...) and restore the v4-style completion behavior.
…e time: The minimum number of threads in the .NET thread pool is increased to 4. This should prevent the application from hanging if a continuation cannot run because there is no thread to execute it. Also see: StephenCleary/AsyncEx#107
So any news on this?
@milspiir Not that I've seen, and I'm still on v3. I did remove any and all mixed sync/async locking on any AsyncLock objects and that made no difference in the deadlocking behavior observed in v5. I also tested v4 and found similar sporadic deadlocking. A simple SemaphoreSlim lock in place of AsyncLock does not appear to deadlock, but performs quite badly. v3 appears to perform noticeably better in both uncontested and contested (especially contested) use than v4 & v5 - all of which perform vastly better than a simple SemaphoreSlim.
fwiw, @StephenCleary, I've had StackOverflowExceptions trigger with the default v3 behavior of resuming TaskCompletionSources synchronously - it happens when I have close to 300 or more tasks waiting on the same lock object. In those cases, providing AsyncLock with an implementation of IAsyncWaitQueue that creates the TaskCompletionSource with the TaskCreationOptions.RunContinuationsAsynchronously option, but is otherwise the same as the wait queue in v3 of the library, avoids the overflow - but also performs more poorly than the included implementation that continues synchronously.
I'm still planning to eventually fix this, and I still haven't had time. As noted previously, my primary hacking time is during conferences, which haven't happened since covid hit.
As an aside, we just got bit by this. Removing a background task (started with …
Hello Stephen. I'm currently being hit by this (I think). Hope you get a conference soon :-)
I added this to #57 but thought that since it's closed, the repro for a bug there may not have been noticed.
Here's the last post from that thread... Basically it's an issue which v3 doesn't exhibit but v4 and v5 do. I think I may have hit this in production, just after I finally moved to v4.x after staying on a patched v3.x for many months.
It's intermittent but definitely does reproduce for me after a few executions. Hoping you can shed some light on the issue :)
Thanks for the library and the general community work!
Post from #57 starts now...
The repro provided by @Allon-Guralnek will cause the issue for me.
In LinqPad I've got this
You'll note that I've got a Task.Run() in the main() method to get it started.
Sometimes it runs through just fine. Other times it'll stop.
Example output
and 1 minute 13 seconds later I'm still nowhere :(
After 1 minute 28 seconds it suddenly finished
But I can hit stop & start and get it working again. 9 times out of 10 it works properly. Removing the Console.WriteLine inside the loop seems to make it fail more often.
With LinqPad I can quickly swap between:
- Nito.AsyncEx.Coordination NuGet package - this made the poor output above
- Nito.AsyncEx NuGet package v3.x - works
- Nito.AsyncEx NuGet package v4.x (strong named one was convenient to use) - failed first go with no output at all :(
- Nito.AsyncEx v3.x with one line patched from earlier - works

So this appears to be a regression where v3.x - with the one-line patch from earlier in this issue - is good but AsyncEx v4.x and v5.x are both bad.
It's a tricky one as it's intermittent and does eventually come good.