Flag unprotected concurrent dictionary access #25482
Using the simple program below, in a few tries the loop was always a single entry whose next pointer referred back to itself:

```csharp
static void Main(string[] args)
{
    var d = new Dictionary<int, int>();
    var r = new Random(42);
    Parallel.For(0, Int32.MaxValue, (int i) =>
    {
        switch (r.Next(2))
        {
            case 0:
                try
                {
                    var val = r.Next(5);
                    if (!d.ContainsKey(val))
                        d.Add(val, 1);
                }
                catch (ArgumentException) { }
                catch (IndexOutOfRangeException) { }
                break;
            case 1:
                try
                {
                    var val = r.Next(5);
                    if (d.ContainsKey(val))
                        d.Remove(val);
                }
                catch (KeyNotFoundException) { }
                break;
        }
    });
}
```
@danmosemsft My thinking about the problem also suggests that one-node loops are likely to be the most common. That suggests that simply testing for the single-node loop at formation time (which is clearly wrong) may be a sufficient mitigation. We don't have to be perfect, just catch the common cases; if we catch even 20% of the errors, then give it 5X more time and it will fail (and this is just an alert). Another thought is that it is probably MORE common that nodes are lost (the linked lists break). It occurs to me that there are pretty strong invariants we can take advantage of. In particular, we keep both a freelist count and a freelist (I don't know why; it seems redundant). We could also keep the freelist count in the free nodes. If the value stored in the nodes ever differed from the freelist count, that would indicate an invariant break, and we could throw. This should be pretty cheap and very sensitive (much more than the loop detector).
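For illustration, a minimal sketch of the freelist-count invariant described above (hypothetical names and layout, not the actual CoreCLR source): each free node records the freelist count at the time it was pushed, and a pop validates that the head's recorded count still matches the current count.

```csharp
using System;

class FreeListSketch
{
    struct Entry { public int Next; public int StoredCount; }

    Entry[] entries = new Entry[8];
    int freeList = -1;   // head index into entries; -1 = empty list
    int freeCount = 0;

    public void Free(int index)
    {
        entries[index].Next = freeList;
        freeList = index;
        freeCount++;
        // Record the count so a later pop can validate the invariant.
        entries[index].StoredCount = freeCount;
    }

    public int Allocate()
    {
        if (freeList < 0) throw new InvalidOperationException("empty");
        // Invariant: the head was pushed when the count reached its current
        // value. A mismatch means the chain and the count have diverged,
        // i.e. a concurrent writer broke the structure.
        if (entries[freeList].StoredCount != freeCount)
            throw new InvalidOperationException(
                "Concurrent operations are not supported.");
        int index = freeList;
        freeList = entries[index].Next;
        freeCount--;
        return index;
    }

    static void Main()
    {
        var f = new FreeListSketch();
        f.Free(3);
        f.Free(5);
        Console.WriteLine(f.Allocate()); // 5: LIFO order, invariant holds

        // Simulate corruption: a lost concurrent Free bumps the count
        // without actually linking a node into the chain.
        f.freeCount++;
        try { f.Allocate(); }
        catch (InvalidOperationException) { Console.WriteLine("detected"); }
    }
}
```

The check costs one extra store on free and one compare on allocate, which is why it could plausibly be "pretty cheap" as described above.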
Makes sense. Cc @jkotas, any concerns/thoughts?
Could do something like: https://github.com/dotnet/coreclr/compare/master...benaadams:dictionary-concurrent?expand=1 It isn't 100% and I'm not sure about perf, but it would catch a variety of situations (100% of spins, some double modifies, changes prior to return...). Could make it 100% with
Agree that it makes sense to do something here. I think we should guarantee that the Dictionary never hangs. If a large-scale cloud service starts hitting it in production today when the stars align, they are in trouble: spinning servers are a much worse problem than crashed servers. Detecting the imminent hang 20% of the time does not get them out of trouble.
@stephentoub FYI
For insert, we can piggyback on the existing `collisionCount` local to detect the hang.
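As an aside, here is a self-contained sketch of the kind of `collisionCount` guard being discussed (hypothetical names, not the actual CoreCLR source): the chain walk is bounded by the size of the entry array, so a corrupted cycle throws instead of spinning forever.

```csharp
using System;

class Sketch
{
    struct Entry { public int Key; public int Next; }

    static int FindEntry(Entry[] entries, int startIndex, int key)
    {
        int i = startIndex;
        uint collisionCount = 0;
        while (i >= 0)
        {
            if (entries[i].Key == key) return i;
            i = entries[i].Next;
            // A valid chain can never be longer than the entry array;
            // exceeding that length proves a cycle caused by concurrent misuse.
            if (++collisionCount > (uint)entries.Length)
                throw new InvalidOperationException(
                    "Concurrent operations are not supported.");
        }
        return -1; // not found: chain terminated normally
    }

    static void Main()
    {
        // Healthy chain: index 0 -> index 1 -> end.
        var ok = new[] { new Entry { Key = 10, Next = 1 },
                         new Entry { Key = 20, Next = -1 } };
        Console.WriteLine(FindEntry(ok, 0, 20));

        // Corrupted chain: index 0 -> index 1 -> index 0 (a cycle).
        var bad = new[] { new Entry { Key = 10, Next = 1 },
                          new Entry { Key = 20, Next = 0 } };
        try { FindEntry(bad, 0, 99); }
        catch (InvalidOperationException) { Console.WriteLine("detected"); }
    }
}
```

Because the loop already maintains a counter for collision statistics in the real code, the extra cost on the happy path is essentially one compare per chain hop.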
True, updated suggestion: https://github.com/dotnet/coreclr/compare/master...benaadams:dictionary-concurrent?expand=1
This is going to hurt on non-Intel HW. I think we should just start with the simple change that detects the hangs. Once we get feedback that it is not sufficient, we can improve upon it.
Should it be a failfast error?
Are the version checks adding anything over putting the simple
They catch data structure corruption that doesn't necessarily lead to a hang (e.g. earlier in a chain of concurrent uses). Though slowing Dictionary down makes me sad.
IMO a FailFast in release is only for indeterminate corruption or a security violation. It has to be bad enough that it is better to immediately destroy all process state. If we supported UI apps, this is a case where a user could reasonably try to save their state, or at least the app could try to write to a recovery file. If it helps, we could throw something more exotic than InvalidOperationException that is less likely to be blindly ignored.
I suggest the goal should be a targeted fix just to turn the hang into a fault. It is an especially insidious case of corruption.
Concurrent use of the Dictionary could lead to a torn write (between two inserts) of an Entry with a key not matching its value; if you are using the data to store user info or to make decisions, you will be in a very bad way (CPU usage aside).
Could "fault" the Dictionary on detection and make every operation on it fail afterwards, rather than fail fast? (As per the enumerator.)
You could just null out `_buckets`... It's useless at that point anyway. It leaves entries rooted, which might be useful in dump analysis.
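A sketch of the combined idea from the last two comments (hypothetical names, not real Dictionary code): on detection, clear the bucket array so every subsequent operation throws instead of spinning, while the entries stay rooted for dump analysis.

```csharp
using System;

class FaultableDictionarySketch
{
    int[] buckets = new int[4];

    public void DetectCorruption()
    {
        // On detection, drop the bucket array. Entries would stay rooted
        // for dump analysis, but the instance is unusable from here on.
        buckets = null;
    }

    public bool ContainsKey(int key)
    {
        if (buckets == null)
            throw new InvalidOperationException(
                "Dictionary was corrupted by concurrent use.");
        return false; // actual lookup elided in this sketch
    }

    static void Main()
    {
        var d = new FaultableDictionarySketch();
        Console.WriteLine(d.ContainsKey(1)); // still healthy

        d.DetectCorruption();
        try { d.ContainsKey(1); }
        catch (InvalidOperationException) { Console.WriteLine("faulted"); }
    }
}
```

The appeal of this shape is that it needs no extra per-instance state: the null check falls out of the bucket lookup every operation already performs.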
That's true for most invalid concurrent access to shared state. I agree with Dan's assessment of needing a very targeted change; let's not let this scope creep.
@benadams any interest in measuring the minimal loop counter change, since you already got started? And I know you have a special interest in Dictionary perf...
I want to push back on the desire to catch 100% (or near) of the hangs. Yes, if these hangs happen in production it is unfortunate, but it is no worse than now, AND it takes only one instance of this throw for users to be alerted that there is a problem. Moreover, as Ben has pointed out, hangs may be the most obvious issue, but frankly if I were running a service I would be MORE concerned about the fact that this table is very likely dropping entries or even returning the wrong value for a key. That is MUCH more difficult to debug, and arguably more impactful to the service (since a HANG is a very good indicator that something is wrong). Consider: if we have 100% hang detection, our users find their problem in 1 day. If we have something that only detects 50%, it doubles that to 2 days. I don't think that is that bad. The important part is that they find it.

Thus I think the goal here is to have a hopefully unnoticeable amount of perf/complexity impact. Ben's change is pretty lightweight (but heavier than Feng's original idea). However, I would recommend that we only do it on MUTATORS of the dictionary. Yes, that cuts down the likelihood that you catch the concurrent access quickly, but frankly fetches are much more common, and concurrent fetches ARE generally benign. This probably cuts the overhead by a factor of 2 or more (since reads probably outnumber writes by that amount).

Let's not over-engineer this. Technically this is a breaking change (it is DESIGNED to throw more exceptions than before). That is one of the benefits of Feng's original change: it turns only HANGS into exceptions, whereas now we are detecting more errors, which is actually BAD for compatibility. It should be simple, and very inexpensive. Everything else is negotiable. I am OK with Ben's change if we confine it to mutators (but we should measure the impact). We should centralize the throw because we may need a compat switch to turn it off.
Yes, we should do the equivalent of Feng's change. Turn HANGs into exceptions, and nothing else for now. I agree that we should not over-engineer this. However, I do not understand why you are pushing back on detecting 100% of the hangs. It is exactly what Feng's change does, and it is very simple.
I suspect that folks responsible for keeping large sites alive would have a different opinion about this. Living with a pager makes your coding style more defensive.
Then you'll ask me to write tests ;)
I am OK with Feng's change (simply detecting hangs). It is simple and low cost (again, we should measure). I may have been misinterpreting feedback, but my pushback on 100% was about the more complex changes that were suggested that catch more stuff (with `_version`). If we are going with Feng's suggestion, I am happy.
I should note that while Feng's change has the nice compat feature, my expectation is that hangs only account for a small fraction of the corruption that happens when you use things like Dictionary in concurrent environments. Lost entries seem like the most likely problem (and lost free list entries). We are not detecting these (almost by design, because of compat), which I think is unfortunate (and why I would be OK with a lean version of Ben's idea). The fact that my service is just operating incorrectly a small fraction of the time (in ways likely to be subtle), but in a way that is likely to grow over time (as the table becomes more and more corrupt), is actually pretty scary. But I am OK with just taking a step.
Added a PR just to break the loops: dotnet/coreclr#16991
It's working :)
`Dictionary<K, V>` is not threadsafe for concurrent access involving writes. If this occurs, the data structure can get corrupted and more and more threads can get "captured" inside `FindEntry` or `TryInsert` (or certain other methods), looping continuously until the process terminates. `HashSet<T>` has a similar backing data structure and can similarly "capture" threads inside `AddIfNotPresent` (or one of several other methods).

[Aside on how these work: lookups in these data structures use hashing into a `_buckets` array to find an index into the `_entries` array (named `_slots` in `HashSet<T>`). If there is no hash collision, that `Entry` (`Slot`) contains the sought-after entry. If there is a collision on that hash, it begins a chain, and lookup follows the chain. If the desired entry is not present in the collection, then the chain should terminate in an empty entry and the method returns (possibly after adding the entry). When entries are added, they are prepended to this chain, and `_buckets` and then `_entries` are fixed up. After removes, unused slots are kept in a free chain whose head is stored in `_freeList`, to avoid needing to scan `_entries` for unused slots. Concurrent updates to these chains (or possibly a concurrent traverse and write) can lead to loops, at which point any read or write can potentially be sucked into a busy loop.]

Although this always indicates a bug in the calling code, we have now seen this impact several large services, and it can cause expensive and growing CPU monopolization until much or all of the host machine is looping endlessly. Given that, we should consider whether there is a cheap way to flag buggy callers and throw, similar to how we maintain a `version` counter purely to flag buggy callers.

@vancem suggested a check could be done only on insert/remove (i.e. write), so there would be no perf impact on readers.
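To make the aside above concrete, here is a minimal sketch (hypothetical names and hashing, not the real source) of the bucket/entry chain layout it describes. Note the two-step prepend in `Add`: the new entry first points at the old chain head, then the bucket is redirected; two unsynchronized writers interleaving these stores is exactly how chains get broken.

```csharp
using System;

class ChainSketch
{
    struct Entry { public int Key; public int Value; public int Next; }

    int[] buckets;   // 1-based index into entries; 0 means empty bucket
    Entry[] entries;
    int count;

    public ChainSketch(int capacity)
    {
        buckets = new int[capacity];
        entries = new Entry[capacity];
    }

    public void Add(int key, int value)
    {
        int bucket = (key & 0x7FFFFFFF) % buckets.Length;
        entries[count] = new Entry
        {
            Key = key,
            Value = value,
            // Step 1 of the prepend: new entry points at the old chain head.
            Next = buckets[bucket] - 1
        };
        // Step 2: the bucket now points at the new entry.
        buckets[bucket] = ++count;
    }

    public int Get(int key)
    {
        int i = buckets[(key & 0x7FFFFFFF) % buckets.Length] - 1;
        while (i >= 0)
        {
            if (entries[i].Key == key) return entries[i].Value;
            i = entries[i].Next; // follow the collision chain
        }
        throw new InvalidOperationException("not found");
    }

    static void Main()
    {
        var c = new ChainSketch(4);
        c.Add(1, 10);
        c.Add(5, 50); // collides with key 1 when capacity is 4
        Console.WriteLine(c.Get(1));
        Console.WriteLine(c.Get(5));
    }
}
```

With single-threaded use the chain always terminates at -1; the hangs described in this issue arise when racing writers leave a `Next` chain that cycles instead.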