Flag unprotected concurrent dictionary access #25482
Using the simple program below, in a few tries the loop was always a single entry whose next pointer referred back to itself:

```csharp
static void Main(string[] args)
{
    var d = new Dictionary<int, int>();
    var r = new Random(42);
    Parallel.For(0, Int32.MaxValue, (int i) =>
    {
        switch (r.Next(2))
        {
            case 0:
                try
                {
                    var val = r.Next(5);
                    if (!d.ContainsKey(val))
                        d.Add(val, 1);
                }
                catch (ArgumentException) { }
                catch (IndexOutOfRangeException) { }
                break;
            case 1:
                try
                {
                    var val = r.Next(5);
                    if (d.ContainsKey(val))
                        d.Remove(val);
                }
                catch (KeyNotFoundException) { }
                break;
        }
    });
}
```
@danmosemsft My thinking about the problem also suggests that one-node loops are likely to be the most common. That suggests that simply testing for the single-node loop at formation time (which is clearly wrong) may be a sufficient mitigation. We don't have to be perfect, just catch the common cases; if we catch even 20% of the errors, then give it 5X more time and it will fail (and this is just an alert). Another thought is that it is probably MORE common that nodes are lost (the linked lists break). It occurs to me that there are pretty strong invariants we can take advantage of. In particular, we keep both a freelist count and a freelist (I don't know why; it seems redundant). We could also keep the freelist count in the free nodes. If the value stored in the nodes ever differed from the freelist count, that would indicate an invariant break, and we could throw. This should be pretty cheap and very sensitive (much more than the loop detector).
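For illustration, a minimal sketch of the freelist-count invariant described above (hypothetical names and layout, not the actual CoreCLR source): each free node records the freelist count at the time it was pushed, and a pop validates that the head's recorded count still matches the current count.

```csharp
using System;

class FreeListSketch
{
    struct Entry { public int Next; public int StoredCount; }

    Entry[] entries = new Entry[8];
    int freeList = -1;   // head index into entries; -1 = empty list
    int freeCount = 0;

    public void Free(int index)
    {
        entries[index].Next = freeList;
        freeList = index;
        freeCount++;
        // Record the count so a later pop can validate the invariant.
        entries[index].StoredCount = freeCount;
    }

    public int Allocate()
    {
        if (freeList < 0) throw new InvalidOperationException("empty");
        // Invariant: the head was pushed when the count reached its current
        // value. A mismatch means the chain and the count have diverged,
        // i.e. a concurrent writer broke the structure.
        if (entries[freeList].StoredCount != freeCount)
            throw new InvalidOperationException(
                "Concurrent operations are not supported.");
        int index = freeList;
        freeList = entries[index].Next;
        freeCount--;
        return index;
    }

    static void Main()
    {
        var f = new FreeListSketch();
        f.Free(3);
        f.Free(5);
        Console.WriteLine(f.Allocate()); // 5: LIFO order, invariant holds

        // Simulate corruption: a lost concurrent Free bumps the count
        // without actually linking a node into the chain.
        f.freeCount++;
        try { f.Allocate(); }
        catch (InvalidOperationException) { Console.WriteLine("detected"); }
    }
}
```

The check costs one extra store on free and one compare on allocate, which is why it could plausibly be "pretty cheap" as described above.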
Makes sense. Cc @jkotas, any concerns/thoughts?
Could do something like: https://github.com/dotnet/coreclr/compare/master...benaadams:dictionary-concurrent?expand=1 It isn't 100% and I'm not sure about perf, but it would catch a variety of situations (100% of spins, some double modifies, changes prior to return...). Could make it 100% with
Agree that it makes sense to do something here. I think we should guarantee that the Dictionary never hangs. If a large-scale cloud service starts hitting it in production today when the stars align, they are in trouble: spinning servers are a much worse problem than crashed servers. Detecting the imminent hang 20% of the time does not get them out of trouble.
@stephentoub FYI
For insert, we can piggyback on the existing `collisionCount` local to detect the hang.
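As an aside, here is a self-contained sketch of the kind of `collisionCount` guard being discussed (hypothetical names, not the actual CoreCLR source): the chain walk is bounded by the size of the entry array, so a corrupted cycle throws instead of spinning forever.

```csharp
using System;

class Sketch
{
    struct Entry { public int Key; public int Next; }

    static int FindEntry(Entry[] entries, int startIndex, int key)
    {
        int i = startIndex;
        uint collisionCount = 0;
        while (i >= 0)
        {
            if (entries[i].Key == key) return i;
            i = entries[i].Next;
            // A valid chain can never be longer than the entry array;
            // exceeding that length proves a cycle caused by concurrent misuse.
            if (++collisionCount > (uint)entries.Length)
                throw new InvalidOperationException(
                    "Concurrent operations are not supported.");
        }
        return -1; // not found: chain terminated normally
    }

    static void Main()
    {
        // Healthy chain: index 0 -> index 1 -> end.
        var ok = new[] { new Entry { Key = 10, Next = 1 },
                         new Entry { Key = 20, Next = -1 } };
        Console.WriteLine(FindEntry(ok, 0, 20));

        // Corrupted chain: index 0 -> index 1 -> index 0 (a cycle).
        var bad = new[] { new Entry { Key = 10, Next = 1 },
                          new Entry { Key = 20, Next = 0 } };
        try { FindEntry(bad, 0, 99); }
        catch (InvalidOperationException) { Console.WriteLine("detected"); }
    }
}
```

Because the loop already maintains a counter for collision statistics in the real code, the extra cost on the happy path is essentially one compare per chain hop.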
True, updated suggestion: https://github.com/dotnet/coreclr/compare/master...benaadams:dictionary-concurrent?expand=1
This is going to hurt on non-Intel HW. I think we should just start with the simple change that detects the hangs. Once we get feedback that it is not sufficient, we can improve upon it.
Should it be a failfast error?
Are the version checks adding anything over putting the simple
They catch data structure corruption that doesn't necessarily lead to a hang (e.g. earlier in a chain of concurrent uses). Though slowing Dictionary down makes me sad.
IMO a FailFast in release is only for indeterminate corruption or a security violation. It has to be bad enough that it is better to immediately destroy all process state. If we supported UI apps, this is a case where a user could reasonably try to save their state, or at least the app could try to write to a recovery file. If it helps, we could throw something more exotic than InvalidOperationException that is less likely to be blindly ignored.
I suggest the goal should be a targeted fix just to turn the hang into a fault. It is an especially insidious case of corruption.
Concurrent use of the Dictionary could lead to a torn write (between two inserts) of an Entry with a key not matching its value; if you are using the data to store user info or to make decisions, you will be in a very bad way (CPU usage aside).
Could "fault" the Dictionary on detection and make every operation on it fail afterwards, rather than fail fast? (As per the enumerator.)
You could just null out `_buckets`... It's useless at that point anyway. It leaves entries rooted, which might be useful in dump analysis.
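A sketch of the combined idea from the last two comments (hypothetical names, not real Dictionary code): on detection, clear the bucket array so every subsequent operation throws instead of spinning, while the entries stay rooted for dump analysis.

```csharp
using System;

class FaultableDictionarySketch
{
    int[] buckets = new int[4];

    public void DetectCorruption()
    {
        // On detection, drop the bucket array. Entries would stay rooted
        // for dump analysis, but the instance is unusable from here on.
        buckets = null;
    }

    public bool ContainsKey(int key)
    {
        if (buckets == null)
            throw new InvalidOperationException(
                "Dictionary was corrupted by concurrent use.");
        return false; // actual lookup elided in this sketch
    }

    static void Main()
    {
        var d = new FaultableDictionarySketch();
        Console.WriteLine(d.ContainsKey(1)); // still healthy

        d.DetectCorruption();
        try { d.ContainsKey(1); }
        catch (InvalidOperationException) { Console.WriteLine("faulted"); }
    }
}
```

The appeal of this shape is that it needs no extra per-instance state: the null check falls out of the bucket lookup every operation already performs.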
That's true for most invalid concurrent access to shared state. I agree with Dan's assessment of needing a very targeted change; let's not let this scope creep.
@benadams any interest in measuring the minimal loop counter change, since you already got started? And I know you have a special interest in Dictionary perf...
I want to push back on the desire to catch 100% (or near) of the hangs. Yes, if these hangs happen in production it is unfortunate, but it is no worse than now, AND it takes only one instance of this throw for users to be alerted that there is a problem. Moreover, as Ben has pointed out, hangs may be the most obvious issue, but frankly if I were running a service I would be MORE concerned about the fact that this table is very likely dropping entries or even returning the wrong value for a key. That is MUCH more difficult to debug, and arguably more impactful to the service (since a HANG is a very good indicator that something is wrong). Consider: if we have 100% hang detection, our users find their problem in 1 day. If we have something that only detects 50%, it doubles that to 2 days. I don't think that is that bad. The important part is that they find it.

Thus I think the goal here is to have a hopefully unnoticeable amount of perf/complexity impact. Ben's change is pretty lightweight (but heavier than Feng's original idea). However, I would recommend that we only do it on MUTATORS of the dictionary. Yes, that cuts down the likelihood that you catch the concurrent access quickly, but frankly fetches are much more common, and concurrent fetches ARE generally benign. This probably cuts the overhead by a factor of 2 or more (since reads probably outnumber writes by that amount).

Let's not over-engineer this. Technically this is a breaking change (it is DESIGNED to throw more exceptions than before). That is one of the benefits of Feng's original change: it turns only HANGS into exceptions, whereas now we are detecting more errors, which is actually BAD for compatibility. It should be simple, and very inexpensive. Everything else is negotiable. I am OK with Ben's change if we confine it to mutators (but we should measure the impact). We should centralize the throw because we may need a compat switch to turn it off.
Yes, we should do the equivalent of Feng's change. Turn HANGs into exceptions, and nothing else for now. I agree that we should not over-engineer this. However, I do not understand why you are pushing back on detecting 100% of the hangs. It is exactly what Feng's change does, and it is very simple.
I suspect that folks responsible for keeping large sites alive would have a different opinion about this. Living with a pager makes your coding style more defensive.
Then you'll ask me to write tests ;)
I am OK with Feng's change (simply detecting hangs). It is simple and low cost (again, we should measure). I may have been misinterpreting feedback, but my pushback on 100% was about the more complex changes that were suggested that catch more stuff (with `_version`). If we are going with Feng's suggestion, I am happy.
I should note that while Feng's change has the nice compat feature, my expectation is that hangs only account for a small fraction of the corruption that happens when you use things like Dictionary in concurrent environments. Lost entries seem like the most likely problem (and lost free list entries). We are not detecting these (almost by design, because of compat), which I think is unfortunate (and why I would be OK with a lean version of Ben's idea). The fact that my service is just operating incorrectly a small fraction of the time (in ways likely to be subtle), but in a way that is likely to grow over time (as the table becomes more and more corrupt), is actually pretty scary. But I am OK with just taking a step.
Added a PR just to break the loops: dotnet/coreclr#16991
It's working :)
`Dictionary<K, V>` is not threadsafe for concurrent access involving writes. If this occurs, the data structure can get corrupted and more and more threads can get "captured" inside `FindEntry` or `TryInsert` (or certain other methods), looping continuously until the process terminates. `HashSet<T>` has a similar backing data structure and can similarly "capture" threads inside `AddIfNotPresent` (or one of several other methods).

[Aside on how these work: lookups in these data structures use hashing into a `_buckets` array to find an index into the `_entries` array (named `_slots` in `HashSet<T>`). If there is no hash collision, that `Entry` (`Slot`) contains the sought-after entry. If there is a collision on that hash, it begins a chain, and lookup follows the chain. If the desired entry is not present in the collection, then the chain should terminate in an empty entry and the method returns (possibly after adding the entry). When entries are added, they are prepended to this chain, and `_buckets` and then `_entries` are fixed up. After removes, unused slots are kept in a free chain whose head is stored in `_freeList`, to avoid needing to scan `_entries` for unused slots. Concurrent updates to these chains (or possibly a concurrent traverse and write) can lead to loops, at which point any read or write can potentially be sucked into a busy loop.]

Although this always indicates a bug in the calling code, we have now seen this impact several large services, and it can cause expensive and growing CPU monopolization until much or all of the host machine is looping endlessly. Given that, we should consider whether there is a cheap way to flag buggy callers and throw, similar to how we maintain a `version` counter purely to flag buggy callers.

@vancem suggested a check could be done only on insert/remove (i.e. write), so there would be no perf impact on readers.
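To make the aside above concrete, here is a minimal sketch (hypothetical names and hashing, not the real source) of the bucket/entry chain layout it describes. Note the two-step prepend in `Add`: the new entry first points at the old chain head, then the bucket is redirected; two unsynchronized writers interleaving these stores is exactly how chains get broken.

```csharp
using System;

class ChainSketch
{
    struct Entry { public int Key; public int Value; public int Next; }

    int[] buckets;   // 1-based index into entries; 0 means empty bucket
    Entry[] entries;
    int count;

    public ChainSketch(int capacity)
    {
        buckets = new int[capacity];
        entries = new Entry[capacity];
    }

    public void Add(int key, int value)
    {
        int bucket = (key & 0x7FFFFFFF) % buckets.Length;
        entries[count] = new Entry
        {
            Key = key,
            Value = value,
            // Step 1 of the prepend: new entry points at the old chain head.
            Next = buckets[bucket] - 1
        };
        // Step 2: the bucket now points at the new entry.
        buckets[bucket] = ++count;
    }

    public int Get(int key)
    {
        int i = buckets[(key & 0x7FFFFFFF) % buckets.Length] - 1;
        while (i >= 0)
        {
            if (entries[i].Key == key) return entries[i].Value;
            i = entries[i].Next; // follow the collision chain
        }
        throw new InvalidOperationException("not found");
    }

    static void Main()
    {
        var c = new ChainSketch(4);
        c.Add(1, 10);
        c.Add(5, 50); // collides with key 1 when capacity is 4
        Console.WriteLine(c.Get(1));
        Console.WriteLine(c.Get(5));
    }
}
```

With single-threaded use the chain always terminates at -1; the hangs described in this issue arise when racing writers leave a `Next` chain that cycles instead.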