Non-blocking ConcurrentDictionary #50337
Conversation
Tagging subscribers to this area: @eiriktsarpalis. Issue Details: Not done yet. In progress.
@AArnott does VS have a benchmark that stresses ConcurrentDictionary? Or perhaps a usage pattern we should be sure to measure.
Thanks for making progress on this! I remember talking to you about trying to replace the guts of ConcurrentDictionary with your implementation over six years ago 😄 I'll look forward to reviewing the implementation. In the meantime, there are some benchmarks in dotnet/performance. However, from a performance perspective, the most important thing is that reads are as fast or faster, and then obviously the purpose of such an implementation would be to improve the scalability of updates, so that should be proven, while also not meaningfully regressing at lower core counts. It's also relatively common to create ConcurrentDictionary instances even if they're not heavily used, so measuring just the raw overhead of constructing an instance (including allocations) is of interest.
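For illustration, here is a minimal BenchmarkDotNet sketch of the kinds of measurements described above: read throughput on a small prepopulated dictionary, plus the raw cost of constructing an instance. The 512-element size and the keys used are arbitrary assumptions for the sketch, not taken from dotnet/performance.

using System.Collections.Concurrent;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // include allocation numbers for the construction benchmark
public class ConcurrentDictionaryReadBenchmarks
{
    private ConcurrentDictionary<int, int> _dict = null!;

    [GlobalSetup]
    public void Setup()
    {
        // Small dictionary, read-heavy workload: the case that must not regress.
        _dict = new ConcurrentDictionary<int, int>();
        for (int i = 0; i < 512; i++)
        {
            _dict.TryAdd(i, i);
        }
    }

    [Benchmark]
    public bool TryGetValue_Hit() => _dict.TryGetValue(256, out _);

    [Benchmark]
    public bool TryGetValue_Miss() => _dict.TryGetValue(100_000, out _);

    // Raw overhead of creating an instance, including allocations.
    [Benchmark]
    public ConcurrentDictionary<int, int> Construct() => new();

    public static void Main() => BenchmarkRunner.Run<ConcurrentDictionaryReadBenchmarks>();
}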
I was not happy with the guarantee on when dead keys can be GC-ed (eventually, maybe, if/when resize happens).
What are the key differences here? What does non-blocking mean? No locks?
@davidfowl - it is basically lock-free, without ambiguities like "are spin-locks ok?".
@danmoseley Jeff Robison's file watcher service in VS makes heavy concurrent use of ConcurrentDictionary, I believe. I suspect he has stress tests for it. Do you want to reach out to him?
@VSadov would that be useful?
You don't need to go far to find uses of ConcurrentDictionary. Every socket operation on Linux performs a read on a ConcurrentDictionary (for a mapping from socket file descriptor to corresponding context). Every HTTP request on an HttpClient/SocketsHttpHandler performs a read on a ConcurrentDictionary (for the connection pool). Every static method on Regex reads a ConcurrentDictionary (for the regex cache). Every serialization with JsonSerializer reads a ConcurrentDictionary (to find the previously catalogued data for the target type). Etc.
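Most of those uses share the same "sporadic adds, heavy reads" cache shape. A generic sketch of that access pattern (this is an illustration only, not the actual Regex cache implementation):

using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class PatternCache
{
    // Hypothetical cache: sporadic adds (first use of a pattern), heavy reads afterwards.
    private static readonly ConcurrentDictionary<string, Regex> s_cache = new();

    public static bool IsMatch(string pattern, string input) =>
        // The hot path is a dictionary lookup; the factory only runs on a cache miss.
        s_cache.GetOrAdd(pattern, static p => new Regex(p, RegexOptions.Compiled)).IsMatch(input);
}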
As for benchmarks, you could take a look at the code provided by Microsoft FASTER. They have perf tests around different ConcurrentDictionary use cases.
@@ -109,5 +109,19 @@ public static uint FastMod(uint value, uint divisor, ulong multiplier)
    Debug.Assert(highbits == value % divisor);
    return highbits;
}

// returns 2^x >= size
public static int AlignToPowerOfTwo(int size)
Should use the existing helper in runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs (line 76 in 21734a4): internal static uint RoundUpToPowerOf2(uint value) (#43135 for reference).
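A minimal sketch of what that suggestion could look like, assuming the BitOperations helper is reachable from this file; the enclosing class name is a placeholder, and this is not necessarily the code that was ultimately committed:

using System.Diagnostics;
using System.Numerics;

internal static class HashHelpersSketch
{
    // Sketch: delegate to the LZCNT-based helper. RoundUpToPowerOf2 returns the smallest
    // power of two >= value, which matches the "returns 2^x >= size" contract for positive sizes.
    public static int AlignToPowerOfTwo(int size)
    {
        Debug.Assert(size > 0);
        return (int)BitOperations.RoundUpToPowerOf2((uint)size);
    }
}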
@gfoidl This was written a long time ago. I suspected we had an LZCNT-based helper for this by now when porting it over, but I did not have time to look it up yet. Thanks!!
I have run the concurrent dictionary benchmarks from dotnet/performance. These benchmarks for the most part test the throughput of single-threaded operations on a small 512-element dictionary. This is challenging for an implementation designed to scale, since the extra complexity added to improve scalability is not very beneficial there. Here is also the same set of tests with a larger 512K dictionary size. Nonblocking fares better here. It is still mostly single-threaded. A few things to note:
Do you want to first augment those with some others, which perhaps are more realistic?
The values that concern me the most are the ones for reading on smaller dictionaries, e.g. TryGetValue / Contains / etc. Some of these appear to have taken huge hits, with ratios like 1.42, 1.43, 1.26, etc. And such reading on a ConcurrentDictionary today scales well (in terms of concurrency) and is wait-free. It's also one of the most heavily exercised code paths: while some CDs are indeed about purely adding concurrently and building up a large dictionary as a result of some processing, the most common case is sporadic adds but heavy reads, e.g. for a cache.
It is hardly possible for the same code to work equally well in all scenarios. Could we think of "strategies" similar to those implemented for FileStream?
What if we implemented a separate, highly optimized class for this most popular case? (In the PowerShell repo we have an interesting case related to ConcurrentDictionary performance. Normally users never use many breakpoints in scripts, but Pester sets breakpoints on every line in the code-coverage scenario, and this works very slowly. It would be interesting to know whether this PR will solve that scenario or whether we will still have to accept PowerShell/PowerShell#14953.)
@stephentoub Right. The baseline implementation is good at the Get/ContainsKey scenario with small dictionaries, and in particular with small … An extreme case would be reading from a one-element dictionary. Ultimately, having pauseless rehashes requires a "read barrier", so some small cost will be present on the read path due to that.
256d46c to 1c8bb63 (Compare)
A brief update. For the 512K elements everything is generally faster, except that I have doubts the Get scenario could be further improved in a significant way. Ultimately there is the extra cost of the "read barrier", and there are open-hash/closed-hash differences that make some scenarios faster and some slower.

As with any hashtable, there are ways to shift the cost from one scenario to another. Tweaking the resize heuristics, for example, can improve perf by making the table less occupied and reducing the average reprobe, at the cost of consuming more memory. There are ways to improve the performance of lookup misses at the cost of lookup hits. I tried a few changes like that, but it felt too much like fitting into a particular benchmark, and the overall tradeoffs did not seem good.

I also tried running some TechEmpower benchmarks (plaintext, platform.json with various numbers of connections, on Linux), but I did not see differences that would be consistent and reproducible. Even though sockets use a concurrent dictionary, it does not seem to be a big enough use to show up in end-to-end results.

I need to think about what all this means, but it seems to me that if Get on a small …
For me, this sounds like "wait free", as a strict subset of "lock free".
/// Returns the approximate value of the counter at the time of the call.
/// </summary>
/// <remarks>
/// EstimatedValue could be significantly cheaper to obtain, but may be slightly delayed.
❔ Is 'stale' a better word choice here? The concern is that the reader could interpret "slightly delayed" to mean "slower to return". Stale has its own downsides (e.g. is it eventually consistent?) so I would leave the call to y'all.
/// EstimatedValue could be significantly cheaper to obtain, but may be slightly delayed.
/// EstimatedValue could be significantly cheaper to obtain, but may be stale.
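For context, a conceptual sketch of the kind of counter this remark describes: a striped counter whose Value sums all cells on each call, while EstimatedValue returns a cached sum that can lag behind. This illustrates the Value/EstimatedValue distinction only; it is not the counter type in this PR.

using System;
using System.Threading;

internal sealed class StripedCounter
{
    // One cell per core to spread contention; cache-line padding omitted for brevity.
    private readonly long[] _cells = new long[Environment.ProcessorCount];
    private long _lastEstimate;

    public void Increment()
    {
        int i = Environment.CurrentManagedThreadId % _cells.Length;
        Interlocked.Increment(ref _cells[i]);
    }

    // Accurate value: sums every cell, which costs more as the cell count grows.
    public long Value
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < _cells.Length; i++)
            {
                sum += Volatile.Read(ref _cells[i]);
            }
            Volatile.Write(ref _lastEstimate, sum);
            return sum;
        }
    }

    // Cheap but possibly stale: just returns the last published sum.
    public long EstimatedValue => Volatile.Read(ref _lastEstimate);
}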
if (IntPtr.Size == 4)
{
    uint addr = (uint)&cellCount;
    return (int)(addr % cellCount);
}
else
{
    ulong addr = (ulong)&cellCount;
    return (int)(addr % cellCount);
}
❔ Can this use nuint now?
Strangely enough, (nuint)&cellCount; is an error in C#.
(nuint)(&cellCount); works though, so I will use that. Thanks!
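Put together, a small sketch of the nuint-based version discussed above; the method and class names are placeholders, and only the cast and the modulo come from the conversation:

internal static class CounterCellIndex
{
    // The address of a stack local differs per thread/stack frame, so it serves as a cheap
    // quasi-random input for picking a cell; nuint covers 32-bit and 64-bit in one code path.
    internal static unsafe int GetIndex(int cellCount)
    {
        nuint addr = (nuint)(&cellCount);
        return (int)(addr % (nuint)cellCount);
    }
}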
Also, FastMod could be used here, although it is cumbersome since it needs two values that must match and thus be updated atomically.
@sharwell "Wait-free" is generally understood as non-blocking with an upper bound on the number of steps per action. So if threads A and B collide, in the non-blocking case thread A can simply retry, possibly after helping B – to make sure we are not again in the same situation after going around. With wait-free, we would also need to ensure it is not the same thread A who keeps pulling the short straw. That can be arranged by adding a turn/ticket system and maybe some roll-back ability, so that B could yield even if it made more progress than A. In practice a wait-free guarantee is rarely necessary. If collisions are rare, or it can be ensured that they are rare (by doing exponential back-off, for example), then there is typically more than enough "long-term fairness" in the system without explicit guarantees.
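As a generic illustration of the retry-on-collision pattern described above (not code from this PR): the loop is lock-free, since some thread always completes its CAS, but an individual thread may retry an unbounded number of times, so it is not wait-free; SpinWait provides the growing back-off.

using System;
using System.Threading;

internal static class LockFreeOps
{
    // Atomically adds 'delta' to 'location' but never lets the value exceed 'max'.
    internal static int AddClamped(ref int location, int delta, int max)
    {
        SpinWait spinner = default;
        while (true)
        {
            int observed = Volatile.Read(ref location);
            int desired = Math.Min(observed + delta, max);

            // CAS publishes the update; if another thread changed the value first, back off and retry.
            if (Interlocked.CompareExchange(ref location, desired, observed) == observed)
            {
                return desired;
            }

            spinner.SpinOnce(); // back-off grows as collisions repeat
        }
    }
}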
Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.