Non-blocking ConcurrentDictionary #50337
Conversation
Tagging subscribers to this area: @eiriktsarpalis. Issue Details: Not done yet. In progress.
@AArnott does VS have a benchmark that stresses ConcurrentDictionary? Or perhaps a usage pattern we should be sure to measure.
Thanks for making progress on this! I remember talking to you about trying to replace the guts of ConcurrentDictionary with your implementation over six years ago 😄 I'll look forward to reviewing the implementation. In the meantime, there are some benchmarks in dotnet/performance. However, from a performance perspective, the most important thing is that reads are as fast or faster, and then obviously the purpose of such an implementation would be to improve the scalability of updates, so that should be proven, while also not meaningfully regressing at lower core counts. It's also relatively common to create ConcurrentDictionary instances even if they're not heavily used, so measuring just the raw overhead of constructing an instance (including allocations) is of interest.
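For illustration, here is a minimal BenchmarkDotNet sketch of the kinds of measurements described above: read throughput on a small prepopulated dictionary, plus the raw cost of constructing an instance. The 512-element size and the keys used are arbitrary assumptions for the sketch, not taken from dotnet/performance.

using System.Collections.Concurrent;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // include allocation numbers for the construction benchmark
public class ConcurrentDictionaryReadBenchmarks
{
    private ConcurrentDictionary<int, int> _dict = null!;

    [GlobalSetup]
    public void Setup()
    {
        // Small dictionary, read-heavy workload: the case that must not regress.
        _dict = new ConcurrentDictionary<int, int>();
        for (int i = 0; i < 512; i++)
        {
            _dict.TryAdd(i, i);
        }
    }

    [Benchmark]
    public bool TryGetValue_Hit() => _dict.TryGetValue(256, out _);

    [Benchmark]
    public bool TryGetValue_Miss() => _dict.TryGetValue(100_000, out _);

    // Raw overhead of creating an instance, including allocations.
    [Benchmark]
    public ConcurrentDictionary<int, int> Construct() => new();

    public static void Main() => BenchmarkRunner.Run<ConcurrentDictionaryReadBenchmarks>();
}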
I was not happy with the guarantee on when dead keys can be GC-ed (eventually, maybe, if/when resize happens).
What are the key differences here? What does non-blocking mean? No locks?
@davidfowl - it is basically lock-free, without ambiguities like "are spin-locks ok?".
@danmoseley Jeff Robison's file watcher service in VS makes heavy concurrent use of ConcurrentDictionary, I believe. I suspect he has stress tests for it. Do you want to reach out to him?
@VSadov would that be useful?
You don't need to go far to find uses of ConcurrentDictionary. Every socket operation on Linux performs a read on a ConcurrentDictionary (for a mapping from socket file descriptor to corresponding context). Every HTTP request on an HttpClient/SocketsHttpHandler performs a read on a ConcurrentDictionary (for the connection pool). Every static method on Regex reads a ConcurrentDictionary (for the regex cache). Every serialization with JsonSerializer reads a ConcurrentDictionary (to find the previously catalogued data for the target type). Etc.
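Most of those uses share the same "sporadic adds, heavy reads" cache shape. A generic sketch of that access pattern (this is an illustration only, not the actual Regex cache implementation):

using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class PatternCache
{
    // Hypothetical cache: sporadic adds (first use of a pattern), heavy reads afterwards.
    private static readonly ConcurrentDictionary<string, Regex> s_cache = new();

    public static bool IsMatch(string pattern, string input) =>
        // The hot path is a dictionary lookup; the factory only runs on a cache miss.
        s_cache.GetOrAdd(pattern, static p => new Regex(p, RegexOptions.Compiled)).IsMatch(input);
}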
As for benchmarks, you could take a look at the code provided by Microsoft FASTER. They have perf tests around different ConcurrentDictionary use cases.
@@ -109,5 +109,19 @@ public static uint FastMod(uint value, uint divisor, ulong multiplier)
    Debug.Assert(highbits == value % divisor);
    return highbits;
}

// returns 2^x >= size
public static int AlignToPowerOfTwo(int size)
Should use the existing helper in runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs (line 76 in 21734a4): internal static uint RoundUpToPowerOf2(uint value) (#43135 for reference).
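A minimal sketch of what that suggestion could look like, assuming the BitOperations helper is reachable from this file; the enclosing class name is a placeholder, and this is not necessarily the code that was ultimately committed:

using System.Diagnostics;
using System.Numerics;

internal static class HashHelpersSketch
{
    // Sketch: delegate to the LZCNT-based helper. RoundUpToPowerOf2 returns the smallest
    // power of two >= value, which matches the "returns 2^x >= size" contract for positive sizes.
    public static int AlignToPowerOfTwo(int size)
    {
        Debug.Assert(size > 0);
        return (int)BitOperations.RoundUpToPowerOf2((uint)size);
    }
}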
@gfoidl This was written a long time ago. I suspected we had an LZCNT-based helper for this by now when porting it over, but I did not have time to look it up yet. Thanks!!
I have run the concurrent dictionary benchmarks from dotnet/performance. These benchmarks for the most part test the throughput of single-threaded operations on a small 512-element dictionary. This is challenging for an implementation designed to scale, since the extra complexity added to improve scalability is not very beneficial there. Here is also the same set of tests with a larger 512K dictionary size. Nonblocking fares better here. It is still mostly single-threaded. A few things to note:
Do you want to first augment those with some others, which perhaps are more realistic?
The values that concern me the most are the ones for reading on smaller dictionaries, e.g. TryGetValue / Contains / etc. Some of these appear to have taken huge hits, with ratios like 1.42, 1.43, 1.26, etc. And such reading on a ConcurrentDictionary today scales well (in terms of concurrency) and is wait-free. It's also one of the most heavily exercised code paths: while some CDs are indeed about purely adding concurrently and building up a large dictionary as a result of some processing, the most common case is sporadic adds but heavy reads, e.g. for a cache.
It is hardly possible for the same code to work equally well in all scenarios. Could we think of "strategies" similar to those implemented for FileStream?
What if we implemented a separate, highly optimized class for this most popular case? (In the PowerShell repo we have an interesting case related to ConcurrentDictionary performance. Normally users never use many breakpoints in scripts, but Pester sets breakpoints on every line in the code-coverage scenario, and this works very slowly. It would be interesting to know whether this PR will solve that scenario or whether we will still have to accept PowerShell/PowerShell#14953.)
@stephentoub Right. The baseline implementation is good at the Get/ContainsKey scenario with small dictionaries, and in particular with small … An extreme case would be reading from a one-element dictionary. Ultimately, having pauseless rehashes requires a "read barrier", so some small cost will be present on the read path due to that.
256d46c to 1c8bb63 (Compare)
A brief update. For the 512K elements everything is generally faster, except that I have doubts the Get scenario could be further improved in a significant way. Ultimately there is the extra cost of the "read barrier", and there are open-hash/closed-hash differences that make some scenarios faster and some slower.

As with any hashtable, there are ways to shift the cost from one scenario to another. Tweaking the resize heuristics, for example, can improve perf by making the table less occupied and reducing the average reprobe, at the cost of consuming more memory. There are ways to improve the performance of lookup misses at the cost of lookup hits. I tried a few changes like that, but it felt too much like fitting into a particular benchmark, and the overall tradeoffs did not seem good.

I also tried running some TechEmpower benchmarks (plaintext, platform.json with various numbers of connections, on Linux), but I did not see differences that would be consistent and reproducible. Even though sockets use a concurrent dictionary, it does not seem to be a big enough use to show up in end-to-end results.

I need to think about what all this means, but it seems to me that if Get on a small …
For me, this sounds like "wait free", as a strict subset of "lock free".
/// Returns the approximate value of the counter at the time of the call.
/// </summary>
/// <remarks>
/// EstimatedValue could be significantly cheaper to obtain, but may be slightly delayed.
❔ Is 'stale' a better word choice here? The concern is that the reader could interpret "slightly delayed" to mean "slower to return". Stale has its own downsides (e.g. is it eventually consistent?) so I would leave the call to y'all.
/// EstimatedValue could be significantly cheaper to obtain, but may be slightly delayed.
/// EstimatedValue could be significantly cheaper to obtain, but may be stale.
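For context, a conceptual sketch of the kind of counter this remark describes: a striped counter whose Value sums all cells on each call, while EstimatedValue returns a cached sum that can lag behind. This illustrates the Value/EstimatedValue distinction only; it is not the counter type in this PR.

using System;
using System.Threading;

internal sealed class StripedCounter
{
    // One cell per core to spread contention; cache-line padding omitted for brevity.
    private readonly long[] _cells = new long[Environment.ProcessorCount];
    private long _lastEstimate;

    public void Increment()
    {
        int i = Environment.CurrentManagedThreadId % _cells.Length;
        Interlocked.Increment(ref _cells[i]);
    }

    // Accurate value: sums every cell, which costs more as the cell count grows.
    public long Value
    {
        get
        {
            long sum = 0;
            for (int i = 0; i < _cells.Length; i++)
            {
                sum += Volatile.Read(ref _cells[i]);
            }
            Volatile.Write(ref _lastEstimate, sum);
            return sum;
        }
    }

    // Cheap but possibly stale: just returns the last published sum.
    public long EstimatedValue => Volatile.Read(ref _lastEstimate);
}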
if (IntPtr.Size == 4)
{
    uint addr = (uint)&cellCount;
    return (int)(addr % cellCount);
}
else
{
    ulong addr = (ulong)&cellCount;
    return (int)(addr % cellCount);
}
❔ Can this use nuint now?
Strangely enough, (nuint)&cellCount; is an error in C#.
(nuint)(&cellCount); works though, so I will use that. Thanks!
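Put together, a small sketch of the nuint-based version discussed above; the method and class names are placeholders, and only the cast and the modulo come from the conversation:

internal static class CounterCellIndex
{
    // The address of a stack local differs per thread/stack frame, so it serves as a cheap
    // quasi-random input for picking a cell; nuint covers 32-bit and 64-bit in one code path.
    internal static unsafe int GetIndex(int cellCount)
    {
        nuint addr = (nuint)(&cellCount);
        return (int)(addr % (nuint)cellCount);
    }
}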
Also, FastMod could be used here, although it is cumbersome since it needs two values that must match and thus be updated atomically.
@sharwell "Wait-free" is generally understood as non-blocking with an upper bound on the number of steps per action. So if threads A and B collide, in the non-blocking case thread A can simply retry, possibly after helping B – to make sure we are not again in the same situation after going around. With wait-free, we would also need to ensure it is not the same thread A who keeps pulling the short straw. That can be arranged by adding a turn/ticket system and maybe some roll-back ability, so that B could yield even if it made more progress than A. In practice a wait-free guarantee is rarely necessary. If collisions are rare, or it can be ensured that they are rare (by doing exponential back-off, for example), then there is typically more than enough "long-term fairness" in the system without explicit guarantees.
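As a generic illustration of the retry-on-collision pattern described above (not code from this PR): the loop is lock-free, since some thread always completes its CAS, but an individual thread may retry an unbounded number of times, so it is not wait-free; SpinWait provides the growing back-off.

using System;
using System.Threading;

internal static class LockFreeOps
{
    // Atomically adds 'delta' to 'location' but never lets the value exceed 'max'.
    internal static int AddClamped(ref int location, int delta, int max)
    {
        SpinWait spinner = default;
        while (true)
        {
            int observed = Volatile.Read(ref location);
            int desired = Math.Min(observed + delta, max);

            // CAS publishes the update; if another thread changed the value first, back off and retry.
            if (Interlocked.CompareExchange(ref location, desired, observed) == observed)
            {
                return desired;
            }

            spinner.SpinOnce(); // back-off grows as collisions repeat
        }
    }
}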
Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.