Single epoll thread per 28 cores #35800
Conversation
Tagging subscribers to this area: @dotnet/ncl
kestrel-linux-transport doesn't use ConcurrentDictionary; instead, a regular Dictionary with a lock is used. The lookup is performed up-front, which improves locality. Previous benchmarks for ConcurrentDictionary vs Dictionary+lock showed only a small difference. Maybe we'll see a bigger difference for this scenario.
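To make the pattern concrete, here's a minimal sketch of a Dictionary-plus-lock handler table with the lookups done up-front per batch (type and member names are hypothetical, not the actual kestrel-linux-transport code):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a plain Dictionary guarded by a lock, where the contexts
// for a whole batch of epoll events are looked up in a single lock acquisition,
// and the per-event processing happens outside of the lock.
internal sealed class SocketHandlerTable
{
    private readonly Dictionary<IntPtr, object> _handlers = new Dictionary<IntPtr, object>();
    private readonly object _gate = new object();

    public void Register(IntPtr handle, object context)
    {
        lock (_gate)
        {
            _handlers[handle] = context;
        }
    }

    // One lock acquisition per batch instead of one per event.
    public void LookupBatch(IntPtr[] handles, object[] contexts, int count)
    {
        lock (_gate)
        {
            for (int i = 0; i < count; i++)
            {
                _handlers.TryGetValue(handles[i], out contexts[i]);
            }
        }
    }
}
```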
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs
// the goal is to have a dedicated generic instantiation and using:
// System.Collections.Concurrent.ConcurrentDictionary`2[System.IntPtr,System.Net.Sockets.SocketAsyncContextWrapper]::TryGetValueInternal(!0,int32,!1&)
// instead of:
// System.Collections.Concurrent.ConcurrentDictionary`2[System.IntPtr,System.__Canon]::TryGetValueInternal(!0,int32,!1&)
Curious that this would perform better. Why is the dedicated generic instantiation better?
Do we ever update the value for an existing key in the dictionary? If we do, this will make updates more expensive, as they'll be forced to allocate a new node in the CD, whereas with a reference type value, the existing node will be used.
as for why a specific generic instantiation would do better, presumably it's because it's avoiding the generic dictionary lookup, or helping with inlining, or something like that? Often I see a similar optimization applied as a workaround for removing array covariance checks, but that's on writes, which we shouldn't be doing here frequently.
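For reference, the trick under discussion looks roughly like this (sketch only; Socket stands in for the internal SocketAsyncContext, and the wrapper name mirrors the one quoted above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Sockets;

// Wrapping the reference-type value in a struct changes the value type argument
// from a reference type (shared System.__Canon code) to a value type, so the JIT
// emits a dedicated instantiation of ConcurrentDictionary's methods for it.
internal readonly struct SocketAsyncContextWrapper
{
    public SocketAsyncContextWrapper(Socket context) => Context = context;

    public Socket Context { get; }
}

internal static class DedicatedInstantiationExample
{
    private static readonly ConcurrentDictionary<IntPtr, SocketAsyncContextWrapper> s_handleToContext =
        new ConcurrentDictionary<IntPtr, SocketAsyncContextWrapper>();

    public static Socket Lookup(IntPtr handle)
    {
        // This call binds to ConcurrentDictionary`2[IntPtr, SocketAsyncContextWrapper],
        // not to the shared [IntPtr, __Canon] instantiation.
        s_handleToContext.TryGetValue(handle, out SocketAsyncContextWrapper wrapper);
        return wrapper.Context;
    }
}
```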
Do we ever update the value for an existing key in the dictionary?
We don't. We use incremental keys for each AsyncContext. When we run out of keys, we start a new SocketEngine.
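A rough sketch of that key scheme, just to illustrate the idea (names and the limit are hypothetical, not the PR's actual implementation):

```csharp
using System;

// Hypothetical sketch: each engine hands out monotonically increasing IntPtr keys
// for new contexts; once the key space is exhausted, the caller starts a new engine.
// Callers are assumed to synchronize access (e.g. under the same lock as the dictionary).
internal sealed class EpollEngineSketch
{
    private IntPtr _nextKey;                                  // next key to hand out
    private readonly IntPtr _maxKey = (IntPtr)int.MaxValue;   // illustrative limit

    public bool TryAllocateKey(out IntPtr key)
    {
        if (_nextKey.ToInt64() >= _maxKey.ToInt64())
        {
            key = IntPtr.Zero;
            return false;   // exhausted: the caller should start a new engine
        }

        key = _nextKey;
        _nextKey = (IntPtr)(_nextKey.ToInt64() + 1);
        return true;
    }
}
```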
as for why a specific generic instantiation would do better, presumably it's because it's avoiding the generic dictionary lookup, or helping with inlining, or something like that?
I'm curious what it is.
From the graph it looks like the contention is coming from
👍 The lock seems to be rarely taken on other paths, so it could be taken here around the whole inner loop with faster lookups.
Ah, I didn't read that right, never mind.
Was the 20K clients test also with 1 epoll thread, or would it use 20? I figure with 1 epoll thread
For 20k connections from a single load machine: before your change it was 740k RPS with 14 epoll threads (Cores / 2). With your change and a single epoll thread it dropped to 715k RPS; with the micro-optimizations from this PR it's 740k again.
…roves the perf for ARM and for scenarios with MANY clients
Update: from the initial data that I have, it looks like switching from a concurrent to a regular dictionary under a lock (thanks for a great hint, @tmds!) combined with the few micro-optimizations is enough to always have a single epoll thread. I am now going to run the program that covers a matrix of configurations and share the results. If there are no regressions, I am going to ask you for review again. For now, please don't merge it.
How to read the results
Colors: default MS Excel color scheme, where red means the worst and green means the best result.
I've shared the numbers from my most recent experiment in a comment above. PTAL. Based on these numbers I came up with the following proposal for the heuristic that determines the number of epoll threads:
The code that I've just pushed gives the following results: ratio is
// the data that stands behind this heuristic can be found at https://github.com/dotnet/runtime/pull/35800#issuecomment-624719500
// the goal is to have a single epoll thread per every 28 cores
const int coresPerSingleEpollThread = 28;
The benchmarks confirm an observation we had made from the perf traces. It may be interesting to put it in the comment:
TechEmpower JSON platform benchmark (which has a low workload per request) shows the epoll thread is fully loaded on a 28-core machine. We add 1 epoll thread per 28 cores to avoid it being a bottleneck.
I can already hear all the complaints about this line ... me being the first concerned. I think we should not have any heuristic with such a value. What if tomorrow we decide to use different hardware to do benchmarks? We should have some heuristics that are good for the general cases, and allow customers to define custom values that might be better for them. In our case, in the TE repository we would then define an ENV with the number of epoll threads we want. Same for ARM probably, which might depend on each vendor.
I can already hear all the complaints about this line ... me being the first concerned.
I guess you are referring to the suggestion I made? I prefer to put it explicitly here rather than have it implicit in the linked comment.
From the benchmarking we did, TE platform JSON benchmark represents the lowest threadpool workload per request. This means the epoll thread will sooner become the bottleneck than on the other benchmarks.
We should have some heuristics that are good for the general cases
This heuristic is good for the general case. It uses fewer epoll threads than the previous heuristic (which was a guess with little benchmarking done) and achieves higher performance.
and allow customer to define custom values that might be better for them.
This is now possible: the count can be set explicitly using the env var.
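A sketch of how that override might be consumed (the variable name below is an assumption on my part; the exact knob is the one defined in this PR):

```csharp
using System;

internal static class EpollThreadCountSketch
{
    // Sketch only: prefer an explicit override from the environment and fall back
    // to the core-count heuristic otherwise. The variable name is assumed, not authoritative.
    private const string OverrideVariable = "DOTNET_SYSTEM_NET_SOCKETS_THREAD_COUNT";

    internal static int GetEngineCount()
    {
        string value = Environment.GetEnvironmentVariable(OverrideVariable);
        if (uint.TryParse(value, out uint explicitCount) && explicitCount > 0)
        {
            return (int)explicitCount;
        }

        const int coresPerEngine = 28; // the constant discussed in this thread
        return Math.Max(1, (int)Math.Round(Environment.ProcessorCount / (double)coresPerEngine));
    }
}
```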
The env var is good. But the number 28 within the code is my concern. Unless we say we need to pick one, but 28 because CITRINE should not be the reason, IMO.
@adamsitnik, on the 56-core machine (2-socket 28-core), are the numbers with or without the env vars COMPlus_Thread_UseAllCpuGroups=1 and COMPlus_GCCpuGroup=1? Without those I suspect it would only be using one CPU group and behaving as a single-socket 28-core machine; with those it should try to use both sockets. The numbers seem to be similar to the 28-core machine and it's a bit odd that 2 epoll threads do better there, though there may be other things going on.
It might be common to set those env vars on multi-numa-node machines if the intention is to scale up. Might also be interesting to try the AMD machine with those env vars since it also has multiple numa nodes. Not suggesting for this change or anything but it might provide more insights on heuristics for number of epoll threads.
Ahh, never mind; from a brief look it almost looks like all of the CPU group stuff is disabled on Linux and those env vars may not have any effect. Sorry for the distraction.
It might be common to set those env vars on multi-numa-node machines if the intention is to scale up.
Thanks for pointing this out. I am going to run this config as well.
Pls see my latest comment :) there might be more work to do there in the VM; I'm not up-to-date on what's happening there.
I agree with @sebastienros that this value is fairly arbitrary, based on the specific (and limited) hardware we've tested on. Does the heuristic hold up on machines with a similar number of cores but a different distribution across nodes? What about when hyperthreading is disabled? Did we try it with cloud VMs?
We see 1 epoll thread is enough to load a 28-core machine (Citrine) with a benchmark that has a low threadpool workload vs epoll workload (TE JSON platform).
That's what is captured by coresPerSingleEpollThread = 28.
This heuristic also works well in the likely cases that:
- ProcessorCount is lower than 28
- the threadpool workload per epoll workload is higher
This heuristic isn't tuned for multi-node machines, or machines with 28+++ procs.
@stephentoub @adamsitnik I think we can remove the MinHandles logic.
Environment.ProcessorCount >= 6 ? Environment.ProcessorCount / 2 : 1;
#endif
#pragma warning restore CA1802
private static readonly int s_engineCount = GetEnginesCount();
Nit: should this be s_maxEngineCount? We won't always have this many, but we may grow to this many based on the number of concurrent sockets, right?
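To illustrate the behaviour being referred to, a loose sketch of "create engines on demand, up to the computed maximum, as sockets are registered" (names and the threshold are hypothetical, not the actual SocketAsyncEngine code):

```csharp
using System;

// Hypothetical sketch of growing the engine set on demand: a new engine is only
// started once the existing ones each serve some minimum number of handles,
// capped at the computed maximum engine count.
internal static class EnginePoolSketch
{
    private const int MaxEngineCount = 4;        // stand-in for GetEnginesCount()
    private const int MinHandlesPerEngine = 32;  // illustrative threshold

    private static readonly object s_lock = new object();
    private static object[] s_engines = Array.Empty<object>(); // stand-in for SocketAsyncEngine[]
    private static int s_registeredHandles;

    public static object GetEngineForNewSocket()
    {
        lock (s_lock)
        {
            s_registeredHandles++;
            int desired = Math.Min(MaxEngineCount, 1 + (s_registeredHandles - 1) / MinHandlesPerEngine);
            if (desired > s_engines.Length)
            {
                Array.Resize(ref s_engines, desired);
                s_engines[desired - 1] = new object(); // start another engine
            }

            // Spread new sockets across the engines that exist so far.
            return s_engines[(s_registeredHandles - 1) % s_engines.Length];
        }
    }
}
```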
@stephentoub I've suggested to remove that logic as part of this PR (#35800 (comment)).
@tmds You are most probably right. The only use case for keeping it is a machine with many cores and very few connections, which should be uncommon.
Would you prefer me to remove it now or would you like to do this in your upcoming PR that is going to enable the "inlining"?
@adamsitnik remove it here; it is unrelated to inlining.
@tmds I am going to merge it as it is right now, as I would really love to see the updated numbers. I am going to send a PR with the MinHandles logic removal today or tomorrow.
return Math.Min(result, Environment.ProcessorCount / 2);
return Math.Max(1, (int)Math.Round(Environment.ProcessorCount / (double)coresPerEngine));
and "round" it up, in a way that 29 cores gets 2 epoll threads
So now anything below 44 cores on x64 will get 1 thread? Then 2 after 76 cores ...
Yes, and this should be enough for the vast majority of real-life scenarios.
TechEmpower is super artificial (super small socket reads and writes & extremely high load), and even under such high load, one engine (producer) is capable of keeping up to 30 CPU cores busy (8 on ARM). This is possible thanks to the amazing work that @kouvel has done in #35330. In a real-life scenario, nobody should ever need more than one epoll thread for the entire app. But we can't predict all possible usages, and I believe that these numbers (30 & 8) are safe because it would be super hard to generate more network load. I've simplified the heuristic and added an explanation.
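To make the rounding behaviour concrete, here is a small sketch of what the quoted heuristic yields for a few hypothetical core counts (using the 28 from the earlier diff; other constants such as the 30 and 8 mentioned above plug into the same formula):

```csharp
using System;

internal static class EpollHeuristicDemo
{
    // Mirrors the heuristic from the diff above; 28 comes from coresPerSingleEpollThread.
    private static int EngineCount(int processorCount, int coresPerEngine = 28) =>
        Math.Max(1, (int)Math.Round(processorCount / (double)coresPerEngine));

    private static void Main()
    {
        foreach (int cores in new[] { 4, 12, 28, 42, 56, 112 })
        {
            // Prints: 4 -> 1, 12 -> 1, 28 -> 1, 42 -> 2, 56 -> 2, 112 -> 4
            Console.WriteLine($"{cores} cores -> {EngineCount(cores)} epoll thread(s)");
        }
    }
}
```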
Edit: this PR has evolved over time; please see the next comments for an accurate description.