Single epoll thread per 28 cores #35800
Conversation
Tagging subscribers to this area: @dotnet/ncl
kestrel-linux-transport doesn't use ConcurrentDictionary; instead, a regular Dictionary with a lock is used. The lookup is performed up-front, which improves locality. Previous benchmarks for ConcurrentDictionary vs Dictionary+lock showed only a small difference. Maybe we'll see a bigger difference for this scenario.
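To make the pattern concrete, here's a minimal sketch of a Dictionary-plus-lock handler table with the lookups done up-front per batch (type and member names are hypothetical, not the actual kestrel-linux-transport code):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a plain Dictionary guarded by a lock, where the contexts
// for a whole batch of epoll events are looked up in a single lock acquisition,
// and the per-event processing happens outside of the lock.
internal sealed class SocketHandlerTable
{
    private readonly Dictionary<IntPtr, object> _handlers = new Dictionary<IntPtr, object>();
    private readonly object _gate = new object();

    public void Register(IntPtr handle, object context)
    {
        lock (_gate)
        {
            _handlers[handle] = context;
        }
    }

    // One lock acquisition per batch instead of one per event.
    public void LookupBatch(IntPtr[] handles, object[] contexts, int count)
    {
        lock (_gate)
        {
            for (int i = 0; i < count; i++)
            {
                _handlers.TryGetValue(handles[i], out contexts[i]);
            }
        }
    }
}
```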
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs
// the goal is to have a dedicated generic instantiation and using:
// System.Collections.Concurrent.ConcurrentDictionary`2[System.IntPtr,System.Net.Sockets.SocketAsyncContextWrapper]::TryGetValueInternal(!0,int32,!1&)
// instead of:
// System.Collections.Concurrent.ConcurrentDictionary`2[System.IntPtr,System.__Canon]::TryGetValueInternal(!0,int32,!1&)
Curious that this would perform better. Why is the dedicated generic instantiation better?
Do we ever update the value for an existing key in the dictionary? If we do, this will make updates more expensive, as they'll be forced to allocate a new node in the CD, whereas with a reference type value, the existing node will be used.
as for why a specific generic instantiation would do better, presumably it's because it's avoiding the generic dictionary lookup, or helping with inlining, or something like that? Often I see a similar optimization applied as a workaround for removing array covariance checks, but that's on writes, which we shouldn't be doing here frequently.
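For reference, the trick under discussion looks roughly like this (sketch only; Socket stands in for the internal SocketAsyncContext, and the wrapper name mirrors the one quoted above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Sockets;

// Wrapping the reference-type value in a struct changes the value type argument
// from a reference type (shared System.__Canon code) to a value type, so the JIT
// emits a dedicated instantiation of ConcurrentDictionary's methods for it.
internal readonly struct SocketAsyncContextWrapper
{
    public SocketAsyncContextWrapper(Socket context) => Context = context;

    public Socket Context { get; }
}

internal static class DedicatedInstantiationExample
{
    private static readonly ConcurrentDictionary<IntPtr, SocketAsyncContextWrapper> s_handleToContext =
        new ConcurrentDictionary<IntPtr, SocketAsyncContextWrapper>();

    public static Socket Lookup(IntPtr handle)
    {
        // This call binds to ConcurrentDictionary`2[IntPtr, SocketAsyncContextWrapper],
        // not to the shared [IntPtr, __Canon] instantiation.
        s_handleToContext.TryGetValue(handle, out SocketAsyncContextWrapper wrapper);
        return wrapper.Context;
    }
}
```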
Do we ever update the value for an existing key in the dictionary?
We don't. We use incremental keys for each AsyncContext. When we run out of keys, we start a new SocketEngine.
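A rough sketch of that key scheme, just to illustrate the idea (names and the limit are hypothetical, not the PR's actual implementation):

```csharp
using System;

// Hypothetical sketch: each engine hands out monotonically increasing IntPtr keys
// for new contexts; once the key space is exhausted, the caller starts a new engine.
// Callers are assumed to synchronize access (e.g. under the same lock as the dictionary).
internal sealed class EpollEngineSketch
{
    private IntPtr _nextKey;                                  // next key to hand out
    private readonly IntPtr _maxKey = (IntPtr)int.MaxValue;   // illustrative limit

    public bool TryAllocateKey(out IntPtr key)
    {
        if (_nextKey.ToInt64() >= _maxKey.ToInt64())
        {
            key = IntPtr.Zero;
            return false;   // exhausted: the caller should start a new engine
        }

        key = _nextKey;
        _nextKey = (IntPtr)(_nextKey.ToInt64() + 1);
        return true;
    }
}
```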
as for why a specific generic instantiation would do better, presumably it's because it's avoiding the generic dictionary lookup, or helping with inlining, or something like that?
I'm curious what it is.
From the graph it looks like the contention is coming from
👍 The lock seems to be rarely taken on other paths, so it could be taken here around the whole inner loop with faster lookups.
Ah, I didn't read that right, never mind.
Was the 20K clients test also with 1 epoll thread, or would it use 20? I figure with 1 epoll thread
For 20k connections from a single load machine: before your change it was 740k RPS with 14 epoll threads (Cores / 2). With your change and a single epoll thread it dropped to 715k RPS; with the micro-optimizations from this PR it's 740k again.
…roves the perf for ARM and for scenarios with MANY clients
Update: from the initial data that I have, it looks like switching from a concurrent to a regular dictionary under a lock (thanks for a great hint, @tmds!) combined with the few micro-optimizations is enough to always have a single epoll thread. I am now going to run the program that covers a matrix of configurations and share the results. If there are no regressions, I am going to ask you for review again. For now, please don't merge it.
How to read the results
Colors: default MS Excel color scheme, where red means the worst and green means the best result.
I've shared the numbers from my most recent experiment in a comment above. PTAL. Based on these numbers I came up with the following proposal for the heuristic that determines the number of epoll threads:
The code that I've just pushed gives the following results: ratio is
// the data that stands behind this heuristic can be found at https://github.com/dotnet/runtime/pull/35800#issuecomment-624719500
// the goal is to have a single epoll thread per every 28 cores
const int coresPerSingleEpollThread = 28;
The benchmarks confirm an observation we had made from the perf traces. It may be interesting to put it in the comment:
TechEmpower JSON platform benchmark (which has a low workload per request) shows the epoll thread is fully loaded on a 28-core machine. We add 1 epoll thread per 28 cores to avoid it being a bottleneck.
I can already hear all the complaints about this line ... me being the first concerned. I think we should not have any heuristic with such a value. What if tomorrow we decide to use different hardware to do benchmarks? We should have some heuristics that are good for the general cases, and allow customers to define custom values that might be better for them. In our case, in the TE repository we would then define an ENV with the number of epoll threads we want. Same for ARM probably, which might depend on each vendor.
I can already hear all the complaints about this line ... me being the first concerned.
I guess you are referring to the suggestion I made? I prefer to put it explicitly here rather than have it implicit in the linked comment.
From the benchmarking we did, TE platform JSON benchmark represents the lowest threadpool workload per request. This means the epoll thread will sooner become the bottleneck than on the other benchmarks.
We should have some heuristics that are good for the general cases
This heuristic is good for the general case. It uses fewer epoll threads than the previous heuristic (which was a guess with little benchmarking done) and achieves higher performance.
and allow customer to define custom values that might be better for them.
This is now possible: the count can be set explicitly using the env var.
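A sketch of how that override might be consumed (the variable name below is an assumption on my part; the exact knob is the one defined in this PR):

```csharp
using System;

internal static class EpollThreadCountSketch
{
    // Sketch only: prefer an explicit override from the environment and fall back
    // to the core-count heuristic otherwise. The variable name is assumed, not authoritative.
    private const string OverrideVariable = "DOTNET_SYSTEM_NET_SOCKETS_THREAD_COUNT";

    internal static int GetEngineCount()
    {
        string value = Environment.GetEnvironmentVariable(OverrideVariable);
        if (uint.TryParse(value, out uint explicitCount) && explicitCount > 0)
        {
            return (int)explicitCount;
        }

        const int coresPerEngine = 28; // the constant discussed in this thread
        return Math.Max(1, (int)Math.Round(Environment.ProcessorCount / (double)coresPerEngine));
    }
}
```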
The env var is good. But the number 28 within the code is my concern. Unless we say we need to pick one, but 28 because CITRINE should not be the reason, IMO.
@adamsitnik, on the 56-core machine (2-socket 28-core), are the numbers with or without the env vars COMPlus_Thread_UseAllCpuGroups=1 and COMPlus_GCCpuGroup=1? Without those I suspect it would only be using one CPU group and behaving as a single-socket 28-core machine; with those it should try to use both sockets. The numbers seem to be similar to the 28-core machine and it's a bit odd that 2 epoll threads do better there, though there may be other things going on.
It might be common to set those env vars on multi-numa-node machines if the intention is to scale up. Might also be interesting to try the AMD machine with those env vars since it also has multiple numa nodes. Not suggesting for this change or anything but it might provide more insights on heuristics for number of epoll threads.
Ahh, never mind; from a brief look it almost looks like all of the CPU group stuff is disabled on Linux and those env vars may not have any effect. Sorry for the distraction.
It might be common to set those env vars on multi-numa-node machines if the intention is to scale up.
Thanks for pointing this out. I am going to run this config as well.
Pls see my latest comment :) there might be more work to do there in the VM; I'm not up-to-date on what's happening there.
I agree with @sebastienros that this value is fairly arbitrary, based on the specific (and limited) hardware we've tested on. Does the heuristic hold up on machines with a similar number of cores but a different distribution across nodes? What about when hyperthreading is disabled? Did we try it with cloud VMs?
We see 1 epoll thread is enough to load a 28-core machine (Citrine) with a benchmark that has a low threadpool workload vs epoll workload (TE JSON platform).
That's what is captured by coresPerSingleEpollThread = 28.
This heuristic also works well in the likely cases that:
- ProcessorCount is lower than 28
- the threadpool workload per epoll workload is higher
This heuristic isn't tuned for multi-node machines, or machines with 28+++ procs.
@stephentoub @adamsitnik I think we can remove the MinHandles logic.
Environment.ProcessorCount >= 6 ? Environment.ProcessorCount / 2 : 1;
#endif
#pragma warning restore CA1802
private static readonly int s_engineCount = GetEnginesCount();
Nit: should this be s_maxEngineCount? We won't always have this many, but we may grow to this many based on the number of concurrent sockets, right?
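To illustrate the behaviour being referred to, a loose sketch of "create engines on demand, up to the computed maximum, as sockets are registered" (names and the threshold are hypothetical, not the actual SocketAsyncEngine code):

```csharp
using System;

// Hypothetical sketch of growing the engine set on demand: a new engine is only
// started once the existing ones each serve some minimum number of handles,
// capped at the computed maximum engine count.
internal static class EnginePoolSketch
{
    private const int MaxEngineCount = 4;        // stand-in for GetEnginesCount()
    private const int MinHandlesPerEngine = 32;  // illustrative threshold

    private static readonly object s_lock = new object();
    private static object[] s_engines = Array.Empty<object>(); // stand-in for SocketAsyncEngine[]
    private static int s_registeredHandles;

    public static object GetEngineForNewSocket()
    {
        lock (s_lock)
        {
            s_registeredHandles++;
            int desired = Math.Min(MaxEngineCount, 1 + (s_registeredHandles - 1) / MinHandlesPerEngine);
            if (desired > s_engines.Length)
            {
                Array.Resize(ref s_engines, desired);
                s_engines[desired - 1] = new object(); // start another engine
            }

            // Spread new sockets across the engines that exist so far.
            return s_engines[(s_registeredHandles - 1) % s_engines.Length];
        }
    }
}
```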
@stephentoub I've suggested to remove that logic as part of this PR (#35800 (comment)).
@tmds You are most probably right. The only use case for keeping it is a machine with many cores and very few connections, which should be uncommon.
Would you prefer me to remove it now or would you like to do this in your upcoming PR that is going to enable the "inlining"?
@adamsitnik remove it here; it is unrelated to inlining.
@tmds I am going to merge it as it is right now, as I would really love to see the updated numbers. I am going to send a PR with the MinHandles logic removal today or tomorrow.
return Math.Min(result, Environment.ProcessorCount / 2);
return Math.Max(1, (int)Math.Round(Environment.ProcessorCount / (double)coresPerEngine));
and "round" it up, in a way that 29 cores gets 2 epoll threads
So now anything below 44 cores on x64 will get 1 thread? Then 2 after 76 cores ...
Yes, and this should be enough for the vast majority of real-life scenarios.
TechEmpower is super artificial (super small socket reads and writes & extremely high load), and even under such high load, one engine (producer) is capable of keeping up to 30 CPU cores busy (8 on ARM). This is possible thanks to the amazing work that @kouvel has done in #35330. In a real-life scenario, nobody should ever need more than one epoll thread for the entire app. But we can't predict all possible usages, and I believe that these numbers (30 & 8) are safe because it would be super hard to generate more network load. I've simplified the heuristic and added an explanation.
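To make the rounding behaviour concrete, here is a small sketch of what the quoted heuristic yields for a few hypothetical core counts (using the 28 from the earlier diff; other constants such as the 30 and 8 mentioned above plug into the same formula):

```csharp
using System;

internal static class EpollHeuristicDemo
{
    // Mirrors the heuristic from the diff above; 28 comes from coresPerSingleEpollThread.
    private static int EngineCount(int processorCount, int coresPerEngine = 28) =>
        Math.Max(1, (int)Math.Round(processorCount / (double)coresPerEngine));

    private static void Main()
    {
        foreach (int cores in new[] { 4, 12, 28, 42, 56, 112 })
        {
            // Prints: 4 -> 1, 12 -> 1, 28 -> 1, 42 -> 2, 56 -> 2, 112 -> 4
            Console.WriteLine($"{cores} cores -> {EngineCount(cores)} epoll thread(s)");
        }
    }
}
```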
Edit: this PR has evolved over time; please see the next comments for an accurate description.