
System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() hotspot #44449

Rodrigo-Andrade opened this issue Nov 10, 2020 · 24 comments


Rodrigo-Andrade commented Nov 10, 2020

This is more of a loose question; feel free to close it if it's too off-topic.

For some time now I have been seeing System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() as a hotspot when profiling my code (a reverse proxy; we do plan to move to YARP once it's more mature).

It's always something like:

[profiler screenshot]

I see this on an 8-core VM, both on Windows and Linux. There is no thread starvation or anything of the sort.
Since that method seems to call into native code, I have no clue why this overhead is so big.

Any insight on the matter will, as always, be greatly appreciated.
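
For context, EnsureThreadRequested in the .NET 5 sources is approximately the shape below (a paraphrased sketch, not the exact code; numOutstandingThreadRequests and RequestWorkerThread are internal members of the thread pool). It CASes a counter of outstanding thread requests up to the processor count and, on success, asks the runtime for a worker thread, which is where the transition to native code happens:

internal void EnsureThreadRequested()
{
    // Only keep up to ProcessorCount thread requests outstanding at a time.
    int count = numOutstandingThreadRequests;
    while (count < Environment.ProcessorCount)
    {
        int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count + 1, count);
        if (prev == count)
        {
            // The expensive part: a call into the runtime to release or create a worker.
            ThreadPool.RequestWorkerThread();
            break;
        }
        count = prev;
    }
}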

@Rodrigo-Andrade Rodrigo-Andrade added the tenet-performance (Performance related issue) label Nov 10, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-System.Threading and untriaged (New issue has not been triaged by the area owner) labels Nov 10, 2020
@Rodrigo-Andrade Rodrigo-Andrade changed the title from System.Threading.ThreadPoolWorkQueue.ThreadPoolWorkQueue() hotspot to System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() hotspot Nov 10, 2020
EgorBo (Member) commented Nov 10, 2020

Most likely unrelated, but I noticed that EnsureThreadRequested has a few minor codegen issues; here is a simplified repro:

private int field;

bool EnsureThreadRequested(int count)
{
    int prev = Interlocked.CompareExchange(ref field, count + 1, count);
    return prev == count;
}

Current codegen (RyuJIT, Windows x64):

cmp      dword ptr [rcx], ecx  ; <-- redundant nullcheck, see #44087
add      rcx, 8
lea      r8d, [rdx+1]
mov      eax, edx
lock     
cmpxchg  dword ptr [rcx], r8d ; <-- cmpxchg modifies EFLAGS, so we could get rid of the following cmp
cmp      eax, edx
sete     al
movzx    rax, al
ret 

Codegen on Mono-LLVM (Linux x64 ABI, AT&T syntax):

push   %rax
mov    %esi,%eax
lea    0x1(%rax),%ecx
lock cmpxchg %ecx,0x10(%rdi)
sete   %al
pop    %rcx
retq

@mangod9 mangod9 added the question (Answer questions and provide assistance, not an issue with source code or documentation) label and removed the untriaged label Nov 12, 2020
@mangod9 mangod9 added this to the Future milestone Nov 12, 2020
mangod9 (Member) commented Nov 12, 2020

FYI @kouvel

kouvel (Member) commented Nov 13, 2020

In workloads with short bursts, the thread pool currently releases too many threads too quickly. It's a known issue; fixing it would involve some tradeoffs, but it would probably be a net improvement. I'm hoping to look into this for .NET 6.
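
A minimal sketch of the kind of workload this describes, with made-up burst sizes and timings: each burst drains quickly and the idle gap lets the queue empty, so every burst has to request worker threads again, and profiling something like this should show noticeable time under EnsureThreadRequested.

using System;
using System.Threading;

class BurstyWorkload
{
    static void Main()
    {
        const int BurstSize = 8;
        var done = new ManualResetEventSlim(false);
        int remaining = 0;

        for (int iter = 0; iter < 100_000; iter++)
        {
            // A short burst of tiny work items...
            remaining = BurstSize;
            done.Reset();
            for (int i = 0; i < BurstSize; i++)
            {
                ThreadPool.UnsafeQueueUserWorkItem(_ =>
                {
                    if (Interlocked.Decrement(ref remaining) == 0)
                        done.Set();
                }, null);
            }
            done.Wait();

            // ...then a brief idle gap before the next burst, so the pool's
            // pending-work count drops back to zero each time.
            Thread.SpinWait(1000);
        }
    }
}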

@kouvel kouvel modified the milestones: Future, 6.0.0 Nov 13, 2020
@kouvel kouvel removed the question label Nov 13, 2020
Rodrigo-Andrade (Author) commented:

@kouvel, any workarounds for now?

kouvel (Member) commented Nov 17, 2020

Are you seeing this in 3.1 or 5? In .NET 5 there has been some improvement to time spent there in async socket operations, though it doesn't fix the root cause. I'm not aware of any other workarounds that would reduce the CPU time there.

Rodrigo-Andrade (Author) commented:

Are you seeing this in 3.1 or 5?

It's .NET 5.

Using the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS environment variable and UnsafePreferInlineScheduling on the Kestrel transport greatly alleviated this (it doubled the app's performance). I guess that's because it makes the app enqueue to the thread pool much less. But it's a bit surprising that this overhead has such a great impact (on my app at least).
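
For anyone trying the same thing: the socket switch is an environment variable read at process start (non-Windows only, as noted below), set to 1 before launching the app, e.g.

export DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1

Both options run socket continuations inline on the event-handling threads instead of dispatching them to the thread pool, which is why they carry Unsafe/Dangerous names: a blocking continuation can stall the socket engine itself.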

kouvel (Member) commented Nov 17, 2020

I see, yeah, that would probably use the thread pool a lot less. The overhead probably means that the thread pool queue is staying relatively short, such that the active worker thread count remains below the processor count; and since the thread pool currently awakens more threads than it needs, it ends up taking the slower path to wake up a thread more frequently. The fix I have in mind would reduce that overhead, but it wouldn't eliminate it.

Are you seeing better performance with those config options too, or is it just a CPU time issue?

Rodrigo-Andrade (Author) commented:

When using dotnet-counters, the thread count on the pool was a stable 8 (on an 8-vcore machine), seldom going to 9, unless the threads were going up and down so fast that the counter was not picking it up.
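
For reference, that counter comes from the System.Runtime provider; a typical invocation, with <pid> standing in for the actual process id:

dotnet-counters monitor --process-id <pid> System.Runtime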

Are you seeing better performance with those config options too, or is it just a CPU time issue?

It doubled throughput; I am running with less than half the machines now.

I had some stalling issues, but I will probably report those in another issue.

kouvel (Member) commented Nov 18, 2020

The thread count in counters is just the number of threads that have been created; there may be fewer that are active on average.

It doubled throughput; I am running with less than half the machines now.

I see. I'm curious how close the prototype fix would get. Would you be able to try out a prototype fix? If so I can send you a patched runtime and some instructions on using it.

Rodrigo-Andrade (Author) commented:

Sure!

kouvel (Member) commented Nov 18, 2020

Which OS would you prefer for testing it?

kouvel (Member) commented Nov 18, 2020

Btw DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS is only applicable to non-Windows platforms. I believe UnsafePreferInlineScheduling would work on Windows too, but I'm guessing the results above were on Linux.

Rodrigo-Andrade (Author) commented:

It's running on Amazon Linux in a Docker container using the mcr.microsoft.com/dotnet/aspnet:5.0.0 image.

Rodrigo-Andrade (Author) commented Dec 3, 2020

@kouvel as we move more apps to net5.0 the overhead gets even more pronounced (I guess the changes to SocketAsyncEngine are to blame), making the move from 3.1 to 5 yield no performance improvement unless we use the inline options; but those are dangerous, so we can't enable them on all our apps.

Looks like the main problem is that we run our apps alone (no other apps on the VM) with low CPU (15-25%). When we inline everything on the SocketAsyncEngine, the performance skyrockets.

Maybe an easy fix would be being able to specify the minimum number of threads? Since it's one app per VM, having a minimum of one thread per core would be more efficient (I know that even then, fewer threads would be better for cache locality, but the overhead of waking up the threads offsets this by a big margin).
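
For what it's worth, the closest public knob is ThreadPool.SetMinThreads; a sketch is below. Note that it governs how many threads are injected without delay rather than keeping threads awake, so it may not remove the wake-up overhead described here.

using System;
using System.Threading;

class PoolConfig
{
    static void ConfigurePool()
    {
        // Keep the IOCP minimum as-is and pin the worker minimum to one per core.
        ThreadPool.GetMinThreads(out _, out int minIocp);
        ThreadPool.SetMinThreads(Environment.ProcessorCount, minIocp);
    }
}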

kouvel (Member) commented Dec 3, 2020

I'm about to build a prototype fix for you to try (sorry for the delay); a couple of questions about that:

  • Which processor architecture are you using, x64/arm/arm64?
  • For the app are you using a self-contained publish, or using the shared framework from the container image?

Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?

we run our apps alone (no other apps on the VM) with low cpu (15-25%)
Maybe an easy fix was to be able to specify the minimum thread number?

I see. How many procs are assigned to the VM/container? It's possible to configure the thread pool's min worker thread count to lower than the proc count. Set the following env vars before starting the app in the same shell:

export COMPlus_ThreadPool_ForceMinWorkerThreads=1
export COMPlus_HillClimbing_Disable=1

I'm disabling hill climbing as well, since it doesn't work very well at low thread counts at the moment. Also, in case it matters, note that the thread count specified above is treated as hexadecimal.
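
So, since the value is parsed as hex, forcing, say, twelve worker threads would look like:

export COMPlus_ThreadPool_ForceMinWorkerThreads=c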

Rodrigo-Andrade (Author) commented:

Which processor architecture are you using, x64/arm/arm64?

x64

For the app are you using a self-contained publish, or using the shared framework from the container image?

shared framework from the container image

Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?

For CPU-intensive apps it's about the same; for IO-intensive ones, it's worse.

How many procs are assigned to the VM/container?

Just the dotnet process

It's possible to configure the thread pool's min worker thread count to lower than the proc count.

Very nice! I'll try those new settings.

kouvel (Member) commented Dec 3, 2020

Thanks @Rodrigo-Andrade,

How many procs are assigned to the VM/container?

I meant: how many processors are assigned to the VM/container? I'm wondering whether it is already limited to a few processors, or whether it has several available and only the CPU utilization is limited.

Rodrigo-Andrade (Author) commented:

8 vCPUs

kouvel (Member) commented Dec 4, 2020

Here is a link to a patched System.Private.CoreLib.dll with a prototype fix: RequestFewerThreads50

To try it out, since you're using a container, it would probably be easiest to replace that file in the shared framework via a temporary modification of the image. Find the path to dotnet in the container image; the original file should be in <pathToDotnet>/shared/Microsoft.NETCore.App/5.0.0. Replace that file and try out your scenarios. Also make sure to revert the changes afterwards to restore the original container image, since this is just a locally built patch and would not be supported.

You can try with and without the config vars above, though the prototype fix may not help much with those set. I'm hoping to at least see some reduction in the overhead in and under EnsureThreadRequested; I'm not sure whether perf would also increase, though. Let me know!
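
A minimal sketch of such a temporary image modification, assuming the shared framework lives under /usr/share/dotnet in this image (verify the actual path in the container before copying; this overlay is test-only and unsupported):

FROM mcr.microsoft.com/dotnet/aspnet:5.0.0
# Overlay the patched CoreLib over the shared-framework copy (test-only, unsupported).
COPY System.Private.CoreLib.dll /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0/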

Rodrigo-Andrade (Author) commented:

I'll play with it this weekend, thank you!

Rodrigo-Andrade (Author) commented Dec 7, 2020

@kouvel just a preliminary result:
[throughput/CPU graph]

The orange line is the current app: .NET 5 unpatched, using DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS and Kestrel DangerousInlinePipeScheduling, which avoids most ThreadPool usage (15%-18% CPU usage).

The blue line is the experiment.
The first mound is your patch without any config parameters (40%-44% CPU usage).
The second mound is unpatched without any config parameters (stable at 50% CPU usage).

They all serve the same requests/s.

So it's an improvement.

I'll try to get some traces soon.

kouvel (Member) commented Dec 8, 2020

Thanks @Rodrigo-Andrade! It's a bit difficult to compare the blue line to the orange line because the work done is a fair bit different: the inline-completions mode bypasses pipelining and a bunch of other work (which is what makes it unreliable), so there's more work being done in the blue line. Still, it looks like there is room for improvement; I'm curious to see how much the overhead under EnsureThreadRequested has decreased with the patch.

Might also want to try with export COMPlus_HillClimbing_Disable=1 and without configuring the worker thread counts. Hill climbing can add a lot of threads unnecessarily and frequently, oversaturating the container; disabling it might have a noticeable effect in CPU-limited cases.

Rodrigo-Andrade (Author) commented Jan 11, 2021

Sorry for taking so long.

Tested COMPlus_HillClimbing_Disable and it did not make a difference.

Here are some numbers that I got:

[trace screenshot: unpatched]

Patched:
[trace screenshot: patched]

The impact in production is greater than this trace suggests.

kouvel (Member) commented Jan 12, 2021

Thanks @Rodrigo-Andrade. Since, after the patch, RequestWorkerThread would typically be called only 2 times per batch of work, it looks like the workload is very bursty with short bursts, and the additional path length of going through the thread pool and pipelining is probably significant relative to the other work being done. There may be ways to optimize those paths better, especially for CPU-constrained systems, but that would probably be an investigation for 6.0.
