
System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() hotspot #44449

Rodrigo-Andrade opened this issue Nov 10, 2020 · 24 comments


Rodrigo-Andrade commented Nov 10, 2020

This is more of a loose question; feel free to close it if it's too off-topic.

For some time now I have been seeing System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() as a hotspot when profiling my code (a reverse proxy; we do plan to move to YARP once it's more mature).

It's always something like:

[profiler screenshot]

I see this on an 8-core VM, both on Windows and Linux. There is no thread starvation or anything of the sort.
Since that method seems to call into native code, I have no clue why this overhead is so big.

Any insight on the matter will, as always, be greatly appreciated.
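
For context, EnsureThreadRequested in the .NET 5 sources is approximately the shape below (a paraphrased sketch, not the exact code; numOutstandingThreadRequests and RequestWorkerThread are internal members of the thread pool). It CASes a counter of outstanding thread requests up to the processor count and, on success, asks the runtime for a worker thread, which is where the transition to native code happens:

internal void EnsureThreadRequested()
{
    // Only keep up to ProcessorCount thread requests outstanding at a time.
    int count = numOutstandingThreadRequests;
    while (count < Environment.ProcessorCount)
    {
        int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count + 1, count);
        if (prev == count)
        {
            // The expensive part: a call into the runtime to release or create a worker.
            ThreadPool.RequestWorkerThread();
            break;
        }
        count = prev;
    }
}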

@Rodrigo-Andrade Rodrigo-Andrade added the tenet-performance (Performance related issue) label Nov 10, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-System.Threading and untriaged (New issue has not been triaged by the area owner) labels Nov 10, 2020
@Rodrigo-Andrade Rodrigo-Andrade changed the title from System.Threading.ThreadPoolWorkQueue.ThreadPoolWorkQueue() hotspot to System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() hotspot Nov 10, 2020
EgorBo (Member) commented Nov 10, 2020

Most likely unrelated, but I noticed that EnsureThreadRequested has a few minor codegen issues; here is a simplified repro:

private int field;

bool EnsureThreadRequested(int count)
{
    int prev = Interlocked.CompareExchange(ref field, count + 1, count);
    return prev == count;
}

Current codegen (RyuJIT, Windows x64):

cmp      dword ptr [rcx], ecx  ; <-- redundant nullcheck, see #44087
add      rcx, 8
lea      r8d, [rdx+1]
mov      eax, edx
lock     
cmpxchg  dword ptr [rcx], r8d ; <-- cmpxchg modifies EFLAGS, so we could get rid of the following cmp
cmp      eax, edx
sete     al
movzx    rax, al
ret 

Codegen on Mono-LLVM (Linux x64 ABI, AT&T syntax):

push   %rax
mov    %esi,%eax
lea    0x1(%rax),%ecx
lock cmpxchg %ecx,0x10(%rdi)
sete   %al
pop    %rcx
retq

@mangod9 mangod9 added the question (Answer questions and provide assistance, not an issue with source code or documentation) label and removed the untriaged label Nov 12, 2020
@mangod9 mangod9 added this to the Future milestone Nov 12, 2020
mangod9 (Member) commented Nov 12, 2020

FYI @kouvel

kouvel (Member) commented Nov 13, 2020

In workloads with short bursts, the thread pool currently releases too many threads too quickly. It's a known issue; fixing it would involve some tradeoffs, but it would probably be a net improvement. I'm hoping to look into this for .NET 6.
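
A minimal sketch of the kind of workload this describes, with made-up burst sizes and timings: each burst drains quickly and the idle gap lets the queue empty, so every burst has to request worker threads again, and profiling something like this should show noticeable time under EnsureThreadRequested.

using System;
using System.Threading;

class BurstyWorkload
{
    static void Main()
    {
        const int BurstSize = 8;
        var done = new ManualResetEventSlim(false);
        int remaining = 0;

        for (int iter = 0; iter < 100_000; iter++)
        {
            // A short burst of tiny work items...
            remaining = BurstSize;
            done.Reset();
            for (int i = 0; i < BurstSize; i++)
            {
                ThreadPool.UnsafeQueueUserWorkItem(_ =>
                {
                    if (Interlocked.Decrement(ref remaining) == 0)
                        done.Set();
                }, null);
            }
            done.Wait();

            // ...then a brief idle gap before the next burst, so the pool's
            // pending-work count drops back to zero each time.
            Thread.SpinWait(1000);
        }
    }
}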

@kouvel kouvel modified the milestones: Future, 6.0.0 Nov 13, 2020
@kouvel kouvel removed the question label Nov 13, 2020
Rodrigo-Andrade (Author) commented:

@kouvel, any workarounds for now?

kouvel (Member) commented Nov 17, 2020

Are you seeing this in 3.1 or 5? In .NET 5 there has been some improvement to time spent there in async socket operations, though it doesn't fix the root cause. I'm not aware of any other workarounds that would reduce the CPU time there.

Rodrigo-Andrade (Author) commented:

Are you seeing this in 3.1 or 5?

It's .NET 5.

Using the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS environment variable and UnsafePreferInlineScheduling on the Kestrel transport greatly alleviated this (it doubled the app's performance). I guess that's because it makes the app enqueue to the thread pool much less. But it's a bit surprising that this overhead has such a great impact (on my app at least).
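
For anyone trying the same thing: the socket switch is an environment variable read at process start (non-Windows only, as noted below), set to 1 before launching the app, e.g.

export DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1

Both options run socket continuations inline on the event-handling threads instead of dispatching them to the thread pool, which is why they carry Unsafe/Dangerous names: a blocking continuation can stall the socket engine itself.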

kouvel (Member) commented Nov 17, 2020

I see, yeah, that would probably use the thread pool a lot less. The overhead probably means that the thread pool queue is staying relatively short, such that the active worker thread count remains below the processor count; and since the thread pool currently awakens more threads than it needs, it ends up taking the slower path to wake up a thread more frequently. The fix I have in mind would reduce that overhead, but it wouldn't eliminate it.

Are you seeing better performance with those config options too, or is it just a CPU time issue?

Rodrigo-Andrade (Author) commented:

When using dotnet-counters, the thread count on the pool was a stable 8 (on an 8-vcore machine), seldom going to 9, unless the threads were going up and down so fast that the counter was not picking it up.
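
For reference, that counter comes from the System.Runtime provider; a typical invocation, with <pid> standing in for the actual process id:

dotnet-counters monitor --process-id <pid> System.Runtime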

Are you seeing better performance with those config options too, or is it just a CPU time issue?

It doubled throughput; I am running with less than half the machines now.

I had some stalling issues, but I will probably report those in another issue.

kouvel (Member) commented Nov 18, 2020

The thread count in counters is just the number of threads that have been created; there may be fewer that are active on average.

It doubled throughput; I am running with less than half the machines now.

I see. I'm curious how close the prototype fix would get. Would you be able to try out a prototype fix? If so I can send you a patched runtime and some instructions on using it.

Rodrigo-Andrade (Author) commented:

Sure!

kouvel (Member) commented Nov 18, 2020

Which OS would you prefer for testing it?

kouvel (Member) commented Nov 18, 2020

Btw DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS is only applicable to non-Windows platforms. I believe UnsafePreferInlineScheduling would work on Windows too, but I'm guessing the results above were on Linux.

Rodrigo-Andrade (Author) commented:

It's running on Amazon Linux in a Docker container using the mcr.microsoft.com/dotnet/aspnet:5.0.0 image.

Rodrigo-Andrade (Author) commented Dec 3, 2020

@kouvel as we move more apps to net5.0 the overhead gets even more pronounced (I guess the changes to SocketAsyncEngine are to blame), making the move from 3.1 to 5 yield no performance improvement unless we use the inline options; but those are dangerous, so we can't enable them on all our apps.

Looks like the main problem is that we run our apps alone (no other apps on the VM) with low CPU (15-25%). When we inline everything on the SocketAsyncEngine, the performance skyrockets.

Maybe an easy fix would be being able to specify the minimum number of threads? Since it's one app per VM, having a minimum of one thread per core would be more efficient (I know that even then, fewer threads would be better for cache locality, but the overhead of waking up the threads offsets this by a big margin).
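
For what it's worth, the closest public knob is ThreadPool.SetMinThreads; a sketch is below. Note that it governs how many threads are injected without delay rather than keeping threads awake, so it may not remove the wake-up overhead described here.

using System;
using System.Threading;

class PoolConfig
{
    static void ConfigurePool()
    {
        // Keep the IOCP minimum as-is and pin the worker minimum to one per core.
        ThreadPool.GetMinThreads(out _, out int minIocp);
        ThreadPool.SetMinThreads(Environment.ProcessorCount, minIocp);
    }
}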

kouvel (Member) commented Dec 3, 2020

I'm about to build a prototype fix for you to try (sorry for the delay); a couple of questions about that:

  • Which processor architecture are you using, x64/arm/arm64?
  • For the app are you using a self-contained publish, or using the shared framework from the container image?

Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?

we run our apps alone (no other apps on the VM) with low cpu (15-25%)
Maybe an easy fix was to be able to specify the minimum thread number?

I see. How many procs are assigned to the VM/container? It's possible to configure the thread pool's min worker thread count to lower than the proc count. Set the following env vars before starting the app in the same shell:

export COMPlus_ThreadPool_ForceMinWorkerThreads=1
export COMPlus_HillClimbing_Disable=1

I'm disabling hill climbing as well, since it doesn't work very well at low thread counts at the moment. Also, in case it matters, note that the thread count specified above is treated as hexadecimal.
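
So, since the value is parsed as hex, forcing, say, twelve worker threads would look like:

export COMPlus_ThreadPool_ForceMinWorkerThreads=c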

Rodrigo-Andrade (Author) commented:

Which processor architecture are you using, x64/arm/arm64?

x64

For the app are you using a self-contained publish, or using the shared framework from the container image?

shared framework from the container image

Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?

For CPU-intensive apps it's about the same; for IO-intensive ones, it's worse.

How many procs are assigned to the VM/container?

Just the dotnet process

It's possible to configure the thread pool's min worker thread count to lower than the proc count.

Very nice! I'll try those new settings.

kouvel (Member) commented Dec 3, 2020

Thanks @Rodrigo-Andrade,

How many procs are assigned to the VM/container?

I meant: how many processors are assigned to the VM/container? I'm wondering whether it is already limited to a few processors, or whether it has several available and only the CPU utilization is limited.

Rodrigo-Andrade (Author) commented:

8 vCPUs

kouvel (Member) commented Dec 4, 2020

Here is a link to a patched System.Private.CoreLib.dll with a prototype fix: RequestFewerThreads50

To try it out, since you're using a container, it would probably be easiest to replace that file in the shared framework via a temporary modification of the image. Find the path to dotnet in the container image; the original file should be in <pathToDotnet>/shared/Microsoft.NETCore.App/5.0.0. Replace that file and try out your scenarios. Also make sure to revert the changes afterwards to restore the original container image, since this is just a locally built patch and would not be supported.

You can try with and without the config vars above, though the prototype fix may not help much with those set. I'm hoping to at least see some reduction in the overhead in and under EnsureThreadRequested; I'm not sure whether perf would also increase, though. Let me know!
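
A minimal sketch of such a temporary image modification, assuming the shared framework lives under /usr/share/dotnet in this image (verify the actual path in the container before copying; this overlay is test-only and unsupported):

FROM mcr.microsoft.com/dotnet/aspnet:5.0.0
# Overlay the patched CoreLib over the shared-framework copy (test-only, unsupported).
COPY System.Private.CoreLib.dll /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0/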

Rodrigo-Andrade (Author) commented:

I'll play with it this weekend, thank you!

Rodrigo-Andrade (Author) commented Dec 7, 2020

@kouvel just a preliminary result:
[throughput/CPU graph]

The orange line is the current app: .NET 5 unpatched, using DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS and Kestrel DangerousInlinePipeScheduling, which avoids most ThreadPool usage (15%-18% CPU usage).

The blue line is the experiment.
The first mound is your patch without any config parameters (40%-44% CPU usage).
The second mound is unpatched without any config parameters (stable at 50% CPU usage).

They all serve the same requests/s.

So it's an improvement.

I'll try to get some traces soon.

kouvel (Member) commented Dec 8, 2020

Thanks @Rodrigo-Andrade! It's a bit difficult to compare the blue line to the orange line because the work done is a fair bit different: the inline-completions mode bypasses pipelining and a bunch of other work (which is what makes it unreliable), so there's more work being done in the blue line. Still, it looks like there is room for improvement; I'm curious to see how much the overhead under EnsureThreadRequested has decreased with the patch.

Might also want to try with export COMPlus_HillClimbing_Disable=1 and without configuring the worker thread counts. Hill climbing can add a lot of threads unnecessarily and frequently, oversaturating the container; disabling it might have a noticeable effect in CPU-limited cases.

Rodrigo-Andrade (Author) commented Jan 11, 2021

Sorry for taking so long.

Tested COMPlus_HillClimbing_Disable and it did not make a difference.

Here are some numbers that I got:

[trace screenshot: unpatched]

Patched:
[trace screenshot: patched]

The impact in production is greater than this trace suggests.

kouvel (Member) commented Jan 12, 2021

Thanks @Rodrigo-Andrade. Since, after the patch, RequestWorkerThread would typically be called only 2 times per batch of work, it looks like the workload is very bursty with short bursts, and the additional path length of going through the thread pool and pipelining is probably significant relative to the other work being done. There may be ways to optimize those paths better, especially for CPU-constrained systems, but that would probably be an investigation for 6.0.
