System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() hotspot #44449
Most likely unrelated, but I noticed that

```cs
private int field;

bool EnsureThreadRequested(int count)
{
    int prev = Interlocked.CompareExchange(ref field, count + 1, count);
    return prev == count;
}
```

Current codegen (RyuJIT, Windows x64):

```asm
cmp      dword ptr [rcx], ecx   ; <-- redundant null check, see #44087
add      rcx, 8
lea      r8d, [rdx+1]
mov      eax, edx
lock
cmpxchg  dword ptr [rcx], r8d   ; <-- cmpxchg modifies EFLAGS, so we can get rid of the following cmp
cmp      eax, edx
sete     al
movzx    rax, al
ret
```

Codegen on Mono-LLVM (Linux x64 ABI, AT&T syntax):

```asm
push %rax
mov  %esi,%eax
lea  0x1(%rax),%ecx
lock cmpxchg %ecx,0x10(%rdi)
sete %al
pop  %rcx
retq
```
fyi @kouvel
In short bursty workloads the thread pool currently releases too many threads too quickly. It's a known issue; fixing it would involve some tradeoffs, but would probably be a better tradeoff overall. I'm hoping to look into this for .NET 6.
@kouvel any workarounds for now?
Are you seeing this in 3.1 or 5? In .NET 5 there has been some improvement to time spent there in async socket operations, though it doesn't fix the root cause. I'm not aware of any other workarounds that would reduce the CPU time there.
It's dotnet 5. Using
I see, yeah, that would probably use the thread pool a lot less. The overhead probably means that the thread pool queue stays relatively short, so the active worker thread count remains below the processor count; and since the thread pool currently wakes up more threads than necessary, it ends up taking the slower path to wake up a thread more frequently. The fix I have in mind would reduce that overhead but wouldn't eliminate it. Are you seeing better performance with those config options too, or is it just a CPU time issue?
When using dotnet-counters, the thread count on the pool was a stable 8 (on an 8-vcore machine), seldom going to 9. Unless the threads were going up and down so fast that the counter wasn't picking it up.
It doubled throughput; I am running with less than half the machines now. I had some stalling issues, but I will probably report those in another issue.
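For reference, this is roughly how those counters can be watched; a sketch, where the PID is a placeholder and the counter names come from the standard System.Runtime provider:

```sh
# Stream the built-in runtime counters for the running process, including
# "ThreadPool Thread Count" and "ThreadPool Queue Length".
dotnet-counters monitor --process-id <PID> System.Runtime
```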
The thread count in counters is just the number of threads that have been created; there may be fewer that are active on average.
I see. I'm curious how close the prototype fix would get. Would you be able to try it out? If so, I can send you a patched runtime and some instructions on using it.
Sure!
Which OS would you prefer for testing it?
Btw
it's running on Amazon Linux in a Docker container using the mcr.microsoft.com/dotnet/aspnet:5.0.0 image
@kouvel as we are moving more apps to net5.0 the overhead is even more pronounced (i guess the changes to
Looks like the main problem is the fact that we run our apps alone (no other apps on the VM) with low CPU (15-25%). When we inline everything on the
Maybe an easy fix would be to be able to specify the minimum thread number? Since it's one app per VM, having a minimum of one thread per core would be more efficient (I know that even then, fewer threads would be better for cache locality, but the overhead of waking up the threads offsets this by a big margin).
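For context, the closest existing programmatic knob is ThreadPool.SetMinThreads, sketched below; note that the minimum worker thread count already defaults to the processor count, and this setting only affects how quickly new threads are injected before throttling kicks in, not how aggressively sleeping threads are woken, so it may not remove the overhead discussed here:

```cs
using System;
using System.Threading;

// Read the current minimums (worker threads, IO completion threads).
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
Console.WriteLine($"current minimums: worker={minWorker}, io={minIo}"); // worker default == processor count

// Ask the pool to keep at least one worker thread per core available before
// throttling thread injection; returns false if the values are rejected.
bool ok = ThreadPool.SetMinThreads(Environment.ProcessorCount, minIo);
Console.WriteLine($"SetMinThreads succeeded: {ok}");
```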
I'm about to build a prototype fix for you to try (sorry for the delay); a couple of questions about that:
Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?
I see. How many procs are assigned to the VM/container? It's possible to configure the thread pool's min worker thread count to lower than the proc count. Set the following env vars before starting the app in the same shell:

```sh
export COMPlus_ThreadPool_ForceMinWorkerThreads=1
export COMPlus_HillClimbing_Disable=1
```

Disabling hill climbing as well, since it doesn't work very well at low thread counts at the moment. Also, if it matters, note that the thread count specified above is treated as hexadecimal.
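Since the app runs from a container image (mcr.microsoft.com/dotnet/aspnet:5.0.0 per the earlier comment), one option is to bake these into the image instead of a shell; a sketch, where the publish path and dll name are placeholders:

```dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:5.0.0

# Thread pool config read at process start; ForceMinWorkerThreads is parsed as hex.
ENV COMPlus_ThreadPool_ForceMinWorkerThreads=1 \
    COMPlus_HillClimbing_Disable=1

WORKDIR /app
COPY ./publish/ ./
ENTRYPOINT ["dotnet", "MyApp.dll"]
```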
x64
shared framework from the container image
For CPU-intensive apps it's about the same; for IO-intensive, it's worse.
Just the dotnet process
Thanks @Rodrigo-Andrade,
I meant how many processors are assigned to the VM/container? Wondering if it is already limited to a few processors or if it has several available and only the CPU utilization is limited.
8 vcpus
Here is a link to a patched System.Private.CoreLib.dll with a prototype fix: RequestFewerThreads50
To try it out, since you're using a container, it would probably be easiest to temporarily replace that file in the shared framework by modifying the image. Find the path to
You can try with and without the config vars above, though the prototype fix may not help much with the above config vars set. I'm hoping to at least see some reduction in the overhead in and under
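One way to do that temporary image modification, sketched under the assumption that the shared framework in the aspnet:5.0.0 image lives under /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0 (worth verifying inside the image first):

```dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:5.0.0

# Overlay the patched CoreLib on top of the shared framework. Check the exact
# version directory first, e.g.:
#   docker run --rm mcr.microsoft.com/dotnet/aspnet:5.0.0 \
#     ls /usr/share/dotnet/shared/Microsoft.NETCore.App
COPY System.Private.CoreLib.dll /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0/
```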
I'll play with it this weekend, thank you!
@kouvel just a preliminary result:
The orange line is the current app, dotnet 5 not patched, using
The blue line is the experiment.
They all have the same requests/s. So it's an improvement. I'll try to get some traces soon.
Thanks @Rodrigo-Andrade! It's a bit difficult to compare the blue to the orange line because the work done is a fair bit different: the inline-completions mode bypasses pipelining and a bunch of other work (which makes it unreliable), so there's more work being done in the blue line. Still, it looks like there is room for improvement; curious to see how much the overhead under
Might also want to try with
Thanks @Rodrigo-Andrade. Since after the patch
This is more like a loose question, feel free to close it if it's too off-topic.
For some time now I always see
System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested()
as a hotspot when profiling my code (a reverse proxy; we do plan to move to YARP once it's more mature). It's always something like:
I see this on an 8-core VM, both on Windows and Linux. No thread starvation or anything of the sort.
Since that method seems to call into native code, I have no clue why this overhead is so big.
Anything on the matter will, as always, be greatly appreciated.
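For reference, one common way to collect this kind of CPU profile (an assumption about tooling; the traces here may have been collected differently, and the PID is a placeholder):

```sh
# Sample CPU usage of the running process; the resulting trace can be opened
# in PerfView, or in speedscope when using --format speedscope.
dotnet-trace collect --process-id <PID> --profile cpu-sampling --format speedscope
```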