Regression in non-pipelined Platform benchmarks #33669
@halter73 could you briefly describe how the non-pipelined platform code works? What I am interested in is: who spawns the task for a new connection, and for how long does it run? I can see that it's configured here: And the actual work is done here:
Kestrel's ConnectionDispatcher executes the ConnectionDelegate defined by HttpApplication.ExecuteAsync, which is defined in the Benchmarks repo. That is what ultimately calls BenchmarkApplication.ExecuteAsync. Does that answer your question?
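For readers following along, the shape of that per-connection code looks roughly like the sketch below. It uses the public ConnectionHandler abstraction (the class-based wrapper over ConnectionDelegate); the echo-style body is purely illustrative and is not the actual BenchmarkApplication code.

```csharp
// A minimal sketch, not the actual Benchmarks code: one long-running task per accepted
// connection that loops over the connection's PipeReader/PipeWriter until the client
// disconnects. ConnectionDispatcher invokes a delegate of this shape for each connection.
using System.Threading.Tasks;
using Microsoft.AspNetCore.Connections;

public class EchoApplication : ConnectionHandler
{
    public override async Task OnConnectedAsync(ConnectionContext connection)
    {
        while (true)
        {
            var result = await connection.Transport.Input.ReadAsync();
            if (result.IsCompleted && result.Buffer.IsEmpty)
            {
                break; // client closed the connection
            }

            // The platform benchmark parses requests and writes responses by hand here,
            // instead of going through the middleware pipeline; this sketch just echoes.
            foreach (var segment in result.Buffer)
            {
                await connection.Transport.Output.WriteAsync(segment);
            }

            connection.Transport.Input.AdvanceTo(result.Buffer.End);
        }
    }
}
```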
Yes, thank you very much! I have one more question left: do I understand correctly that
I am trying to understand why, after my change, the
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs Lines 435 to 445 in 3bb5f14
Correct.
Json resulting in more dispatching than pipelined plaintext makes sense, since there are more requests per read in the pipelined plaintext case. I'm not sure why platform is dispatching more than middleware, though. Middleware is basically built on top of the platform, so that doesn't make much sense to me yet.
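For context, the batching effect being described looks roughly like the sketch below. The parser and response writer are placeholders, not the actual Benchmarks code; the point is that with pipelining a single read can carry many requests, so there is one flush (and one potential dispatch) per batch instead of one per request.

```csharp
// Illustrative only: handle all pipelined requests found in one read, flush once per batch.
using System.Buffers;
using System.IO.Pipelines;
using System.Threading.Tasks;

static class PipelinedLoop
{
    public static async Task HandleAsync(PipeReader input, PipeWriter output)
    {
        while (true)
        {
            ReadResult result = await input.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            int parsed = 0;
            while (TryParseRequest(ref buffer))
            {
                WriteResponse(output);   // appends to the pipe, does not hit the socket yet
                parsed++;
            }

            // One flush per batch of pipelined requests instead of one per request.
            if (parsed > 0)
            {
                await output.FlushAsync();
            }

            input.AdvanceTo(buffer.Start, buffer.End);

            if (result.IsCompleted)
            {
                break;
            }
        }
    }

    // Placeholder parser/writer; the real benchmark parses HTTP/1.1 requests and writes
    // a precomputed response.
    private static bool TryParseRequest(ref ReadOnlySequence<byte> buffer) => false;
    private static void WriteResponse(PipeWriter output) { }
}
```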
Side note: Plaintext (NP) is now slower than Json (NP). There might be something to dig into here too. You might want to create brand new Plaintext and Json apps, completely independent from the current Benchmarks app, to be sure we are really comparing the Json serialization as the single change.
Minor update on what I've learned so far: The main difference between platform and non-platform benchmarks is that which internally schedules a job on
While the , which in the case of TechEmpower always executes the fast path - the write is non-blocking and nothing is added to the socket queue (it's a very small write, 131 bytes for the JSON platform benchmark):
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs Lines 1781 to 1782 in 90ab57c
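As a rough illustration of that fast path (my own sketch, not the SocketAsyncContext.Unix.cs code): the send is attempted synchronously on a non-blocking socket, and only a full kernel buffer would force the slower queued/async route.

```csharp
// Sketch of the "fast path" idea: try to complete the send synchronously first and only
// fall back to queueing an async operation when the kernel send buffer is full.
using System.Net.Sockets;

static class FastPathWrite
{
    public static bool TrySendSynchronously(Socket socket, byte[] payload)
    {
        socket.Blocking = false;

        int sent = socket.Send(payload, 0, payload.Length, SocketFlags.None, out SocketError error);

        // A 131-byte JSON response easily fits into the kernel send buffer, so in the
        // TechEmpower runs this path succeeds and nothing is added to the socket queue.
        if (error == SocketError.Success && sent == payload.Length)
        {
            return true;
        }

        // SocketError.WouldBlock means the buffer is full; the real implementation would
        // enqueue an async write operation here and complete it on an epoll notification.
        return false;
    }
}
```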
Plaintext (the default, pipelined version) has not regressed (on this hardware) because with pipelining enabled the flushing is much less frequent (16 times less frequent with the current settings).
The problem is gone when we replace
It still does not answer the question of why replacing a few
@stephentoub @kouvel have you ever faced a similar problem?
A possibility could be that more
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs Lines 435 to 445 in 5c6a91d
One
There is also a possibility the
Oops, I copied the wrong lines above; I meant this (edited above as well):
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs Lines 435 to 445 in 5c6a91d
This great insight made me try reducing the number of min and max threads in the ThreadPool. When I set the values to
Without it, there are on average
@kouvel what conditions need to be met by
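For anyone reproducing the min/max-threads experiment above: the exact values used were not captured in the comment, so the numbers below are placeholders, but the capping itself can be done with the standard ThreadPool APIs.

```csharp
// Minimal sketch: pin the worker-thread bounds so the pool cannot grow past the target.
using System;
using System.Threading;

class ThreadPoolCapExperiment
{
    static void Main()
    {
        // Placeholder target; the value actually used in the experiment was elided above.
        int workerThreads = Environment.ProcessorCount;

        ThreadPool.GetMinThreads(out _, out int ioMin);
        ThreadPool.GetMaxThreads(out _, out int ioMax);

        // Lower the minimum first so that SetMaxThreads cannot fail because the requested
        // maximum would fall below the current minimum, then pin both bounds to the same
        // value so hill climbing cannot inject extra worker threads.
        ThreadPool.SetMinThreads(workerThreads, ioMin);
        ThreadPool.SetMaxThreads(workerThreads, ioMax);
    }
}
```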
Hill climbing is probably the main cause of the additional threads. It monitors the throughput of work items and occasionally tries a higher or lower thread count (with proc-count threads as the minimum by default). If it sees a correlation between increasing the thread count and increasing throughput, it settles on a higher thread count and tries again from there. Hill climbing takes the CPU utilization of the process into account, but the current thresholds are pretty high (it does not increase the thread count if process CPU utilization is >= 95% over ~500 ms). The 13 threads that are barely used are probably from spikes in thread count changes from hill climbing; given how little they are used, they wouldn't be contributing much to throughput in either direction.

Could you try with

From a brief look, hill climbing may not be taking into account active requests for threads. In bursty cases like this, work item throughput would drop frequently just because the thread pool is out of work, and hill climbing may be reacting to this by increasing the thread count. There are thread count change events (

It may also help to allow hill climbing to go below proc-count threads by default, but from experiments so far it doesn't look like it works very well for that; it needs more investigation.
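A toy model of that feedback loop, to make the described failure mode easier to see. This is not the real HillClimbing.cs (which uses a more elaborate signal-processing approach); the stubs simulate a bursty workload where measured throughput is noisy, which is exactly what can nudge the thread count upward.

```csharp
// Toy model only: probe a different thread count, keep it if measured throughput improved
// and CPU utilization is below the ~95% threshold. All helpers are simulation stubs.
using System;

class HillClimbingToyModel
{
    static readonly Random Rng = new Random(42);

    static void Main()
    {
        int threadCount = Environment.ProcessorCount;   // proc count is the default lower bound
        double lastThroughput = MeasureThroughput(threadCount);

        for (int iteration = 0; iteration < 20; iteration++)
        {
            // Occasionally probe a different thread count, never below the lower bound.
            int candidate = Math.Max(threadCount + (Rng.Next(2) == 0 ? 1 : -1),
                                     Environment.ProcessorCount);

            double throughput = MeasureThroughput(candidate);

            if (throughput > lastThroughput && CpuUtilization() < 0.95)
            {
                threadCount = candidate;                 // settle on the new count
                lastThroughput = throughput;
            }

            Console.WriteLine($"iteration {iteration}: thread count goal = {threadCount}");
        }
    }

    // Stub: in a bursty workload the measured throughput is noisy because the pool often
    // runs out of work, which is what can mislead the algorithm into adding threads.
    static double MeasureThroughput(int threads) => threads + Rng.NextDouble() * 5;

    static double CpuUtilization() => 0.80;
}
```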
Still, it looks like it helps to decrease the time spent in that method. The PAL's full wait-for-semaphore is a bit slow, and an earlier sleep could make the spin-waiting less intrusive without taking the full expense of waiting on the semaphore. I'll do some experiments.
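A sketch of the spin-then-sleep-then-wait idea, as I understand it; this is an illustration of the pattern, not the thread pool's actual semaphore code.

```csharp
// Spin briefly, then yield, and only fall back to the comparatively expensive blocking
// semaphore wait as a last resort.
using System.Threading;

static class SpinThenWait
{
    public static void WaitForWork(SemaphoreSlim workAvailable, int spinIterations = 50)
    {
        var spinner = new SpinWait();

        for (int i = 0; i < spinIterations; i++)
        {
            if (workAvailable.Wait(0))
            {
                return;                 // got a signal while spinning, cheapest path
            }

            spinner.SpinOnce();         // busy-wait a little before giving up the CPU
        }

        // An "earlier sleep" makes the spinning less intrusive to other threads...
        Thread.Sleep(0);

        if (workAvailable.Wait(0))
        {
            return;
        }

        // ...before paying the full cost of blocking on the semaphore.
        workAvailable.Wait();
    }
}
```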
With this setting we have 33 threads in the ThreadPool; however, it's still "too many" and the RPS is still not good enough.
How much effect does it have on RPS? It sounds like starving the
There may be better solutions. Ideally, in the current system, I would like to see that one
@kouvel as agreed offline, I've prepared a repro.
First of all, we need a modified version of
This env var must be set to some value using the BenchmarksDriver command-line argument. Example:
The second thing is a modified JSON Platform benchmark that allows setting the maximum number of threads in the ThreadPool:
This is configurable via an optional
I've created a new branch of my fork (with a copy of the modified
git clone https://github.com/adamsitnik/Benchmarks-1.git repro
cd repro
git checkout threadPoolRepro
cd src\BenchmarksDriver
dotnet run -- --jobs ..\BenchmarksApps\Kestrel\PlatformBenchmarks\benchmarks.json.json --scenario "JsonPlatform" --server "$secret1" --client "$secret2" --display-output --collect-trace --output-file ".\System.Net.Sockets.dll" --env "MinHandles=32"

With the default settings from above (256 connections, an epoll thread created for every 32 connections, ThreadPool left untouched) I am getting an RPS of
The best config that I was able to find was:

--env "MinHandles=120" --env "maxThreadCount=24"

which gives a very impressive
I am going to send you an email with both trace files and the secret names of the machines.
Edit - important note: the machine has 28 cores (14 physical).
Thanks Adam. I found the
@adamsitnik, would you be able to summarize the changes in your
I have no idea how I managed to provide a link to the wrong repo... Apologies for that!
To make a long story short, I am planning to create a very simple Kestrel transport that has fewer features but is much faster because it avoids the extra overhead. I've just sent a PR with, hopefully, the final version of it: aspnet/Benchmarks#1480
I am going to forward you an email with a conversation about the overhead that I am talking about.
@adamsitnik wasn't your change reverted?
Having said that, I am going to close this issue when #35800 gets merged.
And #35800 was merged, so closing... I hope you didn't want to wait for more confirmation from official perf lab runs...
It looks like I've introduced a regression in the non-pipelined platform benchmarks in #2346.
This was noticed immediately by @sebastienros; I am creating the issue with a huge delay (it was impossible to gather traces on the TE machines for a while).