
High Idle CPU in DotNetty #4636

Closed
Zetanova opened this issue Nov 17, 2020 · 45 comments · Fixed by #4685

@Zetanova
Contributor

I still have the issue with idle nodes, more or less as in #4434.
Docker, Akka.NET 1.4.11, dotnet 3.1.404, debug and release builds.

All 7 nodes are idling and consume 100% CPU (Docker is limited to 3 cores).
The main hot path is still in DotNetty.

image

Message traffic is low; the node is idling.

@Zetanova
Contributor Author

Zetanova commented Nov 20, 2020

The ConcurrentQueue in the class Helios.Concurrency.DedicatedThreadPool.ThreadPoolWorkQueue could be replaced with the new System.Threading.Channels API from dotnet/runtime.

This would get rid of the UnfairSemaphore implementation, for better or worse.

private class ThreadPoolWorkQueue
{
    private static readonly int ProcessorCount = Environment.ProcessorCount;
    private const int CompletedState = 1;
    // pending work items enqueued by callers
    private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();
    // wakes sleeping worker threads; this is the part a Channel would make redundant
    private readonly UnfairSemaphore _semaphore = new UnfairSemaphore();
    private int _outstandingRequests;
    private int _isAddingCompleted;
I don't have the setup/knowledge to measure the perf effects of this change. Can somebody test it?
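A minimal sketch of what the Channel-based queue could look like (this is an illustration, not the actual branch; it assumes an unbounded channel and async worker loops):

using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

internal sealed class ChannelWorkQueue
{
    // The unbounded channel replaces ConcurrentQueue<Action> + UnfairSemaphore:
    // readers park inside WaitToReadAsync instead of spinning on a semaphore.
    private readonly Channel<Action> _channel = Channel.CreateUnbounded<Action>(
        new UnboundedChannelOptions
        {
            SingleReader = false, // many worker threads read
            SingleWriter = false  // any thread may enqueue
        });

    public bool TryAdd(Action work) => _channel.Writer.TryWrite(work);

    public void CompleteAdding() => _channel.Writer.TryComplete();

    // Worker loop: waits until work arrives, then drains everything it can see.
    public async Task RunWorkerAsync(CancellationToken token)
    {
        while (await _channel.Reader.WaitToReadAsync(token).ConfigureAwait(false))
        {
            while (_channel.Reader.TryRead(out var work))
                work();
        }
    }
}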

@Zetanova
Contributor Author

I made a branch https://github.com/Zetanova/akka.net/tree/helios-idle-cpu with a commit that changes the ThreadPoolWorkQueue to System.Threading.Channels.Channel

Please, can somebody run a test and benchmark it?

Or explain to me how to get Akka.MultiNodeTestRunner.exe started.

@Aaronontheweb
Member

Cc @to11mtm - guess I need to move up the timetable on doing that review

@Aaronontheweb
Member

@Zetanova I'll give your branch a try - out of office for a couple of days but I'll get on it

@to11mtm
Member

to11mtm commented Nov 21, 2020

@Zetanova I'll try to run this through the paces as well in the next few days. :)

@Zetanova
Contributor Author

I made a helios-io/DedicatedThreadPool fork https://github.com/Zetanova/DedicatedThreadPool/tree/try-channels

The problem is that the benchmark does not count the spin waits / idle CPU

CURRENT

--------------- RESULTS: Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------------
--------------- DATA ---------------
TotalBytesAllocated: Max: 3 227 648,00 bytes, Average: 3 220 716,31 bytes, Min: 3 219 456,00 bytes, StdDev: 3 076,37 bytes
TotalBytesAllocated: Max / s: 231 256 177,45 bytes, Average / s: 192 894 727,69 bytes, Min / s: 148 458 491,46 bytes, StdDev / s: 28 365 516,34 bytes

TotalCollections [Gen0]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen0]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections

TotalCollections [Gen1]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen1]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections

TotalCollections [Gen2]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen2]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections

[Counter] BenchmarkCalls: Max: 100 000,00 operations, Average: 100 000,00 operations, Min: 100 000,00 operations, StdDev: 0,00 operations
[Counter] BenchmarkCalls: Max / s: 7 183 082,40 operations, Average / s: 5 989 015,64 operations, Min / s: 4 611 291,21 operations, StdDev / s: 879 635,08 operations

WITH CHANNEL

------------ FINISHED Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ----------

--------------- RESULTS: Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ---------------
--------------- DATA ---------------
TotalBytesAllocated: Max: 3 219 456,00 bytes, Average: 3 219 456,00 bytes, Min: 3 219 456,00 bytes, StdDev: 0,00 bytes
TotalBytesAllocated: Max / s: 233 222 932,15 bytes, Average / s: 205 354 466,34 bytes, Min / s: 157 580 871,74 bytes, StdDev / s: 28 045 400,62 bytes

TotalCollections [Gen0]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen0]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections

TotalCollections [Gen1]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen1]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections

TotalCollections [Gen2]: Max: 0,00 collections, Average: 0,00 collections, Min: 0,00 collections, StdDev: 0,00 collections
TotalCollections [Gen2]: Max / s: 0,00 collections, Average / s: 0,00 collections, Min / s: 0,00 collections, StdDev / s: 0,00 collections

[Counter] BenchmarkCalls: Max: 100 000,00 operations, Average: 100 000,00 operations, Min: 100 000,00 operations, StdDev: 0,00 operations
[Counter] BenchmarkCalls: Max / s: 7 244 172,06 operations, Average / s: 6 378 545,52 operations, Min / s: 4 894 642,81 operations, StdDev / s: 871 122,35 operations

------------ FINISHED Helios.Concurrency.Tests.Performance.DedicatedThreadPoolBenchmark+ThreadpoolBenchmark ----------

@Zetanova
Contributor Author

Even if the default implementation from dotnet/runtime does not fit,
we could implement a custom Channel and reuse it elsewhere,
even if it's only the ChannelReader subpart.

Increasing and decreasing the number of thread-workers is easily possible.
Because all awaiting thread-workers are woken on new work,
thread-workers can count misses in their loop
and can stop by themselves, or the DTP can mark them to stop (see the sketch after the links below).

Even a scenario with zero alive thread-workers could be possible.

Here are a few links on the channel topic.

Introduction from the .NET team:
https://devblogs.microsoft.com/dotnet/an-introduction-to-system-threading-channels/

Source code:
https://github.com/dotnet/runtime/blob/master/src/libraries/System.Threading.Channels/src/System/Threading/Channels/UnboundedChannel.cs

Detailed blog post:
https://www.stevejgordon.co.uk/dotnet-internals-system-threading-channels-unboundedchannel-part-1
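As a rough sketch of the scale-down idea above (the miss threshold and names are made up for illustration; the usual System.Threading.Channels usings are assumed):

// A worker retires itself after too many consecutive wake-ups without work,
// letting the pool shrink toward zero threads when the system is idle.
private static async Task WorkerLoopAsync(ChannelReader<Action> reader, CancellationToken token)
{
    const int MaxMisses = 3; // illustrative threshold
    var misses = 0;

    while (!token.IsCancellationRequested)
    {
        if (reader.TryRead(out var work))
        {
            misses = 0;
            work();
            continue;
        }

        // All waiting workers are woken when new work arrives; a worker that
        // loses the race to TryRead records a miss.
        misses++;
        if (misses >= MaxMisses)
            break; // scale down: this worker stops itself

        if (!await reader.WaitToReadAsync(token).ConfigureAwait(false))
            break; // channel completed
    }
}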

@to11mtm
Member

to11mtm commented Nov 21, 2020

@Zetanova I ran some tests against the branch in #4594 to see whether this helped/hurt.

Background: On my local machine, under RemotePingPong, Streams TCP Transport gets up to 300k messages/sec if everything runs on the normal .NET Threadpool.

  • Existing DedicatedThreadPool - 180,000-220,000 msg/sec (this variance increased when the internal-dispatcher changes were merged in, @Aaronontheweb, but I'm going to blame my dirty laptop to some extent)
  • System.Threading.Channels based DedicatedThreadPool - 130,000-170,000 msg/sec
  • System.Threading.Channels based DedicatedThreadPool, AllowSynchronousContinuations = true - 200,000-220,000 msg/sec

I think this could be on the right track; I know that UnfairSemaphore has been somewhat supplanted/deprecated in .NET Core at this point too.
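For reference, the third configuration above presumably amounts to something like this when the channel is created (a sketch; the option values are the only point here):

// AllowSynchronousContinuations = true lets a writer run the awaiting reader's
// continuation inline, saving a thread hop per message at the cost of doing
// reader work on the writer's thread.
var channel = Channel.CreateUnbounded<Action>(new UnboundedChannelOptions
{
    SingleReader = false,
    SingleWriter = false,
    AllowSynchronousContinuations = true
});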

@Zetanova
Contributor Author

@to11mtm thanks for the test. I didn't know what continuations AllowSynchronousContinuations affects, so I didn't set it.

Kestrel is using System.Threading.Channels for connection management.

What's important is to test the idle state of a cluster under Windows and/or Linux.
The current Akka 1.4.12 idles at 100-120 millicores per Akka node in a k8s cluster,
and on my dev machine with a 3-core limit for Docker/WSL2, at 25% per idling node.
It does not matter whether the node has custom actors running or is only connected 'empty' to the cluster.

On my on-premise k8s cluster it does not matter that much, but on AWS or Azure it matters a lot.
Nearly all EC2 instances without "unlimited" mode cannot sustain more than 20% constant CPU load per core.

I will now try to implement an autoscaler for the DTP.

@to11mtm
Member

to11mtm commented Nov 21, 2020

@Zetanova I think it's definitely on the right path. If you can auto-scale, that might help too. What I noticed in the profiler is that we still have all of these threads waiting for channel reads very frequently; I'm not sure if there's a cleaner way to keep them fed...

@to11mtm
Member

to11mtm commented Nov 21, 2020

Sorry, one more note...

I wonder whether we should peek at Orleans Schedulers for some inspiration?

At [one point](https://github.com/dotnet/orleans/pull/3792/files) they were actually using a variation of our Threadpool complete with credited borrowing of UnfairSemaphore. It doesn't look like they use that anymore, so perhaps we can look at how they evolved and take some lessons.

@Aaronontheweb
Member

This looks like a relatively simple change @Zetanova @to11mtm - very interested to see what we can do. Replacing the DedicatedThreadPool with System.Threading.Channels makes a lot of sense to me.

@Aaronontheweb
Member

I wonder whether we should peek at Orleans Schedulers for some inspiration?
At [one point](https://github.com/dotnet/orleans/pull/3792/files) they were actually using a variation of our Threadpool complete with credited borrowing of UnfairSemaphore. It doesn't look like they use that anymore, so perhaps we can look at how they evolved and take some lessons.

I'm on board with implementing good ideas no matter where they come from. The DedicatedThreadPool abstraction was something we created back in... must have been 2013 / 2014. It's ancient. .NET has evolved a lot since then in terms of the types of scheduling primitives it allows.

@Aaronontheweb
Member

I think a major part of the issue with the DedicatedThreadPool, as this was something we looked at prior to the internal dispatcher change, is that it pre-allocates all threads up front - therefore you're going to have a lot of idle workers lying around checking for work in systems that aren't busy. The design should be changed to auto-scale threads up and down.

I suggested a few ways of doing this - one was to put a tracer round in the queue and measure how long it took to make it to the front. The other was to measure the growth in the task queue and allocate threads based on growth trends. Both of these have costs in terms of complexity and raw throughput, but the advantage is that in less busy or sporadically busy systems they're more efficient at conserving CPU utilization.
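A rough sketch of the first idea, the "tracer round" (all names here are hypothetical; the channel writer stands in for whatever enqueue path the pool exposes):

using System;
using System.Diagnostics;
using System.Threading.Channels;

static class TracerRound
{
    // Enqueue a timestamped no-op and report how long it sat in the queue.
    // Rising queue latency suggests adding a worker; near-zero latency
    // suggests the pool can shrink.
    public static void Fire(ChannelWriter<Action> writer, Action<TimeSpan> onMeasured)
    {
        var enqueuedAt = Stopwatch.GetTimestamp();
        writer.TryWrite(() =>
        {
            var elapsed = Stopwatch.GetTimestamp() - enqueuedAt;
            onMeasured(TimeSpan.FromSeconds(elapsed / (double)Stopwatch.Frequency));
        });
    }
}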

@Aaronontheweb
Member

Looks like the CLR solves this problem via a hill-climbing algorithm to continually try to optimize the thread count https://github.com/dotnet/runtime/blob/4dc2ee1b5c0598ca02a69f63d03201129a3bf3f1/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.HillClimbing.cs

@Aaronontheweb
Member

Based on the data from this PR that @to11mtm referenced: dotnet/orleans#6261

An idea: the big problem we've tried to solve by having separate threadpools was ultimately caused by the idea of work queue prioritization - that some work, which is time sensitive, needs to have a shorter route to being actively worked on than others.

The big obstacle we've run into historically with the default .NET Threadpool was that its work queue can grow quite large, especially with a large number of Tasks, /user actor messages, and so on - and as a result of this the /system actors, whose work is much more concentrated and time sensitive, suffered as a result.

What if we solved this problem by having two different work queues routing to the same thread pool rather than two different work queues routing to separate thread pools? We could move the /system and /user actors onto separate dispatchers, each with their own work queue (which we'd have to implement by creating something that sits above the Threadpool, i.e. a set of separate System.Threading.Channels.Channel<T> instances), but both of them would still use the same underlying threads to conduct the work.

The problems that could solve:

  1. No idle CPU issues;
  2. Don't have to reinvent the wheel on thread-management issues; and
  3. Still accomplishes the goal of mutually exclusive / prioritized queues.

The downsides are that outside of the Akka.NET dispatchers, anyone can queue work onto the underlying threadpool - so we might see a return of the types of problems we had around Akka.NET 1.0 where time sensitive infrastructure tasks like Akka.Remote / Akka.Persistence time out due to the length of the work queue.

I'd be open to experimenting with that approach too and ditching the idea of separate thread pools entirely.
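A very rough sketch of what that could look like: two channels feeding one drain pass that runs on ordinary .NET ThreadPool threads and always prefers /system work (all names are illustrative, not a proposed API):

using System;
using System.Threading;
using System.Threading.Channels;

sealed class TwoQueueDispatcher
{
    private readonly Channel<Action> _system = Channel.CreateUnbounded<Action>();
    private readonly Channel<Action> _user = Channel.CreateUnbounded<Action>();
    private int _drainScheduled; // 0 = idle, 1 = a drain pass is queued/running

    public void ScheduleSystem(Action work) { _system.Writer.TryWrite(work); EnsureDrain(); }
    public void ScheduleUser(Action work)   { _user.Writer.TryWrite(work);   EnsureDrain(); }

    private void EnsureDrain()
    {
        // Avoid queuing more than one drain pass to the shared ThreadPool at a time.
        if (Interlocked.CompareExchange(ref _drainScheduled, 1, 0) == 0)
            ThreadPool.UnsafeQueueUserWorkItem(_ => Drain(), null);
    }

    private void Drain()
    {
        Interlocked.Exchange(ref _drainScheduled, 0);
        // /system work always jumps the line.
        while (_system.Reader.TryRead(out var sys)) sys();
        while (_user.Reader.TryRead(out var usr))
        {
            usr();
            if (_system.Reader.TryRead(out var sys2)) sys2(); // keep /system latency low
        }
    }
}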

@to11mtm
Member

to11mtm commented Nov 21, 2020

The downsides are that outside of the Akka.NET dispatchers, anyone can queue work onto the underlying threadpool - so we might see a return of the types of problems we had around Akka.NET 1.0 where time sensitive infrastructure tasks like Akka.Remote / Akka.Persistence time out due to the length of the work queue.

Perhaps then it makes sense to keep the existing one around if this route is taken? That way, if you are unfortunately having to deal with noisy code for whatever reason in your system, you can at least 'pick your poison'.

This does fall into the category of 'things that are easier to solve in .NET Core 3.1+': 3.1+ lets you look at the work queue counts, and at that point we could 'spin up' additional threads if the work queue looks too long.
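On .NET Core 3.0+/3.1+ that might look roughly like this (the threshold is arbitrary, just a sketch):

// Poll the global pool's counters and decide whether to add a dedicated worker.
long pending = ThreadPool.PendingWorkItemCount; // global work queue depth
int threads  = ThreadPool.ThreadCount;          // threads currently in the pool
if (pending > threads * 4)
{
    // queue looks too long: a scaler could spin up an extra thread here
}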

@Zetanova
Contributor Author

Yes, the DedicatedThreadPool is not ideal.
I am currently working on it only to simplify it and maybe remove the idle-CPU issue.

2-3 channels to queue work by priority inside a single dispatcher would be the way to go:
channel-3: instant/work-stealing
channel-2: high-priority/short work
channel-1: normal/long work

The queue algorithm could be very simple, something like (see the sketch below):

  1. Queue, or maybe execute, all work from channel-3
  2. Queue a few work items from channel-2
  3. If channel-3 or channel-2 had work
    then queue only one work item from channel-1
    else queue a few work items from channel-1
  4. If there was no work, then wait on channel-1, channel-2 or channel-3
  5. Repeat from 1)

Maybe channel-3 is not needed and a flag to directly execute the work item can be used.
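A loose sketch of that loop (channel numbering follows the list above; the batch size is made up; the usual System.Threading.Channels and Task usings are assumed):

static async Task DispatchLoopAsync(
    ChannelReader<Action> ch1, // normal / long work
    ChannelReader<Action> ch2, // high priority / short work
    ChannelReader<Action> ch3, // instant / work-stealing
    CancellationToken token)
{
    const int Few = 4; // illustrative batch size
    while (!token.IsCancellationRequested)
    {
        var hadPriorityWork = false;

        // 1) execute all work from channel-3
        while (ch3.TryRead(out var w3)) { hadPriorityWork = true; w3(); }

        // 2) take a few items from channel-2
        for (var i = 0; i < Few && ch2.TryRead(out var w2); i++) { hadPriorityWork = true; w2(); }

        // 3) channel-1 gets one item if priority work was seen, otherwise a few
        var budget = hadPriorityWork ? 1 : Few;
        var didWork = hadPriorityWork;
        for (var i = 0; i < budget && ch1.TryRead(out var w1); i++) { didWork = true; w1(); }

        // 4) nothing anywhere: wait until any of the three channels has work
        if (!didWork)
        {
            await Task.WhenAny(
                ch1.WaitToReadAsync(token).AsTask(),
                ch2.WaitToReadAsync(token).AsTask(),
                ch3.WaitToReadAsync(token).AsTask()).ConfigureAwait(false);
        }
        // 5) repeat
    }
}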

If an external source queues too much work on the ThreadPool,
other components like sockets will have issues, not only Akka.Remote.
Akka.NET should not try to "resolve" this external issue.

I will look into the Dispatcher next, after the DedicatedThreadPool.

@Zetanova
Contributor Author

@to11mtm please benchmark my commit again: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu
I added a simple auto-scaler.

If possible, please form a 5-7 node cluster and look at the idle CPU state.

Maybe somebody has time to explain to me how to start the benchmarks and MultiNode tests.
Somehow I don't get it.

@to11mtm
Member

to11mtm commented Nov 22, 2020

@Zetanova - Looks like this last set of changes impacted throughput negatively; it looks like either we are spinning up new threads too slowly, or there's some other overhead negatively impacting us as we try to ramp up.

What I'm measuring is the messages/sec of RemotePingPong on [this branch](https://github.com/to11mtm/akka.net/tree/remote-full-manual-protobuf-deser); if you can build it you should be able to run it easily enough.

Edit: It's kinda all over the place with this set of changes, anywhere from 100,000 to 180,000 msg/sec.

If possible, please form a 5-7 node cluster and look at the idle CPU state.

Unfortunately I don't have a cluster setup handy that I can use for testing this, and I won't have time to set one up for quite some time either :(

@Zetanova
Contributor Author

@to11mtm thanks for the run.
How many cores are you testing on?
Maybe it is just that I set the max thread count to Environment.ProcessorCount - 1:
https://github.com/Zetanova/akka.net/blob/61a0d921d74ac10b8aaba6bc09cc0f25bff87ed3/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs#L53

Currently the scheduler checks every 50 work items whether to reschedule.
There is now a _waitingWork counter that we could use to force a thread increase.

But the main goal is not to support max throughput;
it's to test whether it scales down and/or the idle CPU issue gets resolved.

@Zetanova
Contributor Author

@to11mtm I checked again, found a small error and made a new commit.

The mistake is that _cleanCounter = 0 should be _cleanCounter = 1:
https://github.com/Zetanova/akka.net/blob/7dd6279dac948dea23bd87d252717fc28ea9728a/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs#L328-L333

Otherwise it should behave more or less the same as the first commit without the auto-scaler.

It spins up MaxThreads from the start and scales down only if the work count is very low.
Under load like PingPong and RemotePingPong no down-scaling happens.

I could not run RemotePingPong because of some null exception on startup,
but PingPong ran and looked OK.

My CPU ran at 'only' 60%; that's because of Intel Hyper-Threading.

@Zetanova
Contributor Author

@Aaronontheweb Could you take a look?
If it takes much longer I will need to replace the Intel i7-920 in my dev machine after 11 years.

@Aaronontheweb
Member

@Zetanova haven't been able to get RemotePingPong to run on my machine with these changes yet - it just idles without running

@Zetanova
Contributor Author

@Aaronontheweb This is the simplest one and most likely the most performant.
Try this branch, it uses the normal ThreadPool:
https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled

@Zetanova
Contributor Author

It does work, but Akka is not using the DedicatedThreadPoolTaskScheduler, only the DedicatedThreadPool for the ForkJoinExecutor.

@Aaronontheweb
Member

I'm taking some notes as I go through this - we really have three issues here:

  1. Our custom Thread implementations are inefficient at managing scenarios where there isn't enough scheduled work to do - this is true for DotNetty, the scheduler, and the DedicatedThreadPool. Not a problem that anyone other than the .NET ThreadPool has solved well. Automatically scaling the thread pools up and down with demand would solve a lot of those problems. Hence why we have issues such as "HashedWheelTimerScheduler is spending 35% of execution time in sleep" (#4031)
  2. We have too many thread pools - DotNetty has two of its own, we have the .NET ThreadPool, and when running Akka.Remote we have one dedicated thread pool for remoting and a second one for all /system actors, plus a dedicated thread for the scheduler. All of those custom thread pool implementations are excellent for separating work queues, but not great at managing threads efficiently within a single Akka.NET process.
  3. Some of these threadpool implementations are less efficient than others - the DotNetty scheduler, for instance, appears to be particularly inefficient when it's used. Hence some of the issues we've had historically with the Akka.Remote batching system on lower bandwidth machines.

Solutions, in order of least risk to existing Akka.NET users / implementations:

  1. Rewrite the DedicatedThreadPool to scale up and scale down, per @Zetanova's attempts - that's a good effort and can probably be optimized without that much effort. I'd really need to write an Idle CPU measurement and stick that in the DedicatedThreadPool repository, which I don't think would be terribly hard.
    1a. Migrate the DotNetty Single Thread Event Executor / EventLoopGroup to piggy-back off of the Akka.Remote dispatcher. Fewer threads to manage and keep track of idle / non-idle times.
    1b. Migrate the Akka.Remote.DotNettyTransport batching system to piggy-back off of the HashedWheelTimer instead of DotNetty itself. If all of that can be done successfully, then none of DotNetty's threading primitives should be used.
  2. Rewrite the dispatchers to implement TaskSchedulers and rewrite all mailbox processing to occur as a single-shot Task. This is something we've discussed as part of the 1.5 milestone anyway and it would solve a lot of problems for Akka.NET actors (i.e. AsyncLocal now works correctly from inside actors, all Tasks initiated from inside an actor get executed inside the same dispatcher, etc.) The risk is that there are a lot of potentially unknown side effects and it will require introducing new APIs and deprecating old ones. Most of these APIs are internal so it's not a big deal, but some of them are public and we always need to be careful with that. The thread management problems in this instance would be solved by moving all of our work onto the .NET ThreadPool and simply using different TaskScheduler instances to manage the workloads on a per-dispatcher basis.

I'm doing some work on item number 2 to assess how feasible that is - since that can descend into yak-shaving pretty quickly.

Getting approach number 1 to work is more straightforward and @Zetanova has already done some good work there. It's just that I consider approach number 2 to be a better long-term solution to this problem, and if it's only marginally more expensive to implement, then that's what I'd prefer to do.
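For reference, the smallest possible shape of the "dispatcher as a TaskScheduler" idea from approach number 2 might be something like this (a sketch only: a real dispatcher would add its own queue, throughput limits, and so on):

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Mailbox runs become Tasks scheduled on a per-dispatcher TaskScheduler,
// while the actual threads all come from the shared .NET ThreadPool.
sealed class DispatcherTaskScheduler : TaskScheduler
{
    protected override void QueueTask(Task task) =>
        ThreadPool.UnsafeQueueUserWorkItem(_ => TryExecuteTask(task), null);

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) =>
        !taskWasPreviouslyQueued && TryExecuteTask(task); // inline only fresh tasks

    protected override IEnumerable<Task> GetScheduledTasks() =>
        Enumerable.Empty<Task>(); // debugger support only; nothing is retained here
}

A mailbox pass would then be started with Task.Factory.StartNew(run, CancellationToken.None, TaskCreationOptions.None, dispatcherScheduler), so each dispatcher keeps its own scheduling hook while the threads themselves stay pooled.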

@Aaronontheweb
Member

Aaronontheweb commented Nov 24, 2020

Some benchmark data from some of @Zetanova's PRs on my machine (AMD Ryzen 1st generation)

As a side note: looks like we significantly increased the number of messages written per round. That is going to crush the first round of this benchmark due to the way batching is implemented - we can never hit the threshold so long as the number of messages per round / per actor remains low on that first round. But that's a good argument for leaving batching off by default, I suppose.

dev:

ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 1434, 139533.84
5, 1000000, 191022, 5235.61
10, 2000000, 181703, 11007.80
15, 3000000, 179781, 16687.83
20, 4000000, 170904, 23405.72
25, 5000000, 176704, 28296.62
30, 6000000, 175856, 34119.68
Done..

helios-idle-cpu-pooled:

ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 1194, 167506.14
5, 1000000, 156765, 6379.06
10, 2000000, 156556, 12775.24
15, 3000000, 158815, 18890.32
20, 4000000, 164908, 24256.93
25, 5000000, 165810, 30155.52

helios-idle-cpu

ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True

Num clients, Total [msg], Msgs/sec, Total [ms]
1, 200000, 1215, 164698.09
5, 1000000, 192419, 5197.48
10, 2000000, 190477, 10500.94
15, 3000000, 185679, 16157.99
20, 4000000, 183209, 21833.07
25, 5000000, 126657, 39477.82
30, 6000000, 192314, 31199.53

@Zetanova
Contributor Author

@Aaronontheweb Thanks for testing.

The 'helios-idle-cpu-pooled' branch only contains a modification of the DedicatedThreadPoolTaskScheduler
that schedules work on the .NET ThreadPool. I thought that Akka was already using a TaskScheduler in the Dispatcher, but it does not use one.
https://github.com/Zetanova/akka.net/blob/0fb700d0754c447652e121337ca41fd44900eb65/src/core/Akka/Helios.Concurrency.DedicatedThreadPool.cs#L114-L267

You can use it for your approach 2). If the dispatchers used this TaskScheduler, then work items would be processed in a loop, in parallel up to ProcessorCount, and a pooled thread would be released only once the work-item queue is empty.
It is the same as before but without the custom DedicatedThreadPool implementation.

If the .NET ThreadPool is not creating threads fast enough, it could be manipulated with ThreadPool.SetMinThreads.
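For example (the values are illustrative, not a recommendation):

// Raise the worker-thread floor so the pool injects threads without its usual delay.
ThreadPool.GetMinThreads(out var workerMin, out var ioMin);
ThreadPool.SetMinThreads(Math.Max(workerMin, Environment.ProcessorCount * 2), ioMin);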

@Aaronontheweb
Member

@Zetanova I think you have the right idea with your design thus far.

After doing some tire-kicking on approach number 2, it's clear that's a big, hairy redesign that won't solve problems for people with idle CPU issues right now. I'm going to suggest that we try approach number 1 and get a fix out immediately so we can improve the Akka.NET experience for users running on 1.3 and 1.4 right now. Implementing approach number 2 will likely need to wait until Akka.NET v1.5.

@Zetanova
Contributor Author

@Aaronontheweb I made a simple new commit now. It replaces the ForkJoinExecutor with the TaskSchedulerExecutor
but uses the new DedicatedThreadPoolTaskScheduler:
https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled

PingPong works well; memory and GC usage went down.

Even with this change there will most likely be a large decrease in idle CPU.

If possible, please test this one with RemotePingPong too.

@Aaronontheweb
Member

Will do - I'll take a look. I'm working on an idle CPU benchmark for DedicatedThreadPool now - if that works well I'll do one for Akka.NET too

@Aaronontheweb
Member

Working on some specs to actually measure this here: helios-io/DedicatedThreadPool#23

@Aaronontheweb
Member

So in case you're wondering what I'm doing, here's my approach:

  1. Measure the actual idle CPU utilization on a DedicatedThreadPool that has zero work using docker stats - be able to do this repeatedly via a unit test. I want this so we have a quantifiable baseline;
  2. Implement a "hill climbing" algorithm used to determine when to add a thread, remove a thread, and so on - and test the algorithm using FsCheck to validate its output under dynamic and changing circumstances;
  3. Replace / upgrade the DedicatedThreadPool to implement said algorithm and reduce idle cpu utilization and improve dynamic behavior at run-time.

@Aaronontheweb
Member

The UnfairSemaphore in the DedicatedThreadPool does an excellent job of keeping the number of active threads from creeping up when the CPU count is low, which I've been able to verify by manually changing the CPU levels up and down. Running 1600 idle threads on a 16-core machine = 0% CPU once the queue is empty.

I can't even reproduce the idle CPU issues at the moment - so it makes me wonder if the issues showing up in Akka.NET have another contributing factor (i.e. intermittent load applied by scheduler-driven messaging) that is creating the issue. I'm going to continue to play with this.

@Aaronontheweb
Member

Running an idle Cluster.WebCrawler cluster:

CONTAINER ID        NAME                                            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O   PIDS
a17996cdd6f6        clusterwebcrawler_webcrawler.web_1              10.75%    87.4MiB / 50.17GiB    0.17%     120kB / 122kB     0B / 0B     57
c548cb431955        clusterwebcrawler_webcrawler.crawlservice_1     8.25%     44.29MiB / 50.17GiB   0.09%     125kB / 123kB     0B / 0B     39
06e38eed576d        clusterwebcrawler_webcrawler.trackerservice_1   10.75%    46.03MiB / 50.17GiB   0.09%     130kB / 127kB     0B / 0B     39
214aec75d2b5        clusterwebcrawler_webcrawler.lighthouse2_1      0.53%     33.39MiB / 50.17GiB   0.07%     1.16kB / 0B       0B / 0B     22
4996a84e06ef        clusterwebcrawler_webcrawler.lighthouse_1       5.10%     42.62MiB / 50.17GiB   0.08%     134kB / 133kB     0B / 0B

Lighthouse 2 has no connections - it's not included in the cluster. This tells me that there's something other than the DedicatedThreadPool design itself that is responsible for this. Even on a less powerful Intel machine I can't generate much idle CPU using just the DedicatedThreadPool.

@to11mtm
Member

to11mtm commented Dec 5, 2020

Looks like the CLR solves this problem via a hill-climbing algorithm to continually try to optimize the thread count https://github.com/dotnet/runtime/blob/4dc2ee1b5c0598ca02a69f63d03201129a3bf3f1/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.HillClimbing.cs

Interesting... PortableThreadPool is newer bits. Too bad it's still very tightly coupled and not re-usable.

Lighthouse 2 has no connections - it's not included in the cluster. This tells me that there's something other than the DedicatedThreadPool design itself that is responsible for this. Even on a less powerful Intel machine I can't generate much idle CPU using just the DedicatedThreadPool.

Thought:

All 7 nodes are idling and consume 100% (docker is limited to 3 cores)

Has anything been done to check whether this is a resource-constraint issue? The HashedWheelTimer and the DotNetty executor will each take one thread of their own, alongside whatever else each DTP winds up doing.

@Aaronontheweb
Member

yeah, that was my thinking too @to11mtm - I think it's a combination of factors.

One thing I can do - make an IEventLoop that runs on the Akka.Remote dispatcher so DotNetty doesn't fire up its own threadpool. It might be a bit of a pain in the ass but I can try.

@to11mtm
Member

to11mtm commented Dec 5, 2020

yeah, that was my thinking too @to11mtm - I think it's a combination of factors.

One thing I can do - make an IEventLoop that runs on the Akka.Remote dispatcher so DotNetty doesn't fire up its own threadpool. It might be a bit of a pain in the ass but I can try.

Looks at everything needed to implement IEventLoop and its inheritors. Ouch. That said, there could be some ancillary benefits from being on the same threadpool in that case, data cache locality and the like. I know with my transport work, there were some scenarios where putting everything in the same pool (i.e. remote, tcp workers, streams) gave benefits. Not just from a 'fewer threadpools' standpoint either... There were some scenarios where a dispatcher with affinity (science experiment here) gave major boosts to performance in low message traffic scenarios.

@Zetanova
Contributor Author

Zetanova commented Dec 5, 2020

@Aaronontheweb The issue only appears in a formed cluster, with or without load. There can be zero user actors on the node.

What makes up most of the "idle-CPU" usage is spin-waiting:
most of the mutexes/timers do it before their thread gets freed/paused.

If there is absolutely no work there are no spin-waits,
but if a work item arrives from time to time (every 500ms or 1000ms)
the spins will happen.

The Akka scheduler ticks every 100ms,
and I think cluster/DotNetty is implemented with a ticker too.

@Aaronontheweb please try a cluster with 3-5 nodes: https://github.com/Zetanova/akka.net/tree/helios-idle-cpu-pooled
I disabled the DTP there completely.
Or please tell me how I can run the MultiNode unit tests; somehow I don't get it.

@Aaronontheweb modified the milestones: 1.4.13, 1.4.14 (Dec 16, 2020)
@Aaronontheweb
Member

https://github.com/Aaronontheweb/akka.net/tree/feature/IEventLoopGroup-dispatcher - tried moving the entire DotNetty IEventLoopGroup on top of the Akka.Remote dispatcher. Didn't work - DotNetty's pipeline is tightly coupled to its concurrency constructs. Wanted to cite some proof of work here though.

We're working on multiple parallel attempts to address this.

@Zetanova
Contributor Author

I am pretty sure that the idle load comes from the spin-wait of a wait handle, combined with components like DotNetty ticking every <40ms.
What happens is:

Case A

  1. Work item arrives (no-op tick or real work item)
  2. Wait handle gets signaled
  3. Thread awakes
  4. Thread processes work items
  5. Thread has no work, waits for a new signal or a timeout
  6. Because it waits on a signal, it spin-waits for a short time until the thread gets fully paused

Case B

  1. Timeout happens
  2. Wait handle gets signaled by the timeout
  3. Thread awakes
  4. No work items to process
  5. Thread has no work, waits for a new signal or a timeout
  6. Because it waits on a signal, it spin-waits for a short time until the thread gets fully paused

If the timeout is very low (<30ms) or the signal of a no-op tick comes very frequently (<30ms),
the spin-waits of the WaitHandle add up.

If the timeout is low, the fix would be just to remove the wait on the signal in "Case B / point 5",
to remove the spin-wait:

Case B

...
5) Thread has no work, wait ONLY on a short timeout (aka Thread.Sleep)
...
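In code, that change to Case B might look roughly like this (a sketch: workAvailable is assumed to be a ManualResetEventSlim-style event that spin-waits briefly before blocking, and noPendingWork/lastWakeWasTimeout/tickInterval are illustrative names):

if (noPendingWork && lastWakeWasTimeout)
{
    Thread.Sleep(tickInterval);        // plain sleep: no spin phase at all
}
else
{
    workAvailable.Wait(tickInterval);  // event wait: spins briefly, then blocks
}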

@Aaronontheweb
Member

I'm in agreement on the causes here - just working on how to safely reduce the amount of "expensive non-work" occurring without creating additional problems.

@Aaronontheweb
Member

Achieved a 50% reduction in idle CPU here: #4678 (comment)
