High CPU Load for idle clusters #5400
Comments
#5390 might also help - it alters the "waiting" mechanism used by the
@markusschaber You can use my https://github.com/Zetanova/Akka.Experimental.ChannelTaskScheduler, but you need to downgrade to Akka 1.4.21; because of some improvements to Ask, it is very racy at startup in later versions. I made a few PRs to fix the cluster startup and hope the racy startup is fixed with 1.4.29. My ChannelTaskScheduler does not reduce all idle CPU to zero (as it should), but it removes the scaling issue completely. I currently run 16 nodes on k8s and every one of them idles between 20m and 50m.
The other option would be to switch the dispatcher to the built-in but unused TaskPoolDispatcher.
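For reference, switching onto a different executor is a HOCON-level change. Below is a minimal sketch, assuming the `task-executor` executor name; the dispatcher id `my-task-dispatcher` is invented for illustration:

```csharp
// Minimal sketch: running an actor on a TPL-backed executor via HOCON.
// "my-task-dispatcher" is an illustrative id; "task-executor" is the
// executor name Akka.NET maps to the .NET Task pool.
using Akka.Actor;
using Akka.Configuration;

class EchoActor : ReceiveActor
{
    public EchoActor() => ReceiveAny(msg => Sender.Tell(msg)); // trivial echo
}

class Program
{
    static void Main()
    {
        var config = ConfigurationFactory.ParseString(@"
            my-task-dispatcher {
                type = Dispatcher
                executor = task-executor  # schedule mailbox runs on the TPL pool
                throughput = 100
            }");

        using var system = ActorSystem.Create("Demo", config);

        // Only this actor runs on the task-executor-backed dispatcher.
        var echo = system.ActorOf(
            Props.Create<EchoActor>().WithDispatcher("my-task-dispatcher"),
            "echo");

        echo.Tell("ping");
        System.Console.ReadLine(); // keep the system alive for the demo
    }
}
```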
Can you test with the latest nightly to verify these are fixed?
As for the DotNetty CPU issues - can’t fix those without replacing the transport, which we are planning on doing but it’s a ways out.
@Aaronontheweb Our unit-test traffic is already going through fully encrypted, with crypto routing. One other option that we could start to think about is the multi-home problem. Currently, Akka.Cluster even has problems with normal DNS names.
@Zetanova I'll try to test on Monday. We're in the European time zone :-)
I tried the nightly build 1.4.29-betaX now, and it did fix the cluster startup problems.
@Zetanova this looks pretty nice; it looks like it might be able to support TCP as well with some work? (Just thinking about environments where UDP might be an issue.) FWIW, building a Transport is deceptively simple, with one important caveat. @Aaronontheweb can correct my poor explanation here, but there's a point during handshaking where some of the inbound flows need to remain 'corked' while the AssociationHandle is being created. In the DotNetty transport this is handled via setting the channel's
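To make the 'corking' caveat concrete: the truncated reference above is presumably to DotNetty's per-channel AutoRead flag. A hedged sketch of the idea only; the handler and helper names are invented, and this is not the actual Akka.Remote transport code:

```csharp
using System.Threading.Tasks;
using DotNetty.Transport.Channels;

// Hedged sketch of the "corking" step: HandshakeHandler and
// CreateAssociationHandleAsync are invented names for illustration.
class HandshakeHandler : ChannelHandlerAdapter
{
    public override void ChannelActive(IChannelHandlerContext context)
    {
        // Cork: stop pulling inbound bytes until the AssociationHandle exists,
        // so early frames can't race past the not-yet-wired upper layers.
        context.Channel.Configuration.AutoRead = false;

        CreateAssociationHandleAsync(context).ContinueWith(_ =>
            // Uncork: the association is registered, resume reading.
            context.Channel.Configuration.AutoRead = true);

        context.FireChannelActive();
    }

    static Task CreateAssociationHandleAsync(IChannelHandlerContext context)
        => Task.CompletedTask; // placeholder for the real registration step
}
```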
@to11mtm UDP fits very well for mesh/VPN and encryption; WireGuard itself would be optimal for Akka too, but it has usage restrictions. Akka is a purely message-based system and doesn't really need a persistent connection between each node.
Hmm. When I build our real services against 1.4.29-beta637735681605651069, the load per service seems to be a bit lower, but still around 190 mCPU, compared to 210 mCPU with 1.4.27. (Are there nightly builds of Lighthouse I could use to nail down where the difference is?) I could not try the experimental ChannelTaskScheduler yet. It seems there's no NuGet package available, and our policy forbids copy-pasting 3rd-party code into our projects, so I'll need to package it and host it on our internal NuGet feed, which takes some time (busy with other work right now...)
@markusschaber ah, my comment was for @Zetanova to resolve his startup issue with the
Ah, I see... And it seems there's quite some jitter in the mCPU usage; after some time, I also get phases with about 210 mCPU with the nightly...
After hacking together a solution using https://github.com/Zetanova/Akka.Experimental.ChannelTaskScheduler with 1.4.29-beta637735681605651069, it got considerably better. Running the same services with the default HOCON for the ChannelTaskScheduler, the CPU usage is down to 60-90 mCPU, so around 1/2 to 1/3 of the original CPU usage.
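For anyone else wiring this up, the shape of the change is roughly as follows. This is a sketch assuming the `channel-executor` executor name the scheduler registers; treat the repo's default HOCON as authoritative:

```csharp
// Hedged sketch: routing the main dispatchers through the channel executor.
// The "channel-executor" name and dispatcher paths follow the
// ChannelTaskScheduler README; exact keys/values may differ per version.
using Akka.Actor;
using Akka.Configuration;

var config = ConfigurationFactory.ParseString(@"
    akka.actor.default-dispatcher.executor = channel-executor
    akka.actor.internal-dispatcher.executor = channel-executor
    akka.remote.default-remote-dispatcher.executor = channel-executor
    akka.remote.backoff-remote-dispatcher.executor = channel-executor");

using var system = ActorSystem.Create("MyCluster", config);
// ... start the cluster as usual; all four dispatchers now share the scheduler.
```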
The times I've tested that, there have been some throughput tradeoffs - but on balance that might be the better trade for your use case. In terms of replacing the DotNetty transport - I'd be interested in @Zetanova's ideas there, and I have one of my own (a gRPC transport - some corporate users rolled their own and had considerably higher throughput than DotNetty) that we can try in lieu of Artery, which is a much bigger project.
Thanks for your efforts. I'm looking forward to an official solution, which can be used in production code without bending compliance rules. :-)
Naturally - if @Zetanova is up for sending in a PR with the upgraded ChannelTaskScheduler.
As for some alternative transports, I'd need to write up something lengthier on that in a separate issue, but I'm open to doing that as well - even prior to Akka.NET v1.5 and Artery.
Thank you very much! As far as I can see, the main issue with the schedulers is the busy loops; things like Thread.Sleep(0) in tight loops seem to burn most of the CPU in our case. I might try to look into that on my own, and submit a pull request if anything valuable comes out. If at all possible, I'd like to have something more like 10-20 mCPU per service if there's no traffic...
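To illustrate the point about busy loops: Thread.Sleep(0) only yields the remainder of the time slice, so a polling thread stays runnable and keeps a core hot, whereas a kernel-backed wait parks it. A contrast sketch, with names and structure invented for illustration:

```csharp
// Contrast sketch: why Thread.Sleep(0) polling burns idle CPU.
using System.Collections.Concurrent;
using System.Threading;

var queue = new ConcurrentQueue<int>();
var signal = new SemaphoreSlim(0);

// Busy-wait consumer: lowest latency, but ~100% of a core even when idle.
void SpinConsumer()
{
    while (true)
    {
        if (queue.TryDequeue(out var item)) Process(item);
        else Thread.Sleep(0); // yields, but never blocks: CPU stays hot
    }
}

// Blocking consumer: the OS parks the thread until a producer signals.
// Producers pair queue.Enqueue(x) with signal.Release().
void BlockingConsumer()
{
    while (true)
    {
        signal.Wait(); // futex/kernel-backed wait: ~0% CPU while idle
        if (queue.TryDequeue(out var item)) Process(item);
    }
}

static void Process(int item) { /* handle the message */ }
```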
"Expensive waiting" is a tricky problem - that and scaling the |
I'm not sure whether busy waiting actually brings enough benefits, compared to just using a lock / SemaphoreSlim or similar primitives that use the OS scheduler. (As far as I know, "modern" primitives like SemaphoreSlim already use optimized mechanisms like futexes and fine-tuned spinning under the hood.) Independently, one could argue that any starvation from using the normal thread pool is either a misconfiguration of the thread pool (not enough minimum threads), or a misuse of the thread pool (long-running tasks should go to a dedicated thread, blocking I/O should be replaced by async, etc...). Whether that kind of reasoning is acceptable to your users is an entirely different question, and apparently, minds much smarter than mine have to fight tricky thread starvation problems (see https://ayende.com/blog/177953/thread-pool-starvation-just-add-another-thread or StephenCleary/AsyncEx#107 (comment) for examples...) - there's a reason one of our services had a line like
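The truncated line above presumably ends in a ThreadPool.SetMinThreads call (an assumption on my part). A hedged sketch of that classic mitigation, with an illustrative value:

```csharp
// Sketch of the starvation band-aid the truncated line likely refers to:
// raising the pool's minimum thread count so blocking bursts don't wait on
// the pool's slow thread-injection rate (roughly one new thread per ~500 ms
// above the minimum). The value 200 is illustrative, not advice.
using System;
using System.Threading;

ThreadPool.GetMinThreads(out var minWorker, out var minIo);
Console.WriteLine($"before: worker={minWorker}, io={minIo}");

ThreadPool.SetMinThreads(Math.Max(minWorker, 200), minIo);
```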
In our case, the issue is simple: of the solutions we tried years ago (i.e. Akka.NET 1.0-1.1), separating the workloads at the thread level was what offered the highest throughput in exchange for the least amount of total complexity. Fine-tuning the performance of how that
Our job itself isn't so simple - the prioritization has to be handled somewhere; delegating everything to the
Hmm, having a closer look at the DedicatedThreadPool, it says:
Maybe we could just solve this problem with some kind of "stack" of SemaphoreSlim or similar, so we wake up only one thread at a time - the most recent waiter being the one on top of the stack. On the other hand, I'm not really sure whether the implied definition of "locality" really fits modern "big iron" hardware, which requires NUMA awareness etc. for best results. I see a contradiction between "the more CPUs we have, the bigger the chance that another CPU will queue some work while we poll" and "the more CPUs we have, the less likely the thread which most recently began waiting is actually on the right CPU (or close to it in the NUMA sense)." Of course, this usually does not apply to "small" machines like single-socket desktop machines, but on those, it's also less likely that another CPU can queue other work while all CPUs are busy polling the UnfairSemaphore. ;-)
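A minimal sketch of that idea, assuming a ConcurrentStack of per-waiter semaphores; illustrative only, not the UnfairSemaphore implementation:

```csharp
// Illustrative sketch of the "stack of SemaphoreSlim" idea: each waiter
// parks on its own semaphore and pushes it onto a stack, so Signal() wakes
// the most recent waiter (whose caches are most likely still warm).
using System.Collections.Concurrent;
using System.Threading;

sealed class LifoSignal
{
    private readonly ConcurrentStack<SemaphoreSlim> _waiters = new();

    public void Wait()
    {
        var gate = new SemaphoreSlim(0, 1);
        _waiters.Push(gate); // newest waiter lands on top
        gate.Wait();         // park until a Signal() pops us
        gate.Dispose();
    }

    public void Signal()
    {
        // NB: a signal arriving while no one waits is dropped here; a real
        // implementation would also carry a pending-signal count.
        if (_waiters.TryPop(out var gate))
            gate.Release(); // wake exactly one thread, LIFO order
    }
}
```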
I bet we could parameterize https://github.com/akkadotnet/akka.net/blob/dev/src/benchmark/Akka.Benchmarks/Actor/PingPongBenchmarks.cs to switch between
How would you create a benchmark to measure idle CPU? I've wondered about that in the past, but without firing up an external system like Docker and collecting system metrics on a Lighthouse instance, I'm not sure how to automate that.
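One low-tech possibility, sketched below: sample the process's own CPU time across a quiet window and convert it to mCPU. This assumes measuring in-process is acceptable (no Docker or external metrics); the 30-second window is arbitrary:

```csharp
// Hedged sketch: approximate idle CPU by sampling the process's CPU time
// over a quiet window. Not a BenchmarkDotNet benchmark, just a measurement
// a test could assert a threshold against.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Akka.Actor;

using var system = ActorSystem.Create("IdleProbe");

var proc = Process.GetCurrentProcess();
var cpuBefore = proc.TotalProcessorTime;
var wall = Stopwatch.StartNew();

await Task.Delay(TimeSpan.FromSeconds(30)); // no messages flowing here

proc.Refresh();
var cpuUsed = proc.TotalProcessorTime - cpuBefore;

// 1000 mCPU == one core fully busy; an idle node should be near zero.
var milliCpu = cpuUsed.TotalMilliseconds / wall.Elapsed.TotalMilliseconds * 1000;
Console.WriteLine($"idle load ≈ {milliCpu:F0} mCPU");
```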
I just noticed this issue has been fixed in the latest release. Nice! 👍
You can thank the .NET team for that one - we no longer needed our dedicated thread pool, which is where the high CPU utilization was coming from.
Version Information
reproduced with 1.4.27 and 1.4.28
Akka clustering, lighthouse
Describe the bug
Idle Akka clusters burn too much CPU.
To Reproduce
Steps to reproduce the behavior:
watch kubectl top pods --namespace=akka-cqrs
"Expected behavior
CPU load should be negligible (not exactly 0, as some cluster gossip is happening...)
Actual behavior
Even with 2 replicas, the CPU usage is rather high for an idle system. However, when increasing the number of replicas, the CPU usage per service also increases:
Starting 50 replicas outright renders my Kubernetes cluster unusable; kubectl commands fail with various timeout errors.
Screenshots
Output of
watch kubectl top pods --namespace=akka-cqrs
with 16 replicas:
Environment
Happens in different environments; the tests above were taken in a VM running Ubuntu, with 6 CPUs and 8 GB RAM, running a single-node Kubernetes cluster with microk8s installed via snap:
kubectl versions:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-18T02:34:11Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.3-3+9ec7c40ec93c73", GitCommit:"9ec7c40ec93c73c2281bdd2e4a75baf6247366a0", GitTreeState:"clean", BuildDate:"2021-11-03T10:17:37Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Additional context
Can be a serious cost factor in environments which are paid per CPU usage, like some cloud services. In our case it's some test and dev environments which are configured "smallish" and burn their "CPU burst quota" rather quickly.
This might be related to #4537 .