Akka.Cluster: quarantining / reachability changes appear to be extremely sensitive #4849
Is this related to #4757? Every single time a DOWN does not seem to propagate, it's because there's at least one node in a TERMINATED state.
Looks to me like there are some issues here which can be easily reproduced during a large Akka.Cluster deployment (i.e. scaling from 3 to 25 nodes via `kubectl`).
https://gist.github.com/Aaronontheweb/66095c9340437c0576cf55876d65c1f7 - some theories I'm probing related to this
Worked on this as part of a "heartbeats not getting sent fast enough" hypothesis: #4882
Working theories as to what is causing this:
Some of these should be pretty easy to rule out by reading through the code - others are going to require a stress test that expands and contracts a cluster running inside Kubernetes, most likely.
Ruled out the …
@Aaronontheweb Just for clarity, do we know if this is (1) a regression, or (2) an existing issue? The reason I ask: if it's an existing issue and not a regression, here's one more theory:
I will add, FWIW, that we had to de-tune a -lot- of settings for our application;
I can try to find more specifics if it will help.
Recency bias makes it look like item 1 on this list, but I think it's probably an existing issue. What I think is different now versus a couple of years ago: it's gotten a lot easier, with the widespread adoption of Kubernetes and the increased availability of documentation and tooling for working with it within Akka.Cluster, to script larger and more complicated deployments. I can reproduce this problem locally by scaling https://github.com/petabridge/Petabridge.Phobos.Web from 3 nodes to 25 with a single `kubectl` command.
But yes! More specifics will help - if this issue can be fixed by making the failure detectors less sensitive, that's an acceptable fix.
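For context, here's a minimal sketch (C#, using the standard Akka.Cluster HOCON keys for the phi-accrual failure detector; the values and the system name are illustrative placeholders, not recommendations) of what "making the failure detectors less sensitive" usually means in practice:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Illustrative only: relaxing the cluster failure detector so short pauses
// (GC, CPU throttling) are less likely to flag a node as unreachable.
// These are standard Akka.Cluster settings; the values are example overrides.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.failure-detector {
        threshold = 12.0                 # default 8.0; higher = less sensitive
        acceptable-heartbeat-pause = 10s # default 3s; tolerate longer pauses
        heartbeat-interval = 1s          # how often heartbeats are sent
    }
");

var system = ActorSystem.Create("cluster-sketch", config);
```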
This actually might be part of the issue, as I see this often too - lots of nodes moving immediately to …
Going to port this missing MNTR spec from the JVM since it addresses the scenario I described above: https://github.com/akka/akka/blob/master/akka-cluster/src/multi-jvm/scala/akka/cluster/StressSpec.scala
Regarding cluster heartbeats, we are missing #26757
Related - one big part of this issue (I'm investigating multiple instances of this at once) was actually the result of a parser edge case in Akka.Bootstrap.Docker: petabridge/akkadotnet-bootstrap#128

Specifically, if you included a quoted FQN (i.e. for a split brain resolver) in the environment variables Akka.Bootstrap.Docker supports for overrides, it was possible to corrupt the configuration. This issue primarily affected the …

I'm about to deploy Lighthouse v1.5.2, which includes Akka.Bootstrap.Docker, to DockerHub and validate that my reproduction case no longer reproduces: https://github.com/Aaronontheweb/Akka.Cluster.SBRDemo - named as such because I thought it was the old split brain resolvers that were responsible for the damage; it didn't occur to me at first that the environment variable substitution could be responsible - turns out the old SBRs were fine.

Once that's finished, I think that was a big portion of this issue for at least some of the user cases I'm studying. There are others, however, where I believe the culprit is still the heartbeat system - so I'll continue porting the `StressSpec`.
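To make the parser edge case concrete, here's a purely hypothetical illustration (this is not the actual Akka.Bootstrap.Docker substitution code, and the FQN is a placeholder) of how re-quoting an environment-variable value that is already quoted produces malformed HOCON:

```csharp
using System;

// Hypothetical illustration of the quoting edge case - not the real
// Akka.Bootstrap.Docker implementation. The user supplies a value that is
// already quoted (common for FQNs that contain commas):
string envValue = "\"MyCompany.MySplitBrainResolverProvider, MyAssembly\"";

// A naive substitution that always wraps the value in quotes again:
string hocon = $"akka.cluster.downing-provider-class = \"{envValue}\"";

// Result: akka.cluster.downing-provider-class = ""MyCompany.MySplitBrainResolverProvider, MyAssembly""
// The doubled quotes make the HOCON invalid (or silently wrong), which can
// corrupt the rest of the parsed configuration.
Console.WriteLine(hocon);
```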
I think many of the quarantines were caused by Lighthouse nodes running off and forming their own clusters, being contacted by old nodes, getting "can't be in two clusters" talk-back, and general confusion in both networks. Messy - caused by a really stupid quoting issue in Akka.Bootstrap.Docker that has been present there for 2-3 years.
Found another instance of this same issue for the …
Might have found a root cause in #4899. Very strong evidence in the logs indicates that we don't properly clean up heartbeat data for old nodes. Not 100% sure that this is the bug we're looking for - but it seems likely. Going to investigate further tomorrow.
So what I've found via my draft `StressSpec`: if one of those nodes rotates back into the observation ring after another shift in the cluster (add 2 nodes, shift ring, remove those same 2 nodes, shift ring back), it's not clear if the heartbeat node ring does the right thing and completely resets the timers under all circumstances. The way the nodes leave the ring may also have something to do with it (i.e. unreachable + downing vs. gracefully leaving may not produce identical changes to the heartbeat node ring). I can clearly see the Phi values for these "unmonitored, yet still monitored" nodes in the logs.
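To illustrate the suspected invariant (a simplified sketch, not Akka.NET's actual `ClusterHeartbeatSender`/`HeartbeatNodeRing` code): whenever the ring shifts, any node rotated out of our set of monitored receivers should have its failure-detector history dropped, so that rotating back in later starts from a clean slate.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of the invariant under discussion - not the real Akka.NET code.
public class HeartbeatMonitorSketch
{
    // Per-node heartbeat arrival history that a phi-accrual detector would consume.
    private readonly Dictionary<string, List<long>> _heartbeatHistory = new();
    private HashSet<string> _monitoredReceivers = new();

    // Called whenever membership changes reshuffle the heartbeat ring.
    public void UpdateReceivers(IEnumerable<string> newReceivers)
    {
        var next = new HashSet<string>(newReceivers);

        // The step being questioned above: nodes rotated *out* of the ring must
        // have their history removed, otherwise stale inter-arrival data survives
        // until the node rotates back in and skews the phi calculation.
        foreach (var removed in _monitoredReceivers.Except(next).ToList())
            _heartbeatHistory.Remove(removed);

        _monitoredReceivers = next;
    }

    public void RecordHeartbeat(string node, long timestampMs)
    {
        if (!_monitoredReceivers.Contains(node))
            return; // ignore heartbeats from nodes we no longer monitor

        if (!_heartbeatHistory.TryGetValue(node, out var history))
            _heartbeatHistory[node] = history = new List<long>();
        history.Add(timestampMs);
    }
}
```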
Merged in #4934, which includes some fixes that should make it easier to debug heartbeat issues inside the cluster.
Finally caught some good logs that illustrate the issue really well, from the `StressSpec`:

```
[INFO][4/15/2021 9:25:12 PM][Thread 0065][akka.trttl.gremlin.tcp://StressSpec@localhost:51667/user/result10] [exercise join/remove round 1] completed in [9661] ms
```
So it looks like we attempt to clean up the failure detector registries here:

akka.net/src/core/Akka.Cluster/ClusterDaemon.cs (lines 1622 to 1623 in 87cc645)

and here:

akka.net/src/core/Akka.Cluster/ClusterDaemon.cs (lines 1964 to 1969 in 87cc645)
I'm wondering if the problem is that nodes immediately get promoted to `WeaklyUp`. We should also be cleaning up these failure detector registries when nodes leave the network, but it's clear from the logs that this doesn't happen consistently.
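A hedged sketch of the cleanup described above (the registry interface here is a stand-in that mirrors what a failure detector registry's `Remove` does; this is not the actual `ClusterDaemon` code):

```csharp
using System.Collections.Generic;
using Akka.Actor;

// Illustrative sketch only - not the actual ClusterDaemon implementation.
// The point from the comment above: every path by which a node leaves the
// cluster (downed, exited, removed) should purge its failure-detector entry.
public interface ISimpleFailureDetectorRegistry
{
    void Remove(Address node); // stand-in for the registry's Remove operation
}

public class MembershipCleanupSketch
{
    private readonly ISimpleFailureDetectorRegistry _failureDetector;
    private readonly HashSet<Address> _members = new();

    public MembershipCleanupSketch(ISimpleFailureDetectorRegistry failureDetector)
        => _failureDetector = failureDetector;

    public void OnMemberRemoved(Address node)
    {
        _members.Remove(node);
        _failureDetector.Remove(node); // clean up regardless of *how* the node left
    }

    // Downing and graceful exit should funnel into the same cleanup.
    public void OnMemberDowned(Address node) => OnMemberRemoved(node);
    public void OnMemberExited(Address node) => OnMemberRemoved(node);
}
```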
I've incorporated this fix into #4899 - which should be available for review soon. I'm also going to submit a port of akka/akka#29665 in a separate PR - once both of those are merged in, I think we can put this issue to bed for now.
Good to hear this. Once this is merged I will test the new nightly build.
This is ready for review here: #4940. @wesselkranenborg I'd wait until akka/akka#29665 is ported too before we call this issue 100% resolved, but I think this should help.
Going to start work on porting akka/akka#29665 tomorrow - aiming to release Akka.NET v1.4.19 sometime early next week, ideally.
Keeping this open while I do some additional work on the heartbeat system aimed at resolving some of this.
Port for …
Possibly related to this issue also: #4963
Keeping this issue open because, based on feedback from users who are running the nightly builds, this is still an issue. So our remaining leads in terms of what is causing this:
In addition to all of these changes, we've been doing multiple rounds of performance optimization to try to make total Akka.Cluster actor processing time as efficient as possible - we've made some significant gains there so far. Our key measure is the total amount of time it takes the …
Another possibility worth considering - the scheduler itself: #4640. The …
Two things to consider as far as the scheduler and its behavior as it sits today:
Yeah, that could certainly do it - a burst of accrued messages flying out all at once after a pause (i.e. Kubernetes throttling pods on a busy system).
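A toy model of that failure mode (this is not the actual `HashedWheelTimerScheduler`, just a sketch of why a paused process produces a burst): a tick-driven timer that falls behind fires every overdue task in one go when the process resumes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy tick-driven scheduler - not the real HashedWheelTimerScheduler.
// If the process (or its CPU quota) is paused, no ticks run; on resume,
// every overdue task fires back-to-back, e.g. several heartbeats at once.
public class TickSchedulerSketch
{
    private readonly List<(long DueAtMs, Action Task)> _pending = new();

    public void Schedule(long dueAtMs, Action task) => _pending.Add((dueAtMs, task));

    // Called once per tick with the current wall-clock time.
    public void Tick(long nowMs)
    {
        // After a long pause, many entries satisfy this condition simultaneously.
        var due = _pending.Where(p => p.DueAtMs <= nowMs).ToList();
        foreach (var entry in due)
        {
            _pending.Remove(entry);
            entry.Task(); // burst: all accrued messages fly out in the same tick
        }
    }
}
```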
Waiting on confirmation from some groups of users we've been in touch with (via Petabridge) on this issue, but I think we've determined that this was mostly an issue with K8s DNS. Although the improvements we made to Akka.Cluster in the interim here certainly don't hurt!
Reports are good - looks like all of the work we did here paid off.
Before the changes made as part of this issue, we couldn't reliably spin up an Akka.Cluster in AKS with around 5-10 nodes without a lot of nodes being quarantined the whole time and never re-forming a cluster. After these changes, I was able to create a stable cluster with around 50 Akka.Cluster nodes with almost no quarantines. Only a few quarantines happened during a load test, but after that the cluster stabilized itself again. So I can indeed confirm that all the work you did here definitely paid off. One small addition to the K8s DNS issue you mentioned: we were not facing that issue in our cluster. We already use NodeLocal DNSCache because we had DNS issues earlier, and that solved a lot of our performance and DNS resolution problems. Thanks for the hard work!
That's good to know - glad the work we did here paid off. We'll get v1.4.19 released soon.
Akka.NET version: 1.4.17
As the description states, in a cluster with the following composition:
- ~20 unreachable nodes
- ~40 weakly up nodes
- ~30 up nodes
Any attempts to `Down` the unreachable nodes successfully change their status from `Up`/`WeaklyUp` -> `Down`, but they are never correctly exited from the cluster, nor are any of the `WeaklyUp` nodes promoted to `Up`. Using a split brain resolver and a `pbm cluster down-unreachable` command have both demonstrated this.

This looks like a gossip convergence issue to me - going to investigate.
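For reference, a minimal sketch of what downing the unreachable nodes programmatically looks like (this uses the standard `Cluster.Down` API; a split brain resolver or `pbm cluster down-unreachable` drives the same operation, and the system name is a placeholder):

```csharp
using Akka.Actor;
using Akka.Cluster;

var system = ActorSystem.Create("cluster-system");
var cluster = Cluster.Get(system);

// Mark every currently-unreachable member as Down. In the scenario above this
// succeeds in moving the members to Down, but they are never exited from the
// cluster and no WeaklyUp members are promoted.
foreach (var member in cluster.State.Unreachable)
{
    cluster.Down(member.Address);
}
```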