Akka.Cluster: quarantining / reachability changes appear to be extremely sensitive #4849

Closed
Aaronontheweb opened this issue Mar 15, 2021 · 37 comments · Fixed by #4940 or #4946

Comments

@Aaronontheweb
Member

Aaronontheweb commented Mar 15, 2021

Akka.NET version: 1.4.17

As the description states, in a cluster with the following composition:

~20 unreachable nodes
~40 weakly up nodes
~30 up nodes

Any attempts to Down the unreachable nodes successfully change their status from Up / WeaklyUp -> Down, but they are never correctly exited from the cluster, nor are any of the WeaklyUp nodes promoted to Up. Both a split brain resolver and a pbm cluster down-unreachable command have demonstrated this.

This looks like a gossip convergence issue to me - going to investigate.
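For reference, a minimal sketch (using the standard Akka.Cluster API rather than the pbm tooling mentioned above) of downing every currently-unreachable member programmatically:

```csharp
// Hedged sketch: down all currently-unreachable members via the Cluster extension.
// Assumes the ActorSystem is already configured with the cluster actor-ref provider.
using Akka.Actor;
using Akka.Cluster;

var system = ActorSystem.Create("my-cluster"); // cluster HOCON assumed to be supplied elsewhere
var cluster = Cluster.Get(system);

foreach (var member in cluster.State.Unreachable)
{
    // Marks the member as Down; the leader should then remove it once gossip converges.
    cluster.Down(member.Address);
}
```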

@ismaelhamed
Member

Is this related to #4757? Every single time a DOWN does not seem to propagate, it's because there's at least one node in the TERMINATED state.

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.18, 1.4.19 Mar 23, 2021
@Aaronontheweb Aaronontheweb changed the title Akka.Cluster: gossip doesn't properly remove downed nodes in clusters with large number of WeaklyUp members Akka.Cluster: quarantining / reachability changes appear to be extremely sensitive Mar 29, 2021
@Aaronontheweb
Member Author

Changed the title to: "Akka.Cluster: quarantining / reachability changes appear to be extremely sensitive"

Looks to me like there are some issues, which can be easily reproduced during a large Akka.Cluster deployment (i.e. scaling from 3 to 25 nodes via kubectl), that really shouldn't be occurring. Message traffic isn't all that high when this occurs - I think it has to do with the failure-detector being extremely overtuned in Akka.Cluster.

@Aaronontheweb
Member Author

https://gist.github.com/Aaronontheweb/66095c9340437c0576cf55876d65c1f7 - some theories I'm probing related to this

@Aaronontheweb
Member Author

Worked on this as part of a "heartbeats not being sent fast enough" hypothesis: #4882

@Aaronontheweb
Member Author

Aaronontheweb commented Apr 1, 2021

Working theories as to what is causing this:

  • Addition of AppVersion data skews reachability detection when a "rolling upgrade" is detected
  • Port scala akka PR #26816 to Akka.NET #4511 - introduction of the InternalDispatcher actually made heartbeat scheduling less reliable, especially on larger node rings
  • Increase default number of nodes for cluster failure detection #4097 - this change, which should have made it more difficult for the cluster to arrive at REACHABLE vs. UNREACHABLE decisions, doesn't really seem to have any effect, and any single node in the observability ring can effectively mark any of the others as unreachable.
  • Bug in the cluster heartbeat system that causes it to be overly sensitive to early heartbeat data at startup. This would be an issue with the Phi Accrual system if that were the case.
  • Pre-programmed latency between the arrival of gossip notifying a node about multiple joining nodes vs. the delay in actually receiving incoming connections from those nodes - i.e. with 3 up members in the cluster and 12 new nodes joining, every node has to monitor up to 9 other nodes; does the "first heartbeat" countdown start only after the Akka.Remote association goes through, or before?

Some of these should be pretty easy to rule out by reading through the code - others are going to require a stress test that expands and contracts a cluster running inside Kubernetes, most likely.

@Aaronontheweb
Member Author

Ruled out the AppVersion changes last night - but wanted to include them for completeness. Some of the more recent JVM clustering code factors AppVersion into reachability, but none of that has been ported yet. (I also think the way they're tracking versions throughout the cluster is probably wrong - it seems like it should be done on a per-role basis, not a cluster-wide basis.)

@to11mtm
Member

to11mtm commented Apr 1, 2021

@Aaronontheweb Just for clarity, do we know if this is (1) a regression, or (2) an existing issue?

Reason I ask is if it's an existing issue and not a regression, one more theory:

  • Schedulers operating at fixed rates rather than fixed delays; under fixed-rate scheduling (which is what we do currently), if there are GC pauses or heavy load (and frankly, on our cluster, starting up is the most CPU load we see on the app), this could lead to multiple heartbeats being sent out in quick succession, which ends up confusing the phi accrual detector.

> Message traffic isn't all that high when this occurs - I think it has to do with the failure-detector being extremely overtuned in Akka.Cluster.

I will add, FWIW, that we had to de-tune a -lot- of settings for our application:

  • Remoting Failure detector (99% sure, will check)
  • Cluster Failure detector
  • SBR Failure Detector (we're still on 1.4.14, fwiw)
  • Coordinated-shutdown settings (this was a biiiig one for sharding stability, needed to give everything enough time to properly shut down)

I can try to find more specifics if it will help.

@Aaronontheweb
Member Author

@to11mtm

> Just for clarity, do we know if this is (1) a regression, or (2) an existing issue?

Recency bias makes it look like item 1 on this list, but I think it's probably an existing issue. What I think is different now versus a couple of years ago: with the widespread adoption of Kubernetes and the increased availability of documentation and tooling for working with it in Akka.Cluster, it's gotten a lot easier to script larger and more complicated deployments.

I can reproduce this problem locally by scaling https://github.com/petabridge/Petabridge.Phobos.Web from 3 nodes to 25 with a single kubectl command. Several nodes go unreachable, get downed, and reboot - eventually the cluster forms and remains stable, but it takes minutes when it should take seconds. I suspect it was more cumbersome to do this so quickly in the past, and thus this problem remained largely undiscovered.

@Aaronontheweb
Member Author

But yes! More specifics will help - if this issue can be fixed by making the failure detectors less sensitive, that's an acceptable fix.
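For anyone following along, these are the kinds of HOCON knobs involved in de-tuning the failure detectors - the values below are purely illustrative (a sketch, not recommended settings):

```csharp
// Illustrative only: loosening the cluster and remote-watch failure detectors.
// Higher thresholds / longer acceptable pauses = slower to declare a node unreachable.
using Akka.Actor;
using Akka.Configuration;

var config = ConfigurationFactory.ParseString(@"
    akka.cluster.failure-detector {
        threshold = 12                    # default is 8
        acceptable-heartbeat-pause = 10s  # default is 3s; tolerates longer GC / CPU pauses
        heartbeat-interval = 1s
    }
    akka.remote.watch-failure-detector {
        threshold = 12                    # default is 10
        acceptable-heartbeat-pause = 20s  # default is 10s
    }
");

var system = ActorSystem.Create("my-cluster", config);
```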

@Aaronontheweb
Member Author

This actually might be part of the issue, as I see this often too - lots of nodes moving immediately to WeaklyUp and prolonging the general formation process: akka/akka#29665

@Aaronontheweb
Member Author

Going to port this missing MNTR spec from the JVM since it addresses the scenario I described above: https://github.com/akka/akka/blob/master/akka-cluster/src/multi-jvm/scala/akka/cluster/StressSpec.scala

@ismaelhamed
Member

Regarding cluster heartbeats, we are missing #26757

@Aaronontheweb
Member Author

Aaronontheweb commented Apr 3, 2021

Related - one big part of this issue (I'm investigating multiple instances of this at once) was actually the result of a parser edge case in Akka.Bootstrap.Docker: petabridge/akkadotnet-bootstrap#128

Specifically, if you included a quoted FQN (i.e. for a split brain resolver) in the environment variables Akka.Bootstrap.Docker supports for overrides, it was possible to corrupt the akka.cluster section of the HOCON because the un-parsed environment variable was not quoted properly for the HOCON tokenizer. This error was unstable because the Environment.GetEnvironmentVariables call returns entries in random order, so we couldn't reproduce it 100% of the time in real-world clusters. But this has now been fixed in Akka.Bootstrap.Docker v0.5.1.

This issue primarily affected the akka.cluster.seed-nodes value in Lighthouse and I was able to reproduce some split brains rather easily as a result (the bad parse resulted in an empty seed node list being produced.)

I'm about to deploy Lighthouse v1.5.2, which includes Akka.Bootstrap.Docker, to DockerHub and validate that my reproduction case no longer reproduces: https://github.com/Aaronontheweb/Akka.Cluster.SBRDemo - named as such because I thought it was the old split brain resolvers that were responsible for the damage; it didn't occur to me at first that it could be the environment variable substitution that was responsible - turns out the old SBRs were fine.

Once that's finished, I think that will have addressed a big portion of this issue for at least some of the user cases I'm studying. There are others, however, where I believe the culprit is still the heartbeat system - so I'll continue porting the StressSpec and some of the other JVM PRs referenced on this issue.

@Aaronontheweb
Member Author

I think many of the quarantines were caused by Lighthouse nodes running off and forming their own clusters, being contacted by old nodes, getting "can't be in two clusters" talk-back, and general confusion in both networks. Messy - caused by a really stupid quoting issue in Akka.Bootstrap.Docker that has been present there for 2-3 years.

@Aaronontheweb
Member Author

Found another instance of this same issue for the CLUSTER_SEEDS environment variable too: petabridge/akkadotnet-bootstrap#134 - it does not work well when quoted. We'll get that fixed in a future release.

@Aaronontheweb
Member Author

Might have found a root cause in #4899

Very strong evidence in those logs indicates that we don't properly clean up heartbeat data for old HeartbeatNodeRing instances, and that it's not an issue for most clusters until one of the old nodes that was previously monitored - and for which we now have very old time-stamped data - becomes one of our actively monitored ring members again. This causes an instant phi accrual failure event above the threshold to be fired, which marks the node as unreachable.

Not 100% sure that this is the bug we're looking for - but it seems likely. Going to investigate further tomorrow.

@Aaronontheweb
Member Author

So what I've found via my draft StressSpec thus far is that when the observer ring shifts as a result of normal Akka.Cluster topology changes, old entries for nodes that are no longer part of a particular node's observer ring aren't removed from the DefaultFailureDetectorRegistry - thus those Phi values quickly approach infinity as the timestamp delta grows quite large (because those timestamp values are no longer being updated once the HeartbeatSender drops the node from the ring).

If one of those nodes rotates back into the observation ring after another shift in the cluster (add 2 nodes, shift the ring; remove those same 2 nodes, shift the ring back), it's not clear whether the heartbeat node ring does the right thing and completely resets the timers under all circumstances. The way the nodes leave the ring may also have something to do with it (i.e. unreachable + downing vs. gracefully leaving may not produce identical changes to the heartbeat node ring).

I can see the Phi values for these "unmonitored, yet still monitored" nodes clearly in the StressSpec logs - I'm going to continue to investigate why. But in the meantime, I'm going to submit some additional pull requests to make it easier to troubleshoot and debug these types of issues.
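To illustrate why those stale entries matter, here's a simplified sketch of the phi calculation (the same logistic approximation the phi accrual detector is built on - illustrative, not the actual DefaultFailureDetectorRegistry code). The longer an entry's last-heartbeat timestamp goes without being refreshed, the larger phi becomes, so a node that rotates back into the ring against an old entry is over the threshold immediately:

```csharp
// Simplified sketch of phi accrual: phi is a function of the time since the last recorded
// heartbeat relative to the history of observed intervals. A registry entry whose timestamp
// is never refreshed produces an enormous phi the moment it is consulted again.
using System;

static double Phi(double millisSinceLastHeartbeat, double meanIntervalMs, double stdDevMs)
{
    var y = (millisSinceLastHeartbeat - meanIntervalMs) / stdDevMs;
    var e = Math.Exp(-y * (1.5976 + 0.070566 * y * y)); // logistic approximation of the normal CDF
    return millisSinceLastHeartbeat > meanIntervalMs
        ? -Math.Log10(e / (1.0 + e))
        : -Math.Log10(1.0 - 1.0 / (1.0 + e));
}

// With a ~1s heartbeat interval and ~100ms of jitter:
Console.WriteLine(Phi(1_100, 1_000, 100));   // heartbeat 100ms late   -> phi ~ 0.8 (healthy)
Console.WriteLine(Phi(300_000, 1_000, 100)); // entry stale for 5 mins -> phi is effectively infinite
```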

@Aaronontheweb
Member Author

Merged in #4934, which includes some fixes that should make it easier to debug heartbeat issues inside the cluster.

@Aaronontheweb
Member Author

Finally caught some good logs that illustrate the issue really well, from the StressSpec:


[INFO][4/15/2021 9:25:12 PM][Thread 0065][akka.trttl.gremlin.tcp://StressSpec@localhost:51667/user/result10] [exercise join/remove round 1] completed in [9661] ms
Akka.Cluster.GossipStats
[Monitor] [Subject] [count] [count phi > 1.0] [max phi]
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 1 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 1 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 8 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 1 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 8 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 1 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 8 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 8 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 6 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 2 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 1 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 1 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 7 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51649 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51650 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51651 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51652 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51653 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51657 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51658 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51660 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51665 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51666 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51667 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51668 10 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51669 0 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51768 9 0 0.00
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 akka.trttl.gremlin.tcp://StressSpec@localhost:51769 0 0 0.00
ClusterStats(gossip, merge, same, newer, older, vclockSize, seenLatest)
akka.trttl.gremlin.tcp://StressSpec@localhost:51649 CurrentClusterStats(15, 0, 13, 1, 1,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51650 CurrentClusterStats(19, 0, 14, 3, 2,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51651 CurrentClusterStats(17, 0, 13, 3, 1,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51652 CurrentClusterStats(16, 0, 13, 3, 0,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51653 CurrentClusterStats(12, 0, 8, 3, 1,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51657 CurrentClusterStats(12, 0, 10, 2, 0,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51658 CurrentClusterStats(21, 0, 18, 3, 0,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51660 CurrentClusterStats(12, 0, 9, 2, 1,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51665 CurrentClusterStats(19, 0, 15, 3, 1,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51666 CurrentClusterStats(10, 0, 7, 3, 0,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51667 CurrentClusterStats(12, 0, 9, 3, 0,3, 7)
akka.trttl.gremlin.tcp://StressSpec@localhost:51668 CurrentClusterStats(13, 0, 9, 3, 1,, )
akka.trttl.gremlin.tcp://StressSpec@localhost:51669 CurrentClusterStats(9, 0, 8, 1, 0,, )
[INFO][4/15/2021 9:25:12 PM][Thread 0118][remoting-terminator] Shutting down remote daemon.
[INFO][4/15/2021 9:25:12 PM][Thread 0118][remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.
[INFO][4/15/2021 9:25:12 PM][Thread 0118][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51657-4] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51657]
[INFO][4/15/2021 9:25:12 PM][Thread 0075][akka.trttl.gremlin.tcp://StressSpec@localhost:51667/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51768-16] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51667]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51768]
[INFO][4/15/2021 9:25:12 PM][Thread 0075][akka.trttl.gremlin.tcp://StressSpec@localhost:51667/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51769-15] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51667]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51769]
[INFO][4/15/2021 9:25:12 PM][Thread 0120][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51651-2] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51651]
[INFO][4/15/2021 9:25:12 PM][Thread 0121][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51667-1] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51667]
[INFO][4/15/2021 9:25:12 PM][Thread 0119][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51669-3] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51669]
[INFO][4/15/2021 9:25:12 PM][Thread 0120][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51658-5] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51658]
[INFO][4/15/2021 9:25:12 PM][Thread 0119][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51666-7] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51666]
[INFO][4/15/2021 9:25:12 PM][Thread 0121][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51652-10] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51652]
[INFO][4/15/2021 9:25:12 PM][Thread 0118][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51650-9] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51650]
[INFO][4/15/2021 9:25:12 PM][Thread 0119][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51660-6] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51660]
[INFO][4/15/2021 9:25:12 PM][Thread 0121][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51668-12] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51668]
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51657
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51658
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51660
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51769
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51650
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51651
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51669
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51652
Removed heartbeats to akka.trttl.gremlin.tcp://StressSpec@localhost:51653
[INFO][4/15/2021 9:25:12 PM][Thread 0119][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51653-11] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51653]
[INFO][4/15/2021 9:25:12 PM][Thread 0118][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51665-14] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51665]
[INFO][4/15/2021 9:25:12 PM][Thread 0119][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51649-13] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51649]
[INFO][4/15/2021 9:25:12 PM][Thread 0121][akka.trttl.gremlin.tcp://StressSpec@localhost:51768/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FStressSpec%40localhost%3A51769-8] Removing receive buffers for [akka.trttl.gremlin.tcp://StressSpec@localhost:51768]->[akka.trttl.gremlin.tcp://StressSpec@localhost:51769]
DeadLetter from [akka://StressSpec/deadLetters] to [akka://StressSpec/system/transports/throttlermanager.$gremlin.tcp$1/throttler8#719344009]: <InboundPayload(size = 4 bytes)>
[INFO][4/15/2021 9:25:12 PM][Thread 0119][remoting-terminator] Remoting shut down.
Replacing state with timestamp [66253] with timestamp [67258] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51669
Replacing state with timestamp [66253] with timestamp [67258] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51658
Replacing state with timestamp [66253] with timestamp [67258] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51657
Replacing state with timestamp [66253] with timestamp [67258] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51668
Replacing state with timestamp [66253] with timestamp [67258] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51660
Replacing state with timestamp [66253] with timestamp [67258] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51650
Replacing state with timestamp [66253] with timestamp [67259] for node akka.trttl.gremlin.tcp://StressSpec@localhost:51651
About to process very large timeDiff: 2028 vs mean: 1012 for node akka.trttl.gremlin.tcp://StressSpec@localhost:51768
About to process very large timeDiff: 2028 vs mean: 1012 for node akka.trttl.gremlin.tcp://StressSpec@localhost:51769

About to process very large timeDiff: 2028 vs mean: 1012 for node akka.trttl.gremlin.tcp://StressSpec@localhost:51768
About to process very large timeDiff: 2028 vs mean: 1012 for node akka.trttl.gremlin.tcp://StressSpec@localhost:51769

So 51768 is running inside the same process as akka.trttl.gremlin.tcp://StressSpec@localhost:51667 - and we can see from the logs that 51667 is still alive, but we also saw the remoting terminator shut down 51768 - that's why all of its buffers were removed. However, despite this graceful termination and exit from the cluster, 51667 is still tracking heartbeats for 51768. If 51768 were to restart with that same address, I think it might be marked as unreachable right away due to how stale its last recorded heartbeat timestamp is (and the gap will keep growing, since the timestamp now never gets updated). It looks to me like the cause of this issue is the HeartbeatNodeRing and the DefaultFailureDetectorRegistry<T> getting out of sync with each other.

@Aaronontheweb
Member Author

So it looks like we attempt to clean up the DefaultFailureDetectorRegistry<T> in a couple of places when nodes join the cluster:

```csharp
// remove the node from the failure detector
_cluster.FailureDetector.Remove(node.Address);
```

and

```csharp
// for all new joining nodes we remove them from the failure detector
foreach (var node in _latestGossip.Members)
{
    if (node.Status == MemberStatus.Joining && !localGossip.Members.Contains(node))
        _cluster.FailureDetector.Remove(node.Address);
}
```

I'm wondering if the problem is that nodes immediately get promoted to WeaklyUp before their Joining gossip arrives at some Akka.Cluster members, and as a result none of these failure detector resets get hit on some of the nodes in the observer ring for the joining nodes.

We should also be cleaning up these failure detector registries when nodes leave the network, but it's clear from the logs that this doesn't happen consistently.
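A hedged sketch of the kind of symmetric cleanup being described here - this is an assumed shape based on the fragments above, not the actual change that went into #4899:

```csharp
// Assumed shape, for illustration only: drop failure-detector state for members that
// are no longer present in the latest gossip, mirroring the Joining-node cleanup above.
foreach (var node in localGossip.Members)
{
    if (!_latestGossip.Members.Contains(node))
        _cluster.FailureDetector.Remove(node.Address);
}
```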

@Aaronontheweb
Member Author

I've incorporated this fix into #4899 - which should be available for review soon.

I'm also going to submit a port of akka/akka#29665 in a separate PR - once both of those are merged in I think we can put this issue to bed for now.

@wesselkranenborg
Contributor

Good to hear this. Once this is merged I will test the new nightly build.

@Aaronontheweb
Member Author

This is ready for review here: #4940

@wesselkranenborg I'd wait until akka/akka#29665 is ported too before we call this issue 100% resolved, but I think this should help.

@Aaronontheweb
Member Author

Going to start work on porting akka/akka#29665 tomorrow - going to aim for releasing Akka.NET v1.4.19 sometime early next week ideally.

@Aaronontheweb Aaronontheweb reopened this Apr 17, 2021
@Aaronontheweb
Member Author

Keeping this open while I do some additional work on the heartbeat system aimed at resolving some of this.

@Aaronontheweb
Member Author

The port for the WeaklyUp duration is live in #4946 - running the StressSpec now to measure its improvement on convergence speed.

@Aaronontheweb
Member Author

Possibly related to this issue also: #4963

@Aaronontheweb Aaronontheweb reopened this Apr 21, 2021
@Aaronontheweb
Member Author

Aaronontheweb commented Apr 21, 2021

Keeping this issue open because based on feedback from users who are running the nightly builds, this is still an issue.

So our remaining leads in terms of what is causing this:

  • Inconsistency with how the cluster handles changes in Membership / Reachability inside Gossip and ClusterEvents - this was exposed by some failed MNTR tests on "Bug: ClusterEvent.MemberRemoved can jump from MemberLeaving without being MemberExited" (#4948). We're currently working on refactoring the Gossip class into a MembershipState class (a port of "Refactoring of Gossip class, #23290" - akka/akka#23291), which will allow us to exhaustively test the ClusterDaemon's entire graph of possible transitions, without actually running Akka.Cluster, via an FsCheck model-based test: https://aaronstannard.com/fscheck-property-testing-csharp-part2/ - hoping to have that done tomorrow.
  • Due to known inefficiencies in how the DedicatedThreadPool waits for work on newly started / relatively idle systems, we suspect that a large concentration of relatively "unloaded" nodes all running on the same host machine (exactly what you'd expect to find on a Kubernetes Node, for instance) sees this "idle CPU" problem transform into an outright "noisy neighbor" problem, where "busily waiting" threads end up stealing so much CPU that it delays heartbeats and other time-sensitive messages. Our angle for pursuing this is creating a new System.Threading.Channel<T>-based dispatcher that constrains the degree of concurrency of the default /user dispatcher and all of the other dispatchers too, in order to allow fair(er) scheduling on the stand-alone .NET ThreadPool. This is currently being worked on in "Introduce ChannelExecutor" #4882, with promising results thus far, but we have some more work to do to mitigate the measured throughput impact - we think that's a side effect of how our RemotePingPong benchmark is designed. We will likely ship that ChannelDispatcher as an opt-in dispatcher for users who are running large Akka.NET clusters until we get a better idea of its impact, so that users who are happy with current Akka.NET performance aren't suddenly affected (a sketch of the opt-in configuration follows after this list).
  • Potentially, some Akka.Remote quarantining bugs are at work - we have submitted a patch for these here: Clean up bad outbound ACKs in Akka.Remote #4963
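For the ChannelExecutor work mentioned above, here's a sketch of what the opt-in configuration might look like once #4882 ships - the executor = channel-executor keys are an assumption based on the in-progress PR, so verify them against the merged PR and release notes:

```csharp
// Assumed opt-in configuration for the ChannelExecutor (#4882); illustrative only.
using Akka.Actor;
using Akka.Configuration;

var config = ConfigurationFactory.ParseString(@"
    akka.actor.default-dispatcher.executor = channel-executor
    akka.actor.internal-dispatcher.executor = channel-executor
    akka.remote.default-remote-dispatcher.executor = channel-executor
    akka.remote.backoff-remote-dispatcher.executor = channel-executor
");

var system = ActorSystem.Create("my-cluster", config);
```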

In addition to all of these changes, we've been doing multiple rounds of performance optimization to try to make total Akka.Cluster actor processing time as efficient as possible - we've made some significant gains there so far. Our key measure is the total amount of time it takes the StressSpec to complete its "join 2 nodes to 9" stage, which has decreased from a high of about 10s down to ~6s since we started work in earnest on it last week. We aim to get these changes implemented and deployed in a v1.4.19 patch as soon as possible.

@Aaronontheweb
Member Author

Another possibility worth considering - the scheduler itself: #4640

The HashedWheelTimer can be a bit hard on the CPU, so having dozens of Akka.NET processes all spread out over a small-ish number of virtual CPUs might have some interesting side effects.

@to11mtm
Member

to11mtm commented Apr 21, 2021

> Another possibility worth considering - the scheduler itself: #4640
>
> The HashedWheelTimer can be a bit hard on the CPU, so having dozens of Akka.NET processes all spread out over a small-ish number of virtual CPUs might have some interesting side effects.

Two things to consider as far as the scheduler and its behavior as it sits today:

  • The timer logic itself; IDK, something about the for loop in WaitForNextTick makes me feel like we could do better. Also, I do wonder whether a better structure/pattern than ConcurrentQueue exists for our use case.

  • The use of fixed rate versus fixed delay; we are using fixed rate currently (see the sketch after this list).

    • Let's assume two systems are configured to send out a heartbeat every 1 second.
      • We get a pause of 3 seconds on Node A.
      • After the pause, 3 heartbeats are sent immediately to Node B.
      • The phi accrual detector on Node B picks up these three heartbeats from A immediately.
      • The phi accrual detector on Node B now has skewed historical data.
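A minimal sketch of that difference in plain .NET (not the Akka.NET scheduler API; names are illustrative): fixed-rate computes each tick from the original schedule, so a stall is followed by a burst of catch-up heartbeats, while fixed-delay always waits a full interval after the previous send.

```csharp
// Illustrative only: fixed-rate vs. fixed-delay heartbeat loops in plain .NET.
using System;
using System.Threading.Tasks;

var interval = TimeSpan.FromSeconds(1);

// Fixed rate: the next due time is start + n * interval. After a 3-second stall the loop
// "catches up" by firing several heartbeats back-to-back, skewing the receiver's history.
async Task FixedRateLoop(Func<Task> sendHeartbeat)
{
    var next = DateTime.UtcNow;
    while (true)
    {
        next += interval;
        await sendHeartbeat();
        var wait = next - DateTime.UtcNow;
        if (wait > TimeSpan.Zero)
            await Task.Delay(wait); // a negative wait means the next heartbeat fires immediately
    }
}

// Fixed delay: always wait a full interval after the previous send - no catch-up burst.
async Task FixedDelayLoop(Func<Task> sendHeartbeat)
{
    while (true)
    {
        await sendHeartbeat();
        await Task.Delay(interval);
    }
}

// Demo: run both variants for a few beats.
_ = FixedRateLoop(() => { Console.WriteLine($"fixed-rate  heartbeat {DateTime.UtcNow:HH:mm:ss.fff}"); return Task.CompletedTask; });
_ = FixedDelayLoop(() => { Console.WriteLine($"fixed-delay heartbeat {DateTime.UtcNow:HH:mm:ss.fff}"); return Task.CompletedTask; });
await Task.Delay(TimeSpan.FromSeconds(3));
```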

@Aaronontheweb
Member Author

> The phi accrual detector on Node B now has skewed historical data.

Yeah that could certainly do it - the burst of accrued messages flying out all at once after a pause (i.e. Kubernetes throttling pods on a busy system)

@Aaronontheweb
Member Author

Waiting on confirmation from some groups of users we've been in touch with (via Petabridge) on this issue but I think we've determined that this was mostly an issue with K8s DNS. Although the improvements we made to Akka.Cluster in the interim here certainly don't hurt!

@Aaronontheweb
Member Author

Reports are good - looks like all of the work we did here paid off.

@wesselkranenborg
Contributor

Before the changes made as part of this issue, we couldn't reliably spin up an Akka.Cluster in AKS with around 5-10 nodes without a lot of nodes being quarantined the whole time and never re-forming a cluster. After these changes, I was able to create a stable cluster with around 50 Akka.Cluster nodes and almost no quarantines. A few quarantines happened during a load test, but after that the cluster stabilized itself again. So I can indeed confirm that all the work you did here definitely paid off.

Maybe one small addition to the k8s DNS issue you mentioned: we were not facing that in our cluster. We already use NodeLocal DNSCache because we had DNS issues earlier, and that solved a lot of our performance and DNS resolution problems.

Thanks for the hard work!

@Aaronontheweb
Member Author

That's good to know - glad the work we did here paid off. We'll get v1.4.19 released soon.
