Node hung in "joining" status #4290
Comments
An update: we switched the AKS Kubernetes networking from the Advanced CNI to standard K8s networking, which resolved a separate issue. We are monitoring our clusters to see if this issue returns.
I can confirm this is still happening. I have more verbose logs of the event, which I'm attaching.
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
Looks like that eventually succeeded in the end, but I'll double-check that. Is this a seed node logging this issue, or a joining node?
Right, and that's what is odd. It is a joining node, not a seed node. Here's a little more on it. The pattern goes like this (it's still happening, but I have mitigating controls in place for the moment while I keep looking into it): the node comes online, and eventually every node except itself is "unavailable"; if we look, the node is stuck as "Joining". It keeps spamming all the other nodes with gossip, which get the "irrevocably failed, must be restarted" response, and so on. I have a node in my environment right now that either keeps doing this, or, if it does manage to join, crashes within a couple of minutes with the same behavior. Does that make sense?
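As a diagnostic aid, here is a minimal sketch (not from the original thread) of an actor that logs member-status and reachability changes; running it on the affected node makes the "stuck in Joining while everyone else looks unavailable" pattern easy to spot in the NLog output. The class name is illustrative.

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Illustrative diagnostic actor: logs every member-status and reachability event
// this node observes, so a node stuck in Joining shows up clearly in the logs.
public class ClusterEventLogger : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();
    private readonly Cluster _cluster = Cluster.Get(Context.System);

    public ClusterEventLogger()
    {
        Receive<ClusterEvent.IMemberEvent>(e =>
            _log.Info("Member {0} is now {1}", e.Member.Address, e.Member.Status));
        Receive<ClusterEvent.UnreachableMember>(e =>
            _log.Warning("Member {0} marked unreachable", e.Member.Address));
        Receive<ClusterEvent.ReachableMember>(e =>
            _log.Info("Member {0} is reachable again", e.Member.Address));
    }

    protected override void PreStart() =>
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.IMemberEvent), typeof(ClusterEvent.IReachabilityEvent));

    protected override void PostStop() => _cluster.Unsubscribe(Self);
}
```

It can be started at system startup with `system.ActorOf(Props.Create(() => new ClusterEventLogger()))`.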
What's the load looking like in the K8s cluster when this happens? |
I have 6 in the node pool. CPU averages around 32%; other stats are nominal. Would any specific stats help?
FYI, I have one in this state right this minute.
If there are any K8s limits on the number of open connections in your cluster, that would be interesting data to have too. But it looks like it's not a load issue.
I'm looking into whether K8s applies any CPU or connection limits by default at the moment.
I'm starting to wonder if Azure/AKS#1373 is the underlying issue and we are just seeing network and DNS errors. I've seen a number of DNS "service name unknown" exceptions.
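For what it's worth, one quick way to test the DNS hypothesis from inside an affected pod is a probe along the following lines. This is only a sketch; it assumes the lighthouse hostnames shown in the seed-node configuration later in this issue, and a failure here would be consistent with the "service name unknown" exceptions.

```csharp
using System;
using System.Net;

// Illustrative DNS probe: resolves the lighthouse service names from the
// seed-node list and prints either the resolved addresses or the error.
class DnsProbe
{
    static void Main()
    {
        var hosts = new[]
        {
            "lighthouse-0.lighthouse",
            "lighthouse-1.lighthouse",
            "lighthouse-2.lighthouse"
        };

        foreach (var host in hosts)
        {
            try
            {
                var entry = Dns.GetHostEntry(host);
                Console.WriteLine($"{host} -> {string.Join(", ", entry.AddressList)}");
            }
            catch (Exception ex)
            {
                // A SocketException here mirrors the "service name unknown" errors.
                Console.WriteLine($"{host} -> FAILED: {ex.GetType().Name}: {ex.Message}");
            }
        }
    }
}
```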
@drewjenkel should we mark this as closed then? |
We can close it. I have an active ticket open with Microsoft, and I believe I have this 99% figured out. If it turns out to be an Akka issue, I can reopen this or open a new issue and reference it. I don't want to clog up your issue tracker.
Thanks! Please let us know. |
Environment Details:
Akka.Net Versions:
Akka.Cluster.Tools Version="1.3.17"
Akka.DI.AutoFac Version="1.3.9"
Akka.DI.Core Version="1.3.17"
Akka.Logger.NLog Version="1.3.4"
Autofac.Extensions.DependencyInjection Version="5.0.1"
Petabridge.Cmd.Cluster Version="0.6.3"
Petabridge.Cmd.Remote Version="0.6.3"
Petabridge.Tracing.ApplicationInsights Version="0.1.3"
Other Libraries:
Microsoft.Extensions.Hosting Version="2.1.0"
.NET Core 2.1
Deployed via Docker container to Kubernetes on AKS
PBM embedded in the Docker image
Active Configuration:

akka {
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]
  remote {
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      port = 7770
      hostname = 0.0.0.0
      public-hostname = "lighthouse-0.lighthouse"
      enforce-ip-family = true
      dns-use-ipv6 = false
    }
    log-remote-lifecycle-events = INFO
  }
  cluster {
    seed-nodes = ["akka.tcp://andor@lighthouse-0.lighthouse:7770", "akka.tcp://andor@lighthouse-1.lighthouse:7770", "akka.tcp://andor@lighthouse-2.lighthouse:7770"]
    roles = ["lighthouse"]
    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-referee
      keep-referee {
        address = "akka.tcp://andor@lighthouse-0.lighthouse:7770"
        down-all-if-less-than-nodes = 1
      }
    }
  }
}
petabridge.cmd {
  host = "0.0.0.0"
  port = 9110
  log-palettes-on-startup = on
}
Environment Size: 60-70 pods at a given time.
On occasion, a pod will enter a state where it sees itself as joining and gets hung. I have attached the logs for the affected service, a lighthouse node, and the cluster leader.
After a node gets "hung", we randomly see messages indicating that the gossip it receives was intended for a different node.
It appears that the node is downed by the leader, but because the hung node is still "joining", it doesn't down itself and gets confused. It never leaves the cluster and continues to gossip.
Using PBM against the affected node, I cannot down, join, exit, or do anything else. The only way to recover has been to delete the pod or scale down the StatefulSet.
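One possible mitigation, sketched here under the assumption that a pod restart reliably clears the hung state: a startup watchdog that terminates the process if the node never reaches Up within a deadline, so the StatefulSet replaces the pod automatically instead of someone deleting it by hand. The class name and the deadline are illustrative, not part of the original report.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;

// Illustrative watchdog: if this node has not reached MemberStatus.Up within
// the given deadline, shut the ActorSystem down and exit non-zero so the
// orchestrator (here, the Kubernetes StatefulSet) restarts the pod.
public static class JoinWatchdog
{
    public static void Start(ActorSystem system, TimeSpan deadline)
    {
        var cluster = Cluster.Get(system);
        var joined = new ManualResetEventSlim(false);

        // Fires once this node's member status transitions to Up.
        cluster.RegisterOnMemberUp(() => joined.Set());

        Task.Run(() =>
        {
            if (!joined.Wait(deadline))
            {
                // Still stuck in Joining after the deadline: bail out.
                system.Terminate().Wait(TimeSpan.FromSeconds(30));
                Environment.Exit(1);
            }
        });
    }
}
```

Called as `JoinWatchdog.Start(system, TimeSpan.FromMinutes(5));` right after the ActorSystem is created, this turns the manual pod deletion into an automatic restart.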
akkalogs.zip
I believe this is similar to these issues:
#2584
#3274
#2346