Node hung in "joining" status #4290

Closed
drewjenkel opened this issue Mar 3, 2020 · 15 comments
@drewjenkel

Environment Details:
Akka.Net Versions:
Akka.Cluster.Tools Version="1.3.17"
Akka.DI.AutoFac Version="1.3.9"
Akka.DI.Core Version="1.3.17"
Akka.Logger.NLog Version="1.3.4"
Autofac.Extensions.DependencyInjection Version="5.0.1"
Petabridge.Cmd.Cluster Version="0.6.3"
Petabridge.Cmd.Remote Version="0.6.3"
Petabridge.Tracing.ApplicationInsights Version="0.1.3"

Other Libraries:
Microsoft.Extensions.Hosting Version="2.1.0"

.NET Core 2.1.

Deployed via Docker Container for Kubernetes on AKS

PBM embedded into Docker Image

Active Configuration:

```hocon
akka {
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]
  remote {
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      port = 7770
      hostname = 0.0.0.0
      public-hostname = "lighthouse-0.lighthouse"
      enforce-ip-family = true
      dns-use-ipv6 = false
    }
    log-remote-lifecycle-events = INFO
  }
  cluster {
    seed-nodes = ["akka.tcp://andor@lighthouse-0.lighthouse:7770", "akka.tcp://andor@lighthouse-1.lighthouse:7770", "akka.tcp://andor@lighthouse-2.lighthouse:7770"]
    roles = ["lighthouse"]
    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-referee
      keep-referee {
        address = "akka.tcp://andor@lighthouse-0.lighthouse:7770"
        down-all-if-less-than-nodes = 1
      }
    }
  }
}

petabridge.cmd {
  host = "0.0.0.0"
  port = 9110
  log-palettes-on-startup = on
}
```

Environment Size: 60-70 pods at a given time.

On occasion, I will have a pod enter a state where it sees itself as joining and gets hung. I have attached the logs for the affected service, a Lighthouse node, and the cluster leader.

After a node gets "hung", we randomly see log messages indicating that it is receiving gossip intended for a different node.

It appears that the node is downed by the leader, but because the hung node is still "joining", it never downs itself and becomes confused. It never leaves the cluster and continues to gossip.
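A mitigation I'm considering (assuming the setting exists in the Akka.NET version in use; `shutdown-after-unsuccessful-join-seed-nodes` was ported from JVM Akka into the 1.4 line, so it likely isn't available on 1.3.17) is to have a node that never finishes joining terminate itself:

```hocon
akka.cluster {
  # Hypothetical mitigation sketch: shut the ActorSystem down if this node
  # has not completed joining its seed nodes within the timeout, instead of
  # sitting in "joining" forever. Availability depends on the Akka.NET
  # version (1.4+).
  shutdown-after-unsuccessful-join-seed-nodes = 60s
}
```

Combined with the container restart policy, that would recycle a hung pod automatically instead of requiring manual intervention.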

Trying to PBM into the affected node, I cannot down, join, exit, or do anything else. The only way to recover has been to delete the pod or scale down the StatefulSet.
akkalogs.zip
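For reference, the petabridge.cmd session against the hung node looks roughly like this (inline syntax and exact arguments may vary by Petabridge.Cmd version; the addresses are illustrative):

```
# connect to the hung pod's pbm endpoint (port 9110 per the config above)
pbm <pod-ip>:9110

# inside the session -- none of these have any effect on the hung node:
cluster show                 # shows the node stuck in Joining
cluster leave                # no-op
cluster down <node-address>  # argument syntax per `cluster help`; also a no-op
```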

I believe this is similar to these issues:

#2584
#3274
#2346

@drewjenkel
Author

An update: we switched the AKS Kubernetes networking off the advanced CNI and onto standard K8s networking, which resolved a separate issue. We are monitoring our clusters to see if this issue returns.

@drewjenkel
Author

I can confirm now that this is still happening. I have more verbose logs of the event, and I'm attaching them.

@drewjenkel
Author

```
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
2020-03-09 17:53:31.5790| INFO|Cluster Node [akka.tcp://andor@andor-patient-data-0.andor-patient-data:7770] - Welcome from [akka.tcp://andor@lighthouse-0.lighthouse:7770]
```

@Aaronontheweb
Member

> 2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed

Looks like that eventually succeeded - but I'll double-check. Is this a seed node logging this issue, or a joining node?

@drewjenkel
Author

Right. That's what is odd. It is a joining node, not a seed node.

Here's a little more on it. The pattern goes like this. (Still happening, but I have mitigating controls in place for the moment while I'm still looking into it.)

1. Node comes online.
2. Node gets the welcome message.
3. Everything is good for a few seconds; then, one at a time, I get a disassociated message followed by a change of node leader, until it has done that with every other node in the cluster. (The other nodes don't see a leader change.)

Eventually, every node except itself is "unreachable", and if we look, it's stuck as "Joining". It continues spamming all the other nodes with gossip, which produces the "irrevocably failed, must be restarted" errors, yada yada.

I have a node in my environment right now that continues to either do this, or, if it does get to join, it crashes within a couple of minutes with the same behavior observed.

Does that make sense?
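If it helps with diagnosis, here's a minimal sketch (using Akka.NET's standard cluster-event subscription API; the actor itself is hypothetical) of what I could deploy to capture that exact sequence of member, reachability, and leader-change events:

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Logs every membership transition, unreachability event, and leader change
// that this node observes, so the "one node at a time" pattern is visible.
public class ClusterStateLogger : ReceiveActor
{
    private readonly Cluster _cluster = Cluster.Get(Context.System);
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ClusterStateLogger()
    {
        Receive<ClusterEvent.IMemberEvent>(e => _log.Info("Member event: {0}", e));
        Receive<ClusterEvent.UnreachableMember>(u => _log.Warning("Unreachable: {0}", u.Member));
        Receive<ClusterEvent.LeaderChanged>(l => _log.Info("Leader changed to: {0}", l.Leader));
    }

    protected override void PreStart() =>
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.IMemberEvent),
            typeof(ClusterEvent.UnreachableMember),
            typeof(ClusterEvent.LeaderChanged));

    protected override void PostStop() => _cluster.Unsubscribe(Self);
}

// Usage: system.ActorOf(Props.Create(() => new ClusterStateLogger()), "cluster-state-logger");
```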

@Aaronontheweb
Member

What's the load looking like in the K8s cluster when this happens?

@drewjenkel
Author

I have 6 nodes in the node pool.

CPU average around 32%. Other stats are nominal.

Would any specific stats help?

@drewjenkel
Author

FYI, I have one in this state right this minute.

@Aaronontheweb
Member

Aaronontheweb commented Mar 10, 2020 via email

@drewjenkel
Author

I'm looking into whether K8s does something with CPU or connection limits by default at the moment.

@drewjenkel
Author

I'm starting to wonder if Azure/AKS#1373 is the underlying issue and we are seeing network and DNS errors. I've seen a number of DNS "service name unknown" exceptions.
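As a quick check on that theory, here's a minimal sketch (hypothetical probe using the standard System.Net API; the hostname is taken from the config above) that resolves the seed-node service name in a loop from inside a pod to catch intermittent failures:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// Hypothetical diagnostic: repeatedly resolve the seed-node service name
// from inside a pod to surface intermittent AKS DNS failures.
class DnsProbe
{
    static async Task Main()
    {
        const string host = "lighthouse-0.lighthouse"; // from the config above

        while (true)
        {
            try
            {
                var entry = await Dns.GetHostEntryAsync(host);
                Console.WriteLine($"{DateTime.UtcNow:o} OK   {host} -> {string.Join(", ", entry.AddressList)}");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"{DateTime.UtcNow:o} FAIL {host}: {ex.Message}");
            }

            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```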

@Aaronontheweb Aaronontheweb modified the milestones: 1.4.2, 1.4.3 Mar 13, 2020
@Aaronontheweb
Member

@drewjenkel should we mark this as closed then?

@drewjenkel
Author

We can close. I have an active ticket open with Microsoft, and I believe I have this 99% figured out. If it turns out to be an Akka issue, I can reopen, or open a new one and reference this. I don't want to clog your issues.

@Aaronontheweb Aaronontheweb removed this from the 1.4.3 milestone Mar 17, 2020
@Aaronontheweb
Member

Thanks! Please let us know.
