Node hung in "joining" status #4290
Comments
An update: we switched the AKS Kubernetes networking from the Advanced CNI to standard K8s networking, which resolved a separate issue. We are monitoring our clusters to see if this issue returns.
I can confirm this is still happening. I have more verbose logs of the event, which I'm attaching.
2020-03-09 17:53:31.5231|DEBUG|Resolve of path sequence [/system/cluster/core/daemon/joinSeedNodeProcess-2#199240527] failed
Looks like that eventually succeeded in the end, but I'll double-check that. Is this a seed node logging this issue, or a joining node?
Right, and that's what is odd. It is a joining node, not a seed node. Here's a little more on it. The pattern goes like this (it's still happening, but I have mitigating controls in place for the moment while I keep looking into it): the node comes online, and eventually every node except itself is "unavailable"; if we look, the node is stuck as "Joining". It keeps spamming all the other nodes with gossip, which get the "irrevocably failed, must be restarted" response, and so on. I have a node in my environment right now that either keeps doing this, or, if it does manage to join, crashes within a couple of minutes with the same behavior. Does that make sense?
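As a diagnostic aid, here is a minimal sketch (not from the original thread) of an actor that logs member-status and reachability changes; running it on the affected node makes the "stuck in Joining while everyone else looks unavailable" pattern easy to spot in the NLog output. The class name is illustrative.

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Illustrative diagnostic actor: logs every member-status and reachability event
// this node observes, so a node stuck in Joining shows up clearly in the logs.
public class ClusterEventLogger : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();
    private readonly Cluster _cluster = Cluster.Get(Context.System);

    public ClusterEventLogger()
    {
        Receive<ClusterEvent.IMemberEvent>(e =>
            _log.Info("Member {0} is now {1}", e.Member.Address, e.Member.Status));
        Receive<ClusterEvent.UnreachableMember>(e =>
            _log.Warning("Member {0} marked unreachable", e.Member.Address));
        Receive<ClusterEvent.ReachableMember>(e =>
            _log.Info("Member {0} is reachable again", e.Member.Address));
    }

    protected override void PreStart() =>
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.IMemberEvent), typeof(ClusterEvent.IReachabilityEvent));

    protected override void PostStop() => _cluster.Unsubscribe(Self);
}
```

It can be started at system startup with `system.ActorOf(Props.Create(() => new ClusterEventLogger()))`.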
What's the load looking like in the K8s cluster when this happens? |
I have 6 in the node pool. CPU averages around 32%; other stats are nominal. Would any specific stats help?
FYI, I have one in this state right this minute.
If there are any K8s limits on the number of open connections in your cluster, that would be interesting data to have too. But it looks like it's not a load issue.
I'm looking into whether K8s applies any CPU or connection limits by default at the moment.
I'm starting to wonder if Azure/AKS#1373 is the underlying issue and we are just seeing network and DNS errors. I've seen a number of DNS "service name unknown" exceptions.
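For what it's worth, one quick way to test the DNS hypothesis from inside an affected pod is a probe along the following lines. This is only a sketch; it assumes the lighthouse hostnames shown in the seed-node configuration later in this issue, and a failure here would be consistent with the "service name unknown" exceptions.

```csharp
using System;
using System.Net;

// Illustrative DNS probe: resolves the lighthouse service names from the
// seed-node list and prints either the resolved addresses or the error.
class DnsProbe
{
    static void Main()
    {
        var hosts = new[]
        {
            "lighthouse-0.lighthouse",
            "lighthouse-1.lighthouse",
            "lighthouse-2.lighthouse"
        };

        foreach (var host in hosts)
        {
            try
            {
                var entry = Dns.GetHostEntry(host);
                Console.WriteLine($"{host} -> {string.Join(", ", entry.AddressList)}");
            }
            catch (Exception ex)
            {
                // A SocketException here mirrors the "service name unknown" errors.
                Console.WriteLine($"{host} -> FAILED: {ex.GetType().Name}: {ex.Message}");
            }
        }
    }
}
```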
@drewjenkel should we mark this as closed then? |
We can close it. I have an active ticket open with Microsoft, and I believe I have this 99% figured out. If it turns out to be an Akka issue, I can reopen this or open a new issue and reference it. I don't want to clog up your issue tracker.
Thanks! Please let us know. |
Environment Details:
Akka.Net Versions:
Akka.Cluster.Tools Version="1.3.17"
Akka.DI.AutoFac Version="1.3.9"
Akka.DI.Core Version="1.3.17"
Akka.Logger.NLog Version="1.3.4"
Autofac.Extensions.DependencyInjection Version="5.0.1"
Petabridge.Cmd.Cluster Version="0.6.3"
Petabridge.Cmd.Remote Version="0.6.3"
Petabridge.Tracing.ApplicationInsights Version="0.1.3"
Other Libraries:
Microsoft.Extensions.Hosting Version="2.1.0"
.NET Core 2.1
Deployed via Docker container to Kubernetes on AKS
PBM embedded in the Docker image
Active Configuration:

akka {
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]
  remote {
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      port = 7770
      hostname = 0.0.0.0
      public-hostname = "lighthouse-0.lighthouse"
      enforce-ip-family = true
      dns-use-ipv6 = false
    }
    log-remote-lifecycle-events = INFO
  }
  cluster {
    seed-nodes = ["akka.tcp://andor@lighthouse-0.lighthouse:7770", "akka.tcp://andor@lighthouse-1.lighthouse:7770", "akka.tcp://andor@lighthouse-2.lighthouse:7770"]
    roles = ["lighthouse"]
    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-referee
      keep-referee {
        address = "akka.tcp://andor@lighthouse-0.lighthouse:7770"
        down-all-if-less-than-nodes = 1
      }
    }
  }
}
petabridge.cmd {
  host = "0.0.0.0"
  port = 9110
  log-palettes-on-startup = on
}
Environment Size: 60-70 pods at a given time.
On occasion, a pod will enter a state where it sees itself as joining and gets hung. I have attached the logs for the affected service, a lighthouse node, and the cluster leader.
After a node gets "hung", we randomly see messages indicating that the gossip it receives was intended for a different node.
It appears that the node is downed by the leader, but because the hung node is still "joining", it doesn't down itself and gets confused. It never leaves the cluster and continues to gossip.
Using PBM against the affected node, I cannot down, join, exit, or do anything else. The only way to recover has been to delete the pod or scale down the StatefulSet.
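One possible mitigation, sketched here under the assumption that a pod restart reliably clears the hung state: a startup watchdog that terminates the process if the node never reaches Up within a deadline, so the StatefulSet replaces the pod automatically instead of someone deleting it by hand. The class name and the deadline are illustrative, not part of the original report.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;

// Illustrative watchdog: if this node has not reached MemberStatus.Up within
// the given deadline, shut the ActorSystem down and exit non-zero so the
// orchestrator (here, the Kubernetes StatefulSet) restarts the pod.
public static class JoinWatchdog
{
    public static void Start(ActorSystem system, TimeSpan deadline)
    {
        var cluster = Cluster.Get(system);
        var joined = new ManualResetEventSlim(false);

        // Fires once this node's member status transitions to Up.
        cluster.RegisterOnMemberUp(() => joined.Set());

        Task.Run(() =>
        {
            if (!joined.Wait(deadline))
            {
                // Still stuck in Joining after the deadline: bail out.
                system.Terminate().Wait(TimeSpan.FromSeconds(30));
                Environment.Exit(1);
            }
        });
    }
}
```

Called as `JoinWatchdog.Start(system, TimeSpan.FromMinutes(5));` right after the ActorSystem is created, this turns the manual pod deletion into an automatic restart.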
akkalogs.zip
I believe this is similar to these issues:
#2584
#3274
#2346