Cluster times out new connections #1150
This is a larger log snippet including info about which cluster node was the source of each line. Timestamp 2021/01/23 20:22:47 is when we begin to manually restart the nodes.
@jason-magnetic-io Thank you for using NATS Streaming, and sorry about your latest troubles. Just to be clear: is the issue solely about the "timed out due to heartbeats" errors, or are you also saying that new connections cannot be made? You mentioned that you did not notice any CPU usage increase that would explain this.

The "timed out on heartbeats" errors could be the result of applications being stopped without getting a chance to call the connection's Close() API. As a refresher: the NATS Streaming server is not a server in the core NATS sense, but a client to the core NATS server. Clients therefore do not TCP-connect directly to the streaming "server". The streaming server knows that clients are "connected" based on heartbeats, so having clients send a close protocol for the server to remove them is required. Without the close, the server relies on heartbeats, and after a configurable number of missed responses it considers the client connection lost.

If new clients cannot connect at the same time, it could really be some networking issue, which would explain both types of failure. Since I see that you are running the server with the monitoring port enabled, one thing we could try, when the server is in this situation of failing clients due to heartbeat timeouts and rejecting new clients, is to capture the stacks of the leader process. That would mean hitting the monitoring endpoint:
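To illustrate that close protocol from the client side, here is a minimal Go sketch using the stan.go client; the cluster ID, client ID, URL and ping values are placeholders chosen for illustration, not taken from this deployment:

```go
package main

import (
	"log"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Cluster ID, client ID and URL below are placeholders for illustration.
	sc, err := stan.Connect(
		"test-cluster", "example-client",
		stan.NatsURL("nats://nats:4222"),
		// Client-side pings: if too many ping responses from the streaming
		// server are missed, the connection is reported as lost.
		stan.Pings(5, 3),
		stan.SetConnectionLostHandler(func(_ stan.Conn, reason error) {
			log.Printf("streaming connection lost: %v", reason)
		}),
	)
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}
	// Closing sends the close protocol so the server removes this client
	// right away instead of waiting for missed heartbeats.
	defer sc.Close()

	// ... publish/subscribe work would go here ...
}
```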
Thanks for the response. To clarify the language, at the time of the issue, there were:
We ruled out it being a network issue. The 6 clients within the same Kubernetes cluster were unaffected. Connections from within the same data centre and from a remote data centre were able to connect and authorise with NATS but could not connect to the NATS Streaming cluster, despite the NATS and NATS Streaming Kubernetes Pods sharing the same VMs.

Regarding CPU and memory, the CPU load for each NATS Streaming instance was between 0.1 and 0.2 vCPU and the memory usage was flat at 28MB per instance.

We have experienced issues in the past with connections not being cleaned up. All the clients take steps to actively disconnect, including intercepting the container shutdown hooks. We have experience of what happens when the channel, subscription, message and memory limits are exceeded; that was not the behaviour we observed in this case.
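For reference, this is a minimal Go sketch of the kind of shutdown hook described above, using the stan.go client; the cluster ID, client ID and URL are placeholders, not the actual values from this deployment:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Cluster ID, client ID and URL are placeholders for illustration only.
	sc, err := stan.Connect("test-cluster", "worker-1",
		stan.NatsURL("nats://nats:4222"))
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}

	// Kubernetes sends SIGTERM when a Pod is stopped; trap it and close the
	// streaming connection so the server removes the client immediately
	// instead of waiting for heartbeat timeouts.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	if err := sc.Close(); err != nil {
		log.Printf("error closing streaming connection: %v", err)
	}
}
```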
Were those 2 groups of clients located in the same "region" or connecting to the same server?
Well, it depends. Say you have a core NATS cluster of 3 servers called N1, N2 and N3. Now say that you have a streaming cluster consisting of S1, S2 and S3, which connect to the NATS cluster. They can have connected/reconnected to any NATS server in the cluster, that is, it is not guaranteed that S1 is connected to N1, etc. When a client connects, it connects to NATS, so it is possible that it connects to, say, N3, while the streaming server leader is S1, which for instance is connected to N1. Your client will connect and be authorized fine when connecting to N3, but the streaming connection request has to reach S1 (and be replicated to S2 and S3). So it is not out of the question that you still have network issues even if a client can TCP-connect successfully to its closest core NATS server.
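A small Go sketch of that topology from the client's point of view; the server URLs and IDs are placeholders, and the longer ConnectWait is only an assumption to illustrate giving the connect request time to reach the leader:

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Server URLs and IDs are placeholders; any core NATS server will do,
	// since the streaming connect request is routed to the leader internally.
	sc, err := stan.Connect(
		"test-cluster", "example-client",
		stan.NatsURL("nats://n1:4222,nats://n2:4222,nats://n3:4222"),
		// Allow extra time for the connect request to reach the leader (S1)
		// and be replicated to the followers (S2 and S3); 10s is an assumed value.
		stan.ConnectWait(10*time.Second),
	)
	if err != nil {
		log.Fatalf("streaming connect failed: %v", err)
	}
	defer sc.Close()
}
```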
See above. I am not saying that there is a network issue, but we can't rule that out. It could also be that somehow the server is locking up or deadlocked, but then I would expect leadership to be lost, etc. So next time the problem occurs, try capturing the stacks as described above. Thanks!
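For illustration, capturing those stacks could look like the following Go sketch, assuming the leader's monitoring port is the default 8222 and that the /stacksz endpoint is exposed; the host name is a placeholder:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Host name is a placeholder; assumes the leader's monitoring port is the
	// default 8222 and that the /stacksz endpoint is available.
	resp, err := http.Get("http://stan-leader:8222/stacksz")
	if err != nil {
		log.Fatalf("failed to fetch stacks: %v", err)
	}
	defer resp.Body.Close()

	// Save the goroutine stack dump so it can be attached to the issue.
	out, err := os.Create("stacks.txt")
	if err != nil {
		log.Fatalf("failed to create output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("failed to write stacks: %v", err)
	}
}
```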
First, I want to say that we've been using NATS streaming for nearly 2 years without issues.
Last week we observed timeouts when new clients tried to connect to stan. Approximately 3 out of 4 connection attempts were unsuccessful.
The existing clients remained connected and were unaffected.
The only way we could resolve the issue was by restarting the cluster with empty file stores and reconnecting all the clients.
Versions: nats-streaming-server version 0.19.0; nats-server: v2.1.9; nats-account-server: 0.8.4
The cluster is running on Kubernetes on Google Cloud GKE using this configuration:
There were no unusual spikes in memory or CPU usage.
These are the logs for stan-0, stan-1 and stan-2 leading up to the issue: