During rolling update in Kubernetes data in KV store is lost in some nodes #4351
Comments
When doing restarts, do you wait for /healthz to return ok before moving to the next server?
It would also be good to make sure we can reproduce from the top of main, the 2.9.21 release candidate.
The readiness check waits for /healthz to return ok, and before that we don't route any traffic to that NATS server.
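For illustration, the same gate can also be scripted outside of Kubernetes. The following minimal Go sketch (not from the thread) polls the monitoring endpoint before moving on to the next server; the URL, the default monitoring port 8222, and the waitHealthy helper name are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls the NATS monitoring endpoint until /healthz returns 200,
// mirroring what the Kubernetes readiness probe gates on.
// monitorURL is an assumption, e.g. "http://nats-0.nats:8222".
func waitHealthy(monitorURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(monitorURL + "/healthz")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("server at %s not healthy within %s", monitorURL, timeout)
}

func main() {
	// Wait for one server to report healthy before restarting the next one.
	if err := waitHealthy("http://nats-0.nats:8222", 5*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("server healthy, safe to move on to the next one")
}
```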
For the direct GETs, in which all stream peers can participate: yes, once they think they are caught up they can participate.
One thing that you could also try is to bump the initialDelaySeconds.
Will try the change to the initialDelaySeconds for the next restart next week; we currently want to restart as few times as possible, as only the production system currently has the load where we can monitor this with good data. We are trying to set up another test instance where we can reproduce this behavior, but until it is finished we have to wait.
Just tested the initialDelaySeconds change.
I've tried to reproduce the behavior with a small Go client, but didn't succeed. We can reproduce the behavior with our full Java application in a dev environment, but couldn't yet isolate the root cause. We have now seen multiple times that the NATS cluster synchronized itself after our application was restarted. Is there anything (open connections, special commands) which can prevent the cluster from synchronizing, especially in the Java library?
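The small Go client mentioned above is not included in the thread. A minimal sketch of that kind of reproduction attempt could look like the following; the server URL, the "test-bucket" name, and the key count are assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// The URL and bucket name are assumptions for illustration.
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Create (or reuse) a replicated KV bucket.
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
		Bucket:   "test-bucket",
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Write a batch of entries before the rolling restart...
	for i := 0; i < 100; i++ {
		if _, err := kv.Put(fmt.Sprintf("key-%d", i), []byte("value")); err != nil {
			log.Fatal(err)
		}
	}

	// ...and list them afterwards; the key count should match what was written.
	keys, err := kv.Keys()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("bucket contains %d keys\n", len(keys))
}
```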
I've seen that the service in the helm chart is configured so that pods are reachable before they are ready. Could this be the reason for this behavior? Since we didn't create an additional service, our client probably connected to a not-yet-ready NATS pod through this service, which would also render the readiness check ineffective. We'll try to upgrade to the latest chart, but this will take some time as we need to thoroughly test the changes before going live.
Keep us posted; you should not be losing any messages, of course.
We appear to have come across a very similar situation, in which an upgrade to NATS 2.10.9 has resulted in the loss of the complete history of messages on two R3 streams. Initial investigations suggest this is due to the server nodes being killed during recovery by a failing health check, but we have yet to determine a minimal reproduction (minus Nomad and CSI plugins for persistent disk storage in GCP) which demonstrates the deletion of the message block files in this case. Any thoughts on additional avenues to explore, or on whether this sounds like a plausible explanation, would be greatly appreciated.
@mpwalkerdine were you using interest streams?
Hi @wallyqs - no, permanent streams with durable consumers. A couple of potentially interesting things appeared in our logs around this time too:
A quick update for any future readers - we've narrowed the issue down somewhat and it appears to be related to the NATS disks being pulled at inappropriate times by the CSI plugin, leading to corruption and the following log line during the next startup:
@wallyqs FYI it looks like one or more of the fixes in 2.10.10 has resolved our issue - I can no longer reproduce the deletion-on-startup issue, and I see no orphaned stream logs and no propagation of that empty state to other nodes in the cluster 🤞
@mpwalkerdine thanks for the update!
Looks like I spoke too soon - more testing shows that one of the nodes has lost its state on disk, and if it becomes a cluster leader that stream reports 0 messages. The files on disk are indeed empty:

```
-rw-r----- 1 root root 0 Feb 7 09:33 1.blk
-rw------- 1 root root 34 Feb 7 11:53 index.db
```
For me, the issue seems to be solved with the latest version, 2.10.11. We are not able to reproduce it anymore, so it could be closed.
@mpwalkerdine Could you let us know if 2.10.11 fixed it for you too?
I'll retest against 2.10.11 and report back. Our issue seems to be caused by some pretty unreasonable behaviour in our CSI disk management, so it is perhaps not unexpected for NATS to take issue with that. I did expect the faulty node to recover from the rest of the cluster, though.
@Jarema the behaviour has changed, although there still seems to be a discrepancy between nodes when they come back up.
Healthy node as leader:
Faulty node as leader:
So they're no longer losing all their state, just some of it. Pertinent logs appear to be:
Another quick update - we've been running 2.10.11 in production for a week or so now, and although it appears the state is no longer lost completely, it does still break the ability to publish messages when they're sent to a node with bad state. Our workaround for this is to scale down to a healthy node and up again, and to recreate all the consumers (we've got separate idempotency protection).
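One way to see this kind of discrepancy from a client is to compare the stream info reported while different nodes hold leadership, as done above with the healthy and the faulty node as leader. The sketch below is illustrative only; the server URL and the "ORDERS" stream name are assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// URL and stream name are assumptions for illustration.
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// In a clustered setup the stream leader typically answers StreamInfo, so
	// comparing this output while different nodes hold leadership shows
	// whether the replicas disagree about the message count.
	info, err := js.StreamInfo("ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	leader := "unknown"
	if info.Cluster != nil {
		leader = info.Cluster.Leader
	}
	fmt.Printf("leader=%s msgs=%d first_seq=%d last_seq=%d\n",
		leader, info.State.Msgs, info.State.FirstSeq, info.State.LastSeq)
}
```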
Fixed via #5821. Thanks a lot @yoadey for the reproduction at https://github.com/yoadey/nats-test, that was very useful.
We are having a problem with our NATS cluster in a Kubernetes cluster, where the entries in KV stores differ between the NATS nodes after server restarts, leaving us with data loss in some cases. As the Kubernetes cluster gets updated every week, this issue has now been seen several times, although we cannot reproduce it reliably.
Our current workaround is that if we don't get the data on the first try, we repeat the same request up to 5 times; due to the server-side request routing we then normally get the wanted data.
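A hedged sketch of that retry workaround, not taken from the original report: the server URL, the "test-bucket" name, and treating an empty key list as a miss are assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

// keysWithRetry lists the keys of a bucket up to maxAttempts times and returns
// the first non-empty result; because the server decides which replica answers,
// a later attempt may be served by a node that still has the data.
func keysWithRetry(js nats.JetStreamContext, bucket string, maxAttempts int) ([]string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		kv, err := js.KeyValue(bucket)
		if err != nil {
			lastErr = err
			continue
		}
		keys, err := kv.Keys()
		if err == nil && len(keys) > 0 {
			return keys, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no keys after %d attempts, last error: %v", maxAttempts, lastErr)
}

func main() {
	nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	keys, err := keysWithRetry(js, "test-bucket", 5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d keys\n", len(keys))
}
```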
Defect
nats-server -DV output:
Versions of nats-server and affected client libraries used:
(We used a custom build of 2.9.20, where we included this specific fix: #4320. The problem also occurred with the official 2.9.19 nats-server container.)
All client libraries are affected.
OS/Container environment:
Kubernetes containerd
Steps or code to reproduce the issue:
Have the following cluster setup:
nats kv history <keyvaluestore> --trace
(the trace option shows the server)
Expected result:
Requesting all entries in the same KV store multiple times should always return the same list.
Actual result:
Sometimes after restarts the result differs between the nodes and stays different.
Additional details
This does not occur after every restart and also seems to depend a little on the node's downtime; for some KVs it seems to heal itself after some time (sometimes minutes, sometimes hours, sometimes days).
Go code for test script:
The following script lists the entries in the KV store multiple times and prints the result.
As the NATS server routes the traffic by itself, we don't need to switch servers.
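The original script from the report is not reproduced above; the following is a minimal sketch of that kind of check. The server URL, the "test-bucket" name, and the number of iterations are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	kv, err := js.KeyValue("test-bucket") // bucket name is an assumption
	if err != nil {
		log.Fatal(err)
	}

	// List the keys repeatedly; a healthy cluster should return the same set
	// every time, regardless of which server answers the request.
	for i := 1; i <= 10; i++ {
		keys, err := kv.Keys()
		if err != nil {
			log.Printf("attempt %d: %v", i, err)
		} else {
			fmt.Printf("attempt %d: %d keys\n", i, len(keys))
		}
		time.Sleep(time.Second)
	}
}
```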