[Jetstream] Stream and consumer went out of sync after rolling restart of NATS servers [v2.10.6, v2.10.7] #4875
Comments
You could try to make sure the leader is correct and scale to R1 and back up to R3.
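For reference, the scale-down/scale-up suggested above can be done with the `nats` CLI; a sketch, where the stream name `ORDERS` is a hypothetical placeholder and the flags follow recent natscli syntax:

```sh
# Scale the stream down to a single replica (keeps only the current leader's state)
nats stream edit ORDERS --replicas 1 -f

# Scale back up to three replicas so the other peers re-sync from the leader
nats stream edit ORDERS --replicas 3 -f
```

Note that this keeps whatever state the current leader holds, so the leader must be the replica with the correct sequence numbers before scaling down.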
Yea, that's a good idea to try to bring the stream and consumer back in sync without losing the state. Since in my case the leader (nats-0) had its sequence number reset while the other two replicas still had the right sequence number, maybe what could have been done is to … Sorry, as the broken stream was in a critical environment, I had to mitigate the issue by dropping and recreating the consumer, so I can't verify other mitigations.

@derekcollison: are you aware of any issue that could cause the stream sequence number to go out of sync in the first place? I do use ephemeral storage (`emptyDir`).
Yes, moving the leader might have helped. We would need to look closely at your setup and upgrade procedure to properly triage.
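For reference, moving the leader can be requested through the `nats` CLI as well; a sketch, again with a hypothetical stream name:

```sh
# Ask the current stream leader to step down; a new leader is then elected
# from the remaining peers (nats-1 or nats-2 in this report), which still
# hold the correct sequence numbers
nats stream cluster step-down ORDERS
```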
I think the impact of this issue is concerning, and it would be great if it could be addressed. I'm happy to provide the full deployment setup for troubleshooting, though I'm not confident this issue can be reproduced, as I feel it's a race condition: something like the leader being elected on a new node before that node was in sync with the cluster. Please let me know, thanks.
Did more testing in a test environment with the same NATS setup, and found that this issue isn't related to the NATS version upgrade: a rolling restart of the NATS StatefulSet could trigger this issue on its own. To reproduce this issue: …
After the above, the stream and its consumer are out of sync: messages published to the stream increase the stream sequence number on the bad leader node only, and the consumer never gets them. @derekcollison I did capture the …
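For anyone triaging a similar situation, the per-replica state can be captured with the `nats` CLI before mitigating; a sketch, where the stream and consumer names are hypothetical placeholders:

```sh
# Stream state as reported by the current leader (first/last sequence, message count)
nats stream info ORDERS --json

# Consumer state (delivered and ack floor sequences)
nats consumer info ORDERS WORKER --json

# Cluster-wide JetStream report, useful to compare state across the replicas
nats server report jetstream
```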
We do not recommend ephemeral storage in general. However, if you decide to use it, do you make sure that healthz returns ok from a restarted / updated server before moving to the next one?
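For context, each NATS server exposes `healthz` on the monitoring port (8222 by default); a minimal readiness loop, assuming the monitoring endpoint is reachable on localhost:

```sh
# Poll healthz until the restarted server reports ok before moving to the next pod
until curl -sf http://localhost:8222/healthz > /dev/null; do
  echo "waiting for server to become healthy..."
  sleep 2
done
echo "server healthy, safe to proceed to the next pod"
```

In a Kubernetes deployment this check would normally be expressed as a readiness probe against the same endpoint rather than run by hand.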
We are using the latest NATS Helm chart (…). So could healthz return ok before the server has fully caught up?
It could, since healthz could possibly return ok before it knows certain aspects are behind. Keep us posted.
This sounds very similar to the problem we have: #4351
Indeed, and nice theory about the headless service. @yoadey: did you test your theory about it?
Just moved all NATS clients to the ClusterIP SVC (well, except NATS itself, which uses the headless SVC for discovery as expected). Unfortunately, the issue is still reproducible: after a rolling restart, one of the nodes (kubernetes-nats-0 again) had its stream sequence numbers reset to zero.
Maybe @wallyqs could share his opinion.
Hi @jzhn, I've tested it and it definitely wasn't the headless service; I've already tested with the latest Helm chart and the latest version and can still reproduce the issue.
@derekcollison @wallyqs: here are the … I'm not sure if the below looks interesting to you: I wonder if stall detection could be a source of the race condition. Logs (stream name is redacted): …
@wallyqs any updates from your side here?
Observed behavior
In a Kubernetes StatefulSet deployment of a NATS cluster, I have a simple 3-replica interest-based stream with a single 3-replica consumer.

After a rolling update deployment that upgraded the NATS cluster from v2.10.6 to v2.10.7, the stream and consumer went into an unrecoverable bad state:

- `nats-0`, which was the stream leader at the time, got its stream seq numbers reset to 0, while `nats-1` and `nats-2` kept the previous stream seq numbers (~23K).
- On new publishes, only `nats-0` sees seq numbers increasing; `nats-1` and `nats-2` have seq numbers stuck still.

I've made several attempts to fix this:

- Rolling restart of the NATS servers
- Rolling restart of my consumer application
- Delete and recreate the consumer
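The last attempt (deleting and recreating the consumer, which is what eventually mitigated the issue) can be sketched with the `nats` CLI; the stream/consumer names and the consumer options below are hypothetical, and note that deleting the consumer discards its delivery and ack state:

```sh
# Drop the out-of-sync consumer (this discards its ack/delivery state)
nats consumer rm ORDERS WORKER -f

# Recreate it with the same configuration, e.g. a 3-replica pull consumer
nats consumer add ORDERS WORKER --pull --deliver all --ack explicit --replicas 3 --defaults
```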
Expected behavior
Throughout the rolling update and version upgrade, the stream and its consumer should stay in sync across all replicas.
Server and client version
Server: upgraded from `v2.10.6` to `v2.10.7`
Client: jnats (java) `2.17.1`
Host environment
Kubernetes StatefulSet with ephemeral storage (`emptyDir`)

Steps to reproduce
No response