Cluster change resulting in thousands of messages being replayed starting at seqId=1 #1051
Comments
Thank you for the report, I will look into that.
@kozlovic do you know what could cause that, and what I could possibly do to mitigate this? For example, is it possible to connect a subscriber to a queue and make sure that existing/old messages are acked and removed from the file store? It appears that old messages are being stored until capacity is reached.
I did not have a chance to investigate yet.
No, it is not. As explained here, in NATS Streaming, messages are stored regardless of interest and are removed based on limits, not when messages are consumed. You could recreate your subscriptions with a more recent start position (either a sequence, if you have an idea of what the last sequence was, or based on time): https://docs.nats.io/developing-with-nats-streaming/receiving
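For reference, a minimal sketch of what "recreate your subscriptions with a more recent start position" could look like with the node-nats-streaming client; the cluster ID, client ID, subject, and the sequence/offset values are placeholders, not taken from this deployment:

```ts
import { connect, Message } from 'node-nats-streaming';

// Placeholder identifiers; substitute your own cluster ID, client ID, and subject.
const sc = connect('my-cluster', 'replay-limited-client');

sc.on('connect', () => {
  const opts = sc.subscriptionOptions();

  // Resume from a known sequence instead of replaying everything...
  opts.setStartAtSequence(123456);
  // ...or, alternatively, start from a time offset (e.g. only the last hour):
  // opts.setStartAtTimeDelta(60 * 60 * 1000);

  opts.setManualAckMode(true);
  opts.setAckWait(30 * 1000);

  const sub = sc.subscribe('orders', opts);
  sub.on('message', (msg: Message) => {
    // process msg.getData(), then acknowledge
    msg.ack();
  });
});
```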
I am sorry but I am unable to reproduce or explain what has happened. I am just wondering, regarding the "delivery at sequence 1", if it would be possible that your applications got their subscriptions recreated with "deliver all available" or sequence 1 following a connection lost notification, for instance? In other words, could it be that your apps have recreated their subscriptions starting from scratch? The reason I ask is that even in the weird situation where nats-streaming-0 became leader with a newer term than the other ones, why would the state of the existing subscriptions be lost? And if they were, it would not have been able to deliver at all, since the subscriptions would not be known to that server. Is there any specific cluster configuration that you would want to share, so I could try with it and see if that helps reproduce the issue?
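To illustrate the client-side failure mode being asked about here, a hypothetical re-subscription pattern (not the application's actual code) that would produce exactly this symptom with the node-nats-streaming client: if a connection-lost handler reconnects and re-subscribes non-durably with "deliver all available", the server treats it as a brand-new subscription and redelivers from sequence 1. The ping option values and names below are assumptions for illustration.

```ts
import { connect, Stan } from 'node-nats-streaming';

// Hypothetical illustration only: re-creating a non-durable subscription with
// "deliver all available" after a lost connection restarts delivery at sequence 1.
function subscribeFromScratch(sc: Stan, subject: string): void {
  const opts = sc.subscriptionOptions();
  opts.setDeliverAllAvailable(); // brand-new subscription => replay from sequence 1
  sc.subscribe(subject, `${subject}-workers`, opts);
}

function start(): void {
  // stanPingInterval / stanMaxPingOut control when the client gives up on the
  // server and emits 'connection_lost'.
  const sc = connect('my-cluster', 'worker-1', {
    url: 'nats://localhost:4222',
    stanPingInterval: 5000,
    stanMaxPingOut: 3,
  });

  sc.on('connect', () => subscribeFromScratch(sc, 'orders'));

  sc.on('connection_lost', () => {
    // Blindly starting over here, instead of resuming a durable subscription,
    // is the kind of client-side behavior being asked about above.
    start();
  });
}

start();
```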
const nats = await natsConfig.connection.get();
const opts = nats.subscriptionOptions();
opts.setDeliverAllAvailable();
opts.setMaxInFlight(maxInFlight);
opts.setAckWait(ackTimeout);
opts.setManualAckMode(true);
opts.setDurableName(durableName);
logger.info(
  { durableName, subject, ackTimeout, maxInFlight },
  `Subscribing to durable queue ${subject}`,
);
const durableSub: STAN.Subscription & { ackInbox?: string } = nats.subscribe(
  subject,
  `${subject}-workers`,
  opts,
);

We've had subscribers join and leave with these options before and they don't start at sequence ID 1; they start with the un-acked messages for the durable queue group. I see this in the docs, so maybe I shouldn't be setting a start position at all?
That is fine. The start position for a durable subscription is used only when the durable is created for the first time. I was asking about the streaming cluster configuration: are there any non-basic configuration options that you may have set?
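To make the durable semantics described above concrete, a small sketch (placeholder cluster, client, subject, and durable names; same node-nats-streaming client as in the earlier snippet):

```ts
import { connect } from 'node-nats-streaming';

const sc = connect('my-cluster', 'durable-demo'); // placeholder IDs

sc.on('connect', () => {
  const opts = sc.subscriptionOptions();
  opts.setDeliverAllAvailable();     // honoured only when the durable is first created
  opts.setDurableName('my-durable');
  opts.setManualAckMode(true);

  const sub = sc.subscribe('orders', 'orders-workers', opts);

  // sub.close()       keeps the durable: a later subscribe with the same durable
  //                   name resumes from the last acknowledged message.
  // sub.unsubscribe() removes the durable (for a queue group, once the last member
  //                   unsubscribes): the next subscribe counts as "first creation"
  //                   again and the start position is honoured once more.
});
```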
Config file is:
Anything stand out as odd? It sure seems like the start position/index or ack state of some message subjects (not all) was lost in the failover.
Still no luck finding what the problem is. Could you please make sure that there is no chance that you have another cluster with the same cluster-id that, for some reason, would have suddenly been connected to the prod cluster?
No other clusters or any interesting topologies; the cluster is network isolated using Kubernetes network policies such that the 3 nodes can only communicate with each other and with subscriber pods that have the correct label.
@kozlovic could a corrupt state in the file store be the cause? It looks like a network blip caused the leader to fail over from nats-streaming-0 back to nats-streaming-1, and more old messages (8 to 10 months old) were replayed. A mitigation we implemented to ack and ignore old messages prevented them from being replayed, but this is still extremely concerning. With high reliability, a failover in our NATS Streaming cluster causes the durable queue groups' positions to reset back to message position 1.
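A rough sketch of the kind of ack-and-ignore mitigation mentioned above; the cutoff, handler name, and threshold value are placeholders, and it assumes the stan.js Message timestamp accessor:

```ts
import { Message } from 'node-nats-streaming';

// Drop anything older than the cutoff: acknowledge it so the server stops
// redelivering it, but skip the actual processing.
const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // placeholder cutoff: 7 days

function handleMessage(msg: Message): void {
  const ageMs = Date.now() - msg.getTimestamp().getTime();
  if (ageMs > MAX_AGE_MS) {
    msg.ack(); // old replayed message: ack and ignore
    return;
  }
  // ...normal processing for fresh messages...
  msg.ack();
}
```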
@AaronFriel I am very sorry that I could not find the reason for this issue. It seems strange to me that if a node incorrectly becomes leader (as described in the first part of the issue), it would not have the state of old subscriptions. To become leader means that it has a higher term and stored log index.
You should then be able to find in the trace if this subscription was created, etc., before that happened. Alternatively, we could get on a Google Hangouts or Zoom meeting to try to debug this a bit more (without you having to send the logs, etc.).
@kozlovic We haven't changed our subscriptions recently; that's all version controlled and I can see that nothing has changed there in many weeks. Happy to hop on a video chat, can you send me an email at aaron@twochairs.com?
@AaronFriel have you managed to resolve your issue? We are currently experiencing a very similar issue, receiving messages acknowledged days ago. We recently correlated this with the server exiting and logs like:
@piotrlipiarz-ef We have yet to diagnose the issue.
We have the same issue. When the leader changed, the subscription was lost on the client side. The client needed to be restarted to resume the subscription.
It appears a network partition or extremely high latency resulted in the leader failing to reach a follower. (I think that the CPU load on a server was near 100% and nats-streaming was failing to get time slices, which resulted in a degradation similar to a network partition.)
The above messages repeated for about 45 seconds until the pattern changed, with a message logged from nats-streaming-0:
Then, nats-streaming-0 tried to start a leadership election, and the partition began healing. nats-streaming-1 and nats-streaming-2 both rejected the vote request and stated that the leader is currently nats-streaming-1:

Then something strange occurs: nats-streaming-0 insists it has a newer term than nats-streaming-1, and ends up taking over!
At this point, nats-streaming-0 has taken over and begins delivering messages to clients starting with sequence ID 1!

Context:
docker.io/nats-streaming:0.17.0
docker.io/bitnami/nats:2.1.6
Expected behavior

Even in the case of a network partition, a member rejoining the cluster should:

Why did nats-streaming-0 start replaying messages from 2019?

Actual behavior
A network partition resulted in an isolated node rejoining and: