
Cluster change resulting in thousands of messages being replayed starting at seqId=1 #1051

Open
AaronFriel opened this issue Jun 1, 2020 · 15 comments


@AaronFriel

AaronFriel commented Jun 1, 2020

It appears a network partition or extremely high latency resulted in the leader failing to reach a follower. (I think that the CPU load on a server was near 100% and nats-streaming was failing to get time slices, which resulted in a degradation similar to a network partition.)

nats-streaming-1: [1] 2020/05/31 19:02:15.974370 [WRN] STREAM: raft: Failed to contact nats-streaming-0 in 5.000272385s
nats-streaming-1: [1] 2020/05/31 19:04:07.277293 [WRN] STREAM: raft: Failed to contact nats-streaming-0 in 5.000305251s
nats-streaming-1: [1] 2020/05/31 19:04:12.030348 [WRN] STREAM: raft: Failed to contact nats-streaming-0 in 9.754040241s
nats-streaming-1: [1] 2020/05/31 19:04:12.626648 [INF] STREAM: raft: aborting pipeline replication to peer {Voter nats-streaming-0 namespace.nats-streaming-0.namespace}
nats-streaming-1: [1] 2020/05/31 19:04:15.104260 [ERR] STREAM: raft: Failed to AppendEntries to {Voter nats-streaming-0 namespace.nats-streaming-0.namespace}: nats: timeout
nats-streaming-1: [1] 2020/05/31 19:04:17.752662 [ERR] STREAM: raft: Failed to heartbeat to namespace.nats-streaming-0.namespace: nats: timeout

The above messages repeated for about 45 seconds until the pattern changed with
a message logged from nats-streaming-0:

nats-streaming-0: [ERR] STREAM: raft-net: Failed to flush response: write to closed conn

Then nats-streaming-0 tried to start a leadership election, and the partition began healing. Nodes 1 and 2 both rejected the vote request, stating that the leader is currently nats-streaming-1:

2020-05-31T19:05:08.890568417Z nats-streaming-0 [1] 2020/05/31 19:05:08.890506 [DBG] STREAM: raft-net: namespace.nats-streaming-0.namespace accepted connection from: namespace.nats-streaming-1.namespace
2020-05-31T19:05:08.890668504Z nats-streaming-2 [1] 2020/05/31 19:05:08.890481 [WRN] STREAM: raft: Rejecting vote request from namespace.nats-streaming-0.namespace since we have a leader: namespace.nats-streaming-1.namespace
2020-05-31T19:05:08.928478050Z nats-streaming-1 [1] 2020/05/31 19:05:08.928281 [WRN] STREAM: raft: Rejecting vote request from namespace.nats-streaming-0.namespace since we have a leader: namespace.nats-streaming-1.namespace
2020-05-31T19:05:10.136827516Z nats-streaming-0 [1] 2020/05/31 19:05:10.136465 [DBG] STREAM: raft-net: namespace.nats-streaming-0.namespace accepted connection from: namespace.nats-streaming-1.namespace

Then something strange occurs: nats-streaming-0 insists it has a newer term than nats-streaming-1 and ends up taking over!

2020-05-31T19:05:10.137437331Z nats-streaming-1 [1] 2020/05/31 19:05:10.137275 [ERR] STREAM: raft: peer {Voter nats-streaming-0 namespace.nats-streaming-0.namespace} has newer term, stopping replication
2020-05-31T19:05:10.137492052Z nats-streaming-1 [1] 2020/05/31 19:05:10.137350 [INF] STREAM: raft: Node at namespace.nats-streaming-1.namespace [Follower] entering Follower state (Leader: "")
2020-05-31T19:05:10.137499170Z nats-streaming-1 [1] 2020/05/31 19:05:10.137362 [INF] STREAM: server lost leadership, performing leader stepdown actions
2020-05-31T19:05:10.137704041Z nats-streaming-1 [1] 2020/05/31 19:05:10.137635 [INF] STREAM: raft: aborting pipeline replication to peer {Voter nats-streaming-2 namespace.nats-streaming-2.namespace}
2020-05-31T19:05:10.138956578Z nats-streaming-1 [1] 2020/05/31 19:05:10.138856 [INF] STREAM: finished leader stepdown actions
2020-05-31T19:05:15.351767734Z nats-streaming-0 [1] 2020/05/31 19:05:15.351592 [WRN] STREAM: raft: Election timeout reached, restarting election
2020-05-31T19:05:15.351924160Z nats-streaming-0 [1] 2020/05/31 19:05:15.351839 [INF] STREAM: raft: Node at namespace.nats-streaming-0.namespace [Candidate] entering Candidate state in term 276802
2020-05-31T19:05:15.359643816Z nats-streaming-1 [1] 2020/05/31 19:05:15.358527 [DBG] STREAM: raft: lost leadership because received a requestvote with newer term
2020-05-31T19:05:15.373592387Z nats-streaming-0 [1] 2020/05/31 19:05:15.372560 [DBG] STREAM: raft: Votes needed: 2
2020-05-31T19:05:15.373634865Z nats-streaming-0 [1] 2020/05/31 19:05:15.372594 [DBG] STREAM: raft: Vote granted from nats-streaming-1 in term 276802. Tally: 1
2020-05-31T19:05:15.373642009Z nats-streaming-0 [1] 2020/05/31 19:05:15.372603 [DBG] STREAM: raft: Vote granted from nats-streaming-0 in term 276802. Tally: 2
2020-05-31T19:05:15.373647927Z nats-streaming-0 [1] 2020/05/31 19:05:15.372651 [INF] STREAM: raft: Election won. Tally: 2
2020-05-31T19:05:15.373653817Z nats-streaming-0 [1] 2020/05/31 19:05:15.372673 [INF] STREAM: raft: Node at namespace.nats-streaming-0.namespace [Leader] entering Leader state
2020-05-31T19:05:15.373660086Z nats-streaming-0 [1] 2020/05/31 19:05:15.372786 [INF] STREAM: raft: Added peer nats-streaming-1, starting replication
2020-05-31T19:05:15.374778622Z nats-streaming-0 [1] 2020/05/31 19:05:15.374665 [INF] STREAM: raft: Added peer nats-streaming-2, starting replication
2020-05-31T19:05:15.376022408Z nats-streaming-2 [1] 2020/05/31 19:05:15.375459 [WRN] STREAM: raft: Rejecting vote request from namespace.nats-streaming-0.namespace since we have a leader: namespace.nats-streaming-1.namespace
2020-05-31T19:05:15.423738276Z nats-streaming-0 [1] 2020/05/31 19:05:15.423530 [INF] STREAM: server became leader, performing leader promotion actions
2020-05-31T19:05:15.423893835Z nats-streaming-0 [1] 2020/05/31 19:05:15.423819 [INF] STREAM: raft: pipelining replication to peer {Voter nats-streaming-1 namespace.nats-streaming-1.namespace}
2020-05-31T19:05:15.424079954Z nats-streaming-0 [1] 2020/05/31 19:05:15.424006 [INF] STREAM: raft: pipelining replication to peer {Voter nats-streaming-2 namespace.nats-streaming-2.namespace}
2020-05-31T19:05:15.442711747Z nats-streaming-0 [1] 2020/05/31 19:05:15.442530 [INF] STREAM: finished leader promotion actions
2020-05-31T19:05:15.442760165Z nats-streaming-0 [1] 2020/05/31 19:05:15.442559 [INF] STREAM: Streaming Server is ready

At this point, nats-streaming-0 has taken over and begins delivering messages to clients starting with sequence id 1!

2020-05-31T19:05:15.453934238Z nats-streaming-0 [1] 2020/05/31 19:05:15.453720 [TRC] STREAM: [Client:ac43bf10322f-[some-uuid]] Delivering msg to subid=1083, subject=[redacted], seq=1
2020-05-31T19:05:15.454144601Z nats-streaming-0 [1] 2020/05/31 19:05:15.454019 [TRC] STREAM: [Client:6d8ff6f40b18-[some-uuid]] Delivering msg to subid=1082, subject=[redacted], seq=2
2020-05-31T19:05:15.454165273Z nats-streaming-0 [1] 2020/05/31 19:05:15.454081 [TRC] STREAM: [Client:a3b7a1f1e867-[some-uuid]] Delivering msg to subid=1081, subject=[redacted], seq=3

Context:

  • NATS Streaming running via docker.io/nats-streaming:0.17.0
  • NATS running via docker.io/bitnami/nats:2.1.6
  • NATS Streaming is clustered with 3 nodes
  • Subscribers with redelivered messages were in a queue group

Expected behavior

Even in the case of a network partition, a member rejoining the cluster should:

  • Not result in the leader changing
  • Not result in previously acked messages being replayed

Why did nats-streaming-0 start replaying messages from 2019?

Actual behavior

A network partition resulted in an isolated node rejoining and:

  • Forcing the leader to give up its own leadership
  • Apparently losing the current sequence position, so that very old messages were replayed
@kozlovic
Member

kozlovic commented Jun 1, 2020

Thank you for the report, I will look into that.

@AaronFriel
Author

@kozlovic do you know what could cause that, and what I could possibly do to mitigate this?

For example, is it possible to connect a subscriber to a queue and make sure that existing/old messages are acked and removed from the file store? It appears that old messages are being stored until capacity is reached.

@kozlovic
Member

kozlovic commented Jun 1, 2020

do you know what could cause that, and what I could possibly do to mitigate this?

I have not had a chance to investigate yet.

For example, is it possible to connect a subscriber to a queue and make sure that existing/old messages are acked and removed from the file store? It appears that old messages are being stored until capacity is reached.

No, it is not. As explained here, in NATS Streaming messages are stored regardless of interest and are removed based on limits, not when they are consumed.

You could recreate your subscriptions with a more recent start position (either by sequence, if you have an idea of what the last sequence was, or by time): https://docs.nats.io/developing-with-nats-streaming/receiving.
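
For illustration only, something like this (an untested sketch with the Node streaming client; `sc` stands for an existing streaming connection, and the subject/durable names are placeholders). Keep in mind that a start position is only honored when the durable is created for the first time, so this applies to a new (or unsubscribed-and-recreated) durable:

  // Untested sketch (Node streaming client); names are placeholders.
  const opts = sc.subscriptionOptions();
  opts.setStartAtSequence(123456);              // resume from a known sequence...
  // opts.setStartAtTimeDelta(60 * 60 * 1000);  // ...or from the last hour instead
  opts.setManualAckMode(true);
  opts.setDurableName('my-durable');
  const sub = sc.subscribe('my.subject', 'my-workers', opts);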

@kozlovic
Member

kozlovic commented Jun 2, 2020

I am sorry, but I am unable to reproduce or explain what has happened. Regarding the "delivery at sequence 1", I am wondering whether it is possible that your applications had their subscriptions recreated with "deliver all available" or sequence 1, following a connection-lost notification for instance. In other words, could your apps have recreated their subscriptions starting from scratch?

The reason I ask is that even in the weird situation where nats-streaming-0 became leader with a newer term than the other ones, why would the state of the existing subscriptions be lost? And if they were, it would not have been able to deliver at all, since the subscriptions would not be known to that server.

Any specific cluster configuration that you would want to share so I could try with those to see if that helps reproduce the issue?

@AaronFriel
Author

  const nats = await natsConfig.connection.get();

  const opts = nats.subscriptionOptions();
  opts.setDeliverAllAvailable();
  opts.setMaxInFlight(maxInFlight);
  opts.setAckWait(ackTimeout);
  opts.setManualAckMode(true);
  opts.setDurableName(durableName);

  logger.info(
    { durableName, subject, ackTimeout, maxInFlight },
    `Subscribing to durable queue ${subject}`,
  );

  const durableSub: STAN.Subscription & { ackInbox?: string } = nats.subscribe(subject, `${subject}-workers`, opts);

We've had subscribers join and leave with these options before, and they don't start with sequence ID 1; they start with the un-acked messages for the durable queue group.

Is setDeliverAllAvailable causing it to start with sequence ID 1 when NATS fails over to another cluster leader? If so, is there another option that should be set to only send un-acked messages for the durable queue group?

I see this in the docs, so maybe I shouldn't be sending a start position at all?

If an application wishes to resume message consumption from where it previously stopped, it needs to create a durable subscription. It does so by providing a durable name, which is combined with the client ID provided when the client created its connection. The server then maintains the state for this subscription even after the client connection is closed.

Note: The starting position given by the client when restarting a durable subscription is ignored.
When the application wants to stop receiving messages on a durable subscription, it should close - but not unsubscribe - this subscription. If a given client library does not have the option to close a subscription, the application should close the connection instead.
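
If I am reading that right, the distinction would look roughly like this (untested sketch, using the durableSub from the snippet above):

  // close() stops delivery but the server keeps the durable's position, so a later
  // subscribe with the same durable name resumes from the un-acked messages.
  durableSub.close();
  // unsubscribe() removes the durable's server-side state; a re-subscribe would then
  // honor the start position again (e.g. deliver-all-available).
  // durableSub.unsubscribe();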

@kozlovic
Member

kozlovic commented Jun 2, 2020

That is fine. The start position for a durable subscription is used only when the durable is created for the first time.

I was asking about the streaming cluster configuration. Are there any non-basic configuration options that you may have set?

@AaronFriel
Author


-m ${monitoringPort}
-ns ${natsServerUrl}
--user ${natsClientUser}
--pass ${pass}
-mc 0
-msu 1000
-mm 1000000
-mb 900mb
-ma 0
-hbi 5s
-hbt 10s
-hbf 5
--store file
--dir ${dataVolumePath}/${name}/$(${POD_NAME_VAR})/data
--file_compact_enabled
--file_compact_frag 50
--file_compact_interval 300
--file_compact_min_size 1048576
--file_buffer_size 2097152
--file_crc
--file_crc_poly 3988292384
--file_sync
--file_slice_max_msgs 0
--file_slice_max_bytes 67108931
--file_slice_max_age 0
--file_fds_limit 0
--file_parallel_recovery 1
-c ${configFileName}

--clustered
--cluster_id ${config.clusterId}
--cluster_node_id $(${POD_NAME_VAR})
--cluster_peers ${statefulSetPodNames}
--cluster_log_path ${dataVolumePath}/${name}/$(${POD_NAME_VAR})/raft
--cluster_raft_logging
-SDV

Config file is:

streaming {
  cluster: {
    raft_heartbeat_timeout: "10s"
    raft_election_timeout: "10s"
    raft_lease_timeout: "5s"
    raft_commit_timeout: "250ms"
  }
}

Anything stand out as odd?

It sure seems like the start position/index or ack state of some message subjects (not all) was lost in the failover.

@kozlovic
Member

Still no luck finding what the problem is. Could you please make sure there is no chance that you have another cluster with the same cluster ID that for some reason would have suddenly been connected to the prod cluster?
I have seen users running Streaming clusters in NATS clusters connected with gateways, which is bad if they share the same account: since the nodes are named the same in both clusters, a node "a" in one cluster is indistinguishable from "a" in the other, and that causes havoc.

@AaronFriel
Author

AaronFriel commented Jun 12, 2020

No other clusters or any interesting topologies. The cluster is network-isolated using Kubernetes network policies such that the 3 nodes can only communicate with each other and with subscriber pods that have the correct label.

@AaronFriel
Author

AaronFriel commented Jun 16, 2020

@kozlovic could corrupt state in the file store be the cause? It looks like a network blip caused the leader to fail back from nats-streaming-0 to nats-streaming-1, and more old messages (8 to 10 months old) were replayed.

A mitigation we implemented to ack and ignore old messages prevented them from being reprocessed, but this is still extremely concerning: a failover in our NATS Streaming cluster reliably causes the durable queue groups' position to reset back to message position 1.
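
The mitigation is roughly the following (simplified sketch; CUTOFF_MS and handleMessage are placeholders, and durableSub is the subscription from the snippet earlier in this thread):

  // Ack and skip anything older than a cutoff so replays of ancient messages
  // are not handed to our workers again.
  const CUTOFF_MS = 24 * 60 * 60 * 1000; // placeholder: nothing valid is older than a day
  durableSub.on('message', (msg: STAN.Message) => {
    const ageMs = Date.now() - msg.getTimestamp().getTime();
    if (ageMs > CUTOFF_MS) {
      msg.ack(); // acknowledge so it is not redelivered...
      return;    // ...but do not process it
    }
    handleMessage(msg) // placeholder for our normal processing path
      .then(() => msg.ack())
      .catch((err) => logger.error({ err }, 'handler failed, message will be redelivered'));
  });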

@kozlovic
Member

@AaronFriel I am very sorry that I could not find the reason for this issue. It seems strange to me that if a node incorrectly becomes leader (as described in the first part of the issue), it would not have the state of old subscriptions; to become leader, it must have a higher term and store log index.
If you had non-durable subscriptions, I could see the following: that server gets elected, and clients that are not configured to tolerate the time without a leader fail with a connection-lost situation (the client sends PINGs to a leader that does not answer, and so the client considers the connection lost). In that case, if the application recreated its subscriptions while reusing the same client ID, the server would replace the old connection (closing it and therefore removing its subscriptions) and accept the new one; with "deliver all available", that could explain it. But you are using durable queue subscriptions, for which what I just described would not happen (and likely not even with non-durable queue subscriptions, as long as there is more than 1 member).
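For what it is worth, the client-side knobs involved would look roughly like this (untested sketch with the Node streaming client; option names as I recall them, values and clusterId/clientId/natsUrl purely illustrative, so please double-check against the client docs):

  // Make the client more tolerant of a leaderless window and surface a lost connection.
  const stan = require('node-nats-streaming');
  const sc = stan.connect(clusterId, clientId, {
    url: natsUrl,
    stanPingInterval: 5000, // interval between client PINGs to the streaming server (ms)
    stanMaxPingOut: 24,     // missed PONGs tolerated before declaring the connection lost
  });
  sc.on('connection_lost', (err) => {
    // Non-durable subscriptions are removed once the connection is considered lost;
    // blindly recreating them with deliver-all-available would restart from sequence 1,
    // which is the scenario described above.
    console.error('streaming connection lost', err);
  });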
I see that you are using streaming debug/trace. You pointed to this:

2020-05-31T19:05:15.453934238Z nats-streaming-0 [1] 2020/05/31 19:05:15.453720 [TRC] STREAM: [Client:ac43bf10322f-[some-uuid]] Delivering msg to subid=1083, subject=[redacted], seq=1
2020-05-31T19:05:15.454144601Z nats-streaming-0 [1] 2020/05/31 19:05:15.454019 [TRC] STREAM: [Client:6d8ff6f40b18-[some-uuid]] Delivering msg to subid=1082, subject=[redacted], seq=2
2020-05-31T19:05:15.454165273Z nats-streaming-0 [1] 2020/05/31 19:05:15.454081 [TRC] STREAM: [Client:a3b7a1f1e867-[some-uuid]] Delivering msg to subid=1081, subject=[redacted], seq=3

You should then be able to find in the trace whether this subscription was created, etc., before that happened.
I understand that you may have sensitive information, but if you were willing to send those to me through a secure channel (not attached here) and point me to the actual destination having the issue, maybe I could get a clue on when/where the subscriptions on this destination got created/recreated.
The point is I am running out of ideas on what could have caused this to happen.

Alternatively, we could get on a Google Hangouts or Zoom meeting to try to debug this a bit more (without you having to send the logs, etc.).

@AaronFriel
Author

@kozlovic We haven't changed our subscriptions recently; that's all version-controlled and I can see that nothing has changed there in many weeks.

Happy to hop on a video chat, can you send me an email to aaron@twochairs.com?

@piotrlipiarz-ef

@AaronFriel have you managed to resolve your issue? We are currently experiencing a very similar issue, receiving messages that were acknowledged days ago. We recently correlated this with the server exiting and logs like:

[1] 2020/11/13 15:45:51.911086 [INF] STREAM: raft: aborting pipeline replication to peer {Voter node-2 tf-nats.node-2.tf-nats}

@AaronFriel
Author

@piotrlipiarz-ef We have yet to diagnose the issue.

@WellsCui

We have the same issue. When the leader changed, the subscription was lost on the client side. The client needs to be restarted to resume the subscription.
