(2.11) Don't send meta snapshot when becoming metaleader #5700

neilalexander · 2024-07-25T09:05:30Z

Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted, which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established, so we may end up generating snapshots that are missing assets we haven't learned about yet.

We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over.

Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`.

We've also added a new unit test `TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable.

Signed-off-by: Neil Twigg <neil@nats.io>
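To illustrate the failure mode described above, here is a minimal toy sketch in Go. It is not the nats-server Raft code: the `snapshot` and `follower` types and their methods are hypothetical, and the model assumes a follower that installs any snapshot it receives wholesale, without comparing indexes. Under that assumption, a snapshot captured before a stream was added, but delivered afterwards, makes the follower treat the newer stream as removed.

```go
package main

import "fmt"

// snapshot is a hypothetical full-state snapshot of stream assignments.
type snapshot struct {
	index   uint64
	streams []string
}

// follower is a toy model of a node tracking stream assignments.
type follower struct {
	applied uint64
	streams map[string]bool
	removed []string // streams dropped as a side effect of installing a snapshot
}

// addStream models applying a normal log entry that assigns a stream.
func (f *follower) addStream(index uint64, name string) {
	f.streams[name] = true
	f.applied = index
}

// installSnapshot replaces local state wholesale. It deliberately does not
// compare s.index with f.applied, which is the assumption this sketch makes.
func (f *follower) installSnapshot(s snapshot) {
	next := make(map[string]bool, len(s.streams))
	for _, name := range s.streams {
		next[name] = true
	}
	for name := range f.streams {
		if !next[name] {
			f.removed = append(f.removed, name)
		}
	}
	f.streams = next
	f.applied = s.index
}

func main() {
	f := &follower{streams: map[string]bool{}}

	f.addStream(1, "ORDERS")
	// Snapshot captured here, before EVENTS exists.
	stale := snapshot{index: 1, streams: []string{"ORDERS"}}

	f.addStream(2, "EVENTS")

	// The stale snapshot arrives late and reverts the assignment.
	f.installSnapshot(stale)
	fmt.Println(f.removed) // prints [EVENTS]
}
```

In this toy model the stream would be recreated once the log entry for `EVENTS` is replayed, which mirrors the delete-and-recreate behaviour reported above.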


derekcollison · 2024-10-03T20:10:27Z

server/consumer.go

@@ -1703,7 +1703,7 @@ func (o *consumer) deleteNotActive() {
 			// Don't think this needs to be a monitored go routine.
 			go func() {
 				const (
-					startInterval = 30 * time.Second
+					startInterval = 5 * time.Second


Why is this being changed in a PR about meta snapshots?

neilalexander · 2024-10-03

Sending the snapshot was forcing the ghost consumer tests to pass for the wrong reason.

When we removed it, the tests started to flake unless they were given over a minute to run. Reducing the initial time to clean up those ghost consumers de-flaked those tests without having to extend them to such a long runtime.

derekcollison · 2024-10-03

Yes, but that can dramatically impact production systems with tens or hundreds of thousands of these. Hence the 30s vs. shorter.

If need be, make a var off of a const, set the var to what you need in the test, and reset it at the end of the test.

neilalexander · 2024-10-03

OK, will do tomorrow.

neilalexander · 2024-10-03

In fact, have just done it now, as it was a quick change.

neilalexander · 2024-10-03

Just removed the accidental `const ()` that was left over too. Editor language server got me.


derekcollison

LGTM


We could have an empty apply queue length, but have stored uncommitted entries. If we then call `SendSnapshot` when becoming consumer leader we would be reverting back to previous state. This was also an issue for meta leader changes, which was fixed in #5700. This PR fixes it for consumer leader changes. Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
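The condition described above can be illustrated with a small toy model. This is a sketch under assumptions, not the nats-server implementation: the `node` type and its fields `commit`, `lastStored` and `applyQueueLen` are hypothetical. It shows only why an empty apply queue alone does not prove that a node's state reflects everything in its log.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a Raft node's progress.
type node struct {
	commit        uint64 // highest committed log index
	lastStored    uint64 // highest log index written to local storage
	applyQueueLen int    // committed entries waiting to be applied
}

// applyQueueEmpty reports whether there is nothing left to apply.
func (n node) applyQueueEmpty() bool {
	return n.applyQueueLen == 0
}

// hasUncommitted reports whether the log holds entries past the commit index.
func (n node) hasUncommitted() bool {
	return n.lastStored > n.commit
}

// safeToSnapshotOnLeaderChange is the stricter check this sketch assumes:
// both conditions must hold before state is treated as complete.
func (n node) safeToSnapshotOnLeaderChange() bool {
	return n.applyQueueEmpty() && !n.hasUncommitted()
}

func main() {
	n := node{commit: 10, lastStored: 12, applyQueueLen: 0}
	fmt.Println(n.applyQueueEmpty())              // prints true
	fmt.Println(n.hasUncommitted())               // prints true
	fmt.Println(n.safeToSnapshotOnLeaderChange()) // prints false
}
```

A snapshot taken in the state shown in `main` would capture index 10 while entries 11 and 12 are still pending, which is the kind of revert to previous state that the commit message describes.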


Includes the following:

- #5661
- #5666
- #5671
- #5344
- #5684
- #5689
- #5691
- #5714
- #5717
- #5707
- #5792
- #5912
- #5957
- #5700
- #5975
- #5991
- #5987
- #6027
- #6038
- #6053
- #5848
- #6055
- #6056
- #6060
- #6061
- #6072
- #5832
- #6073
- #6107

Signed-off-by: Neil Twigg <neil@nats.io>

neilalexander requested a review from a team as a code owner July 25, 2024 09:05

neilalexander marked this pull request as draft July 25, 2024 11:36

neilalexander force-pushed the neil/jsmetasnap branch from c6cf93d to c20906a Compare July 30, 2024 15:13

neilalexander force-pushed the neil/jsmetasnap branch from c20906a to 5549baf Compare September 13, 2024 12:59

neilalexander force-pushed the neil/jsmetasnap branch from 5549baf to e576cf1 Compare October 2, 2024 09:16

Test hard kill after stream add should not remove stream

82c1371

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen force-pushed the neil/jsmetasnap branch from 0858231 to 9ef2ed2 Compare October 2, 2024 12:51

neilalexander marked this pull request as ready for review October 2, 2024 13:46

MauriceVanVeen mentioned this pull request Oct 2, 2024

Correct ae.commit on recovery to equal call to applyCommit(index) #5946

Closed

derekcollison reviewed Oct 3, 2024


neilalexander force-pushed the neil/jsmetasnap branch from b45f105 to 9109439 Compare October 3, 2024 20:51

derekcollison self-requested a review October 3, 2024 20:54

Factor out consumer cleanup times to deflake orphaned consumer tests

03ed9c1

Signed-off-by: Neil Twigg <neil@nats.io> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander force-pushed the neil/jsmetasnap branch from 9109439 to 03ed9c1 Compare October 3, 2024 20:56

derekcollison approved these changes Oct 3, 2024


derekcollison merged commit acbca0f into main Oct 3, 2024
5 checks passed

derekcollison deleted the neil/jsmetasnap branch October 3, 2024 21:40

MauriceVanVeen mentioned this pull request Nov 20, 2024

[FIXED] Don't SendSnapshot on becoming consumer leader #6151

Merged

neilalexander mentioned this pull request Nov 25, 2024

Cherry-picks for 2.10.23-RC.5 #6171

Merged
