(2.11) Don't send meta snapshot when becoming metaleader #5700

Merged: 3 commits merged into main from neil/jsmetasnap on Oct 3, 2024

@neilalexander neilalexander commented Jul 25, 2024

Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted, which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established, so we may end up generating snapshots that omit assets we don't know about yet.

We will want to audit all uses of `SendSnapshot`, as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over.

Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`.

We've also added a new unit test, `TestJetStreamClusterHardKillAfterStreamAdd`, for a long-known issue, as well as a couple of tweaks to the ghost consumer tests to make them reliable.

Signed-off-by: Neil Twigg <neil@nats.io>
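
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not nats-server code: `metaState` and `applySnapshot` are hypothetical names, and the model assumes that a node receiving a snapshot simply replaces its view of stream assignments with the snapshot's contents. Under that assumption, a snapshot built before the sender knew about a stream removes that stream on arrival, and later state recreates it.

```go
package main

import (
	"fmt"
	"sort"
)

// metaState is a hypothetical stand-in for the stream assignments a node knows.
type metaState map[string]bool

// applySnapshot replaces local state with the snapshot contents and reports
// which streams were removed and which were added as a result.
func applySnapshot(local metaState, snap []string) (removed, added []string) {
	want := make(map[string]bool, len(snap))
	for _, s := range snap {
		want[s] = true
	}
	// Anything known locally but missing from the snapshot is dropped.
	for s := range local {
		if !want[s] {
			removed = append(removed, s)
			delete(local, s)
		}
	}
	// Anything in the snapshot but unknown locally is created.
	for s := range want {
		if !local[s] {
			added = append(added, s)
			local[s] = true
		}
	}
	sort.Strings(removed)
	sort.Strings(added)
	return removed, added
}

func main() {
	// The follower already knows about both streams.
	follower := metaState{"ORDERS": true, "EVENTS": true}

	// A snapshot built before its sender learned about EVENTS arrives late.
	removed, _ := applySnapshot(follower, []string{"ORDERS"})
	fmt.Println("removed by stale snapshot:", removed) // [EVENTS]

	// Up-to-date state includes EVENTS again, so the asset is recreated.
	_, added := applySnapshot(follower, []string{"ORDERS", "EVENTS"})
	fmt.Println("recreated afterwards:", added) // [EVENTS]
}
```

In this model the delete-then-recreate cycle comes entirely from the stale snapshot being treated as authoritative, which is the behaviour the PR description attributes to sending a meta snapshot on leader change.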

@neilalexander neilalexander requested a review from a team as a code owner July 25, 2024 09:05
@neilalexander neilalexander marked this pull request as draft July 25, 2024 11:36
Signed-off-by: Neil Twigg <neil@nats.io>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@@ -1703,7 +1703,7 @@ func (o *consumer) deleteNotActive() {
 	// Don't think this needs to be a monitored go routine.
 	go func() {
 		const (
-			startInterval = 30 * time.Second
+			startInterval = 5 * time.Second
Member

Why is this being changed in a PR about meta snapshots?

Member Author

Sending the snapshot was forcing the ghost consumer tests to pass for the wrong reason.

When we removed it, the tests started to flake unless they were given over a minute to run. Reducing the initial time to clean up those ghost consumers de-flaked the tests without having to extend them to such a long runtime.

Member

Yes, but that can dramatically impact production systems with tens or hundreds of thousands of these consumers, hence 30s rather than something shorter.

If need be, derive a var from a const, set the var to what you need in the test, and reset it at the end of the test.

Member Author

OK will do tomorrow.

Member Author

In fact have just done now, as it was a quick change.

Member Author

@neilalexander neilalexander Oct 3, 2024

Just removed the accidental const () that was left over too. Editor language server got me.

Signed-off-by: Neil Twigg <neil@nats.io>
Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
Member

@derekcollison derekcollison left a comment

LGTM

@derekcollison derekcollison merged commit acbca0f into main Oct 3, 2024
5 checks passed
@derekcollison derekcollison deleted the neil/jsmetasnap branch October 3, 2024 21:40
neilalexander added a commit that referenced this pull request Nov 19, 2024
Antithesis testing has found that late or out-of-order delivery of these
snapshots, likely due to latency or thread pauses, can cause stream
assignments to be reverted which results in assets being deleted and
recreated. There may also be a race condition where the metalayer comes
up before network connectivity to all other nodes is fully established
so we may end up generating snapshots that don't include assets we don't
know about yet.

We will want to audit all uses of `SendSnapshot` as it somewhat breaks
the consistency model, especially now that we have fixed a significant
number of Raft bugs that `SendSnapshot` usage may have been papering
over.

Further Antithesis runs without this code run fine and have eliminated a
number of unexpected calls to `processStreamRemoval`.

We've also added a new unit test
`TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as
well as a couple tweaks to the ghost consumer tests to make them
reliable.

Signed-off-by: Neil Twigg <neil@nats.io>

---------

Signed-off-by: Neil Twigg <neil@nats.io>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
derekcollison added a commit that referenced this pull request Nov 20, 2024
We could have an empty apply queue but still have stored uncommitted
entries. If we then call `SendSnapshot` when becoming consumer leader, we
would be reverting to a previous state.

This was also an issue for meta leader changes, which was fixed in
#5700. This PR fixes it for consumer leader changes.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
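
The condition described in this commit message can be pictured with a heavily simplified log model. This is an illustrative sketch under stated assumptions, not the nats-server Raft implementation: `toyLog` and its fields and methods are hypothetical, and the model assumes a snapshot captures only state that has already been applied.

```go
package main

import "fmt"

// toyLog is a hypothetical, heavily simplified Raft-like log.
type toyLog struct {
	entries []string // entries stored in the log, oldest first
	applied int      // number of stored entries already applied to state
	queue   []string // committed entries waiting to be applied
}

// pendingUncommitted returns how many stored entries have not been applied yet.
func (l *toyLog) pendingUncommitted() int {
	return len(l.entries) - l.applied
}

// snapshotState returns what a snapshot taken now would capture:
// only the entries that have already been applied.
func (l *toyLog) snapshotState() []string {
	return append([]string(nil), l.entries[:l.applied]...)
}

func main() {
	// Three entries are stored, two are applied, and nothing is queued.
	l := &toyLog{entries: []string{"ack 1", "ack 2", "ack 3"}, applied: 2}

	fmt.Println("apply queue empty:", len(l.queue) == 0)       // true
	fmt.Println("uncommitted entries:", l.pendingUncommitted()) // 1
	fmt.Println("snapshot would capture:", l.snapshotState())   // [ack 1 ack 2]
}
```

In this model an empty apply queue says nothing about stored entries that have yet to commit, so a snapshot sent at that moment describes older state than the log actually holds.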
MauriceVanVeen pushed a commit that referenced this pull request Nov 21, 2024
neilalexander added a commit that referenced this pull request Nov 22, 2024
neilalexander added a commit that referenced this pull request Nov 25, 2024
neilalexander added a commit that referenced this pull request Nov 25, 2024
Includes the following:

- #5661
- #5666
- #5671
- #5344
- #5684
- #5689
- #5691
- #5714
- #5717
- #5707
- #5792
- #5912
- #5957
- #5700
- #5975
- #5991
- #5987
- #6027
- #6038
- #6053
- #5848
- #6055
- #6056
- #6060
- #6061
- #6072
- #5832
- #6073
- #6107

Signed-off-by: Neil Twigg <neil@nats.io>