-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(2.11) Don't send meta snapshot when becoming metaleader #5700
Conversation
c6cf93d
to
c20906a
Compare
c20906a
to
5549baf
Compare
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity is fully established so we may end up generating snapshots that don't include assets we don't know about yet. We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over. Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`. Signed-off-by: Neil Twigg <neil@nats.io>
5549baf
to
e576cf1
Compare
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
0858231
to
9ef2ed2
Compare
server/consumer.go
Outdated
@@ -1703,7 +1703,7 @@ func (o *consumer) deleteNotActive() { | |||
// Don't think this needs to be a monitored go routine. | |||
go func() { | |||
const ( | |||
startInterval = 30 * time.Second | |||
startInterval = 5 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this being changed in a PR about meta snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending the snapshot was forcing the ghost consumers tests to pass for the wrong reason.
When we removed it, the test started to flake unless it was given over a minute to run. Reducing the initial time to clean up those ghost consumers de-flaked those tests without having to extend the test to such a long runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but that can dramatically impact production systems with 10s or 100s of thousands of these. Hence the 30s vs shorter.
If need be make a var off of a const and set var to what you need in test and reset at the end of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK will do tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact have just done now, as it was a quick change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed the accidental const ()
that was left over too. Editor language server got me.
b45f105
to
9109439
Compare
Signed-off-by: Neil Twigg <neil@nats.io> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
9109439
to
03ed9c1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established so we may end up generating snapshots that don't include assets we don't know about yet. We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over. Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`. We've also added a new unit test `TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable. Signed-off-by: Neil Twigg <neil@nats.io> --------- Signed-off-by: Neil Twigg <neil@nats.io> Signed-off-by: Maurice van Veen <github@mauricevanveen.com> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
We could have an empty apply queue length, but have stored uncommitted entries. If we then call `SendSnapshot` when becoming consumer leader we would be reverting back to previous state. This was also an issue for meta leader changes, which was fixed in #5700. This PR fixes it for consumer leader changes. Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
We could have an empty apply queue length, but have stored uncommitted entries. If we then call `SendSnapshot` when becoming consumer leader we would be reverting back to previous state. This was also an issue for meta leader changes, which was fixed in #5700. This PR fixes it for consumer leader changes. Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established so we may end up generating snapshots that don't include assets we don't know about yet. We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over. Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`. We've also added a new unit test `TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable. Signed-off-by: Neil Twigg <neil@nats.io> --------- Signed-off-by: Neil Twigg <neil@nats.io> Signed-off-by: Maurice van Veen <github@mauricevanveen.com> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established so we may end up generating snapshots that don't include assets we don't know about yet. We will want to audit all uses of `SendSnapshot` as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs that `SendSnapshot` usage may have been papering over. Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to `processStreamRemoval`. We've also added a new unit test `TestJetStreamClusterHardKillAfterStreamAdd` for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable. Signed-off-by: Neil Twigg <neil@nats.io> --------- Signed-off-by: Neil Twigg <neil@nats.io> Signed-off-by: Maurice van Veen <github@mauricevanveen.com> Co-authored-by: Maurice van Veen <github@mauricevanveen.com>
Antithesis testing has found that late or out-of-order delivery of these snapshots, likely due to latency or thread pauses, can cause stream assignments to be reverted which results in assets being deleted and recreated. There may also be a race condition where the metalayer comes up before network connectivity to all other nodes is fully established so we may end up generating snapshots that don't include assets we don't know about yet.
We will want to audit all uses of
SendSnapshot
as it somewhat breaks the consistency model, especially now that we have fixed a significant number of Raft bugs thatSendSnapshot
usage may have been papering over.Further Antithesis runs without this code run fine and have eliminated a number of unexpected calls to
processStreamRemoval
.We've also added a new unit test
TestJetStreamClusterHardKillAfterStreamAdd
for a long-known issue, as well as a couple tweaks to the ghost consumer tests to make them reliable.Signed-off-by: Neil Twigg neil@nats.io