Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct ae.commit on recovery to equal call to applyCommit(index) #5946

Closed
wants to merge 1 commit into from

Conversation

MauriceVanVeen
Copy link
Member

@MauriceVanVeen MauriceVanVeen commented Oct 1, 2024

A stream would be removed from the cluster when hard-killing all servers directly after the stream has been added (or at least before meta was snapshotted).

To reproduce:

# setup cluster of 3 servers
# once ready, add a R3 stream
nats str add stream --defaults --subjects stream --replicas 3

# now hard kill all servers (I had them running locally with nats-0 on 4222, nats-1 on 4333, and nats-2 on 4444)
fuser 4444/tcp -k -SIGKILL
fuser 4333/tcp -k -SIGKILL
fuser 4222/tcp -k -SIGKILL

Then restart all servers and notice the stream being removed once a new meta leader has been chosen:

$ nats server report jetstream
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                               JetStream Summary                                               │
├─────────┬──────────────┬─────────┬───────────┬──────────┬───────┬────────┬──────┬─────────┬─────────┬─────────┤
│ Server  │ Cluster      │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │ Pending │
├─────────┼──────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┼─────────┤
│ nats-0* │ nats-cluster │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │       0 │
│ nats-1  │ nats-cluster │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │       0 │
│ nats-2  │ nats-cluster │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │       0 │
├─────────┼──────────────┼─────────┼───────────┼──────────┼───────┼────────┼──────┼─────────┼─────────┼─────────┤
│         │              │ 0       │ 0         │ 0        │ 0 B   │ 0 B    │ 0 B  │ 0       │ 0       │       0 │
╰─────────┴──────────────┴─────────┴───────────┴──────────┴───────┴────────┴──────┴─────────┴─────────┴─────────╯

On recovery the only snapshot the server has doesn't contain the stream, as the snapshot was created before the stream was. When a new meta leader is chosen it will create a snapshot and send out to the followers, but because that snapshot is empty the stream will be removed.

This PR ensures that the stream addition is properly replayed on recovery.
Also added a test that reproduces the hard kill scenario, by copying and reverting storage directories to previous state.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

@@ -3452,8 +3452,6 @@ func (js *jetStream) processStreamAssignment(sa *streamAssignment) bool {
return false
}

var didRemove bool
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didRemove used to be set a long time ago, but now it would always be false.

To ensure the meta snapshot is written both for creation and removal, return true at the end instead.

@MauriceVanVeen MauriceVanVeen force-pushed the fix/stream-removed-after-hard-kill branch 2 times, most recently from 3893551 to 650b155 Compare October 1, 2024 13:22
@derekcollison
Copy link
Member

We do not need to snapshot as the log with the add will be replayed. Do you mean the whole machine goes away?

@MauriceVanVeen
Copy link
Member Author

MauriceVanVeen commented Oct 1, 2024

We do not need to snapshot as the log with the add will be replayed.

I haven't seen that to be the case.

Same machine and same storage, just a non-graceful shutdown so the meta snapshot can't be written before restart.

@derekcollison
Copy link
Member

Most of the time the snapshot is not written, and the log remaining after any given snapshot is replayed.

I was testing something the other day and disabled meta snapshots in the monitorCluster (and setLeader) calls and behavior was ok.

Do we have a test that can show what you are seeing?

@MauriceVanVeen
Copy link
Member Author

MauriceVanVeen commented Oct 1, 2024

Do we have a test that can show what you are seeing?

Yeah, this PR has a test that simulates a hard kill by copying the storage directories after AddStream returns successfully.

As well as spinning up a cluster myself locally and doing a real hard kill, results in an empty snapshot being sent and the stream being removed.
Recovery did not show the stream assignment to be replayed.

@derekcollison
Copy link
Member

Sending empty snapshot could be a problem. We should protect against that for sure, but not force a snapshot IMO on add stream.

@neilalexander
Copy link
Member

This might be related to those calls to SendSnapshot() that we definitely need to revisit.

@MauriceVanVeen
Copy link
Member Author

and the log remaining after any given snapshot is replayed

Looked at the replaying, and it does replay it at least.
But when it calls into processAppendEntry nothing really seems to happen there. I would have expected applyCommit to be called?

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the fix/stream-removed-after-hard-kill branch from 90ed0ab to 191de4a Compare October 1, 2024 16:24
@MauriceVanVeen MauriceVanVeen changed the title Fix meta should snapshot after creating/removing a stream Correct ae.commit on recovery to equal call to applyCommit(index) Oct 1, 2024
@MauriceVanVeen
Copy link
Member Author

Pushed a fix for an off-by-one that fixes this issue as well.

index=6 for example, so we would normally call applyCommit(index). But the AppendEntry has commit=5. So if the n.commit=5 on recovery it would not be applied since ae.commit > n.commit is false. Correcting ae.commit to equal to index, that ensures we'll call applyCommit in processAppendEntry.
https://github.com/nats-io/nats-server/pull/5946/files#diff-0a3d4e9b87bb3880bdb735fa489ef618b571de3370dae31d8f509b26e2fb852eR475

@MauriceVanVeen
Copy link
Member Author

The above was an incorrect fix, instead came to another alternative which actually turned out to be @neilalexander's PR #5700.

This PR can be closed when Neil's PR is approved/merged 🎉

@MauriceVanVeen
Copy link
Member Author

Test was included in #5700 and got merged, closing.

@MauriceVanVeen MauriceVanVeen deleted the fix/stream-removed-after-hard-kill branch October 4, 2024 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants