NRG (2.11): Start catchup from `n.commit` & fix AppendEntry is stored at `seq=ae.pindex+1` #5987

MauriceVanVeen · 2024-10-10T16:59:32Z

This PR makes three complementary fixes to how catchup and truncation are handled.
Specifically:

- When calling `n.loadEntry(index)`, we need to pass where the AppendEntry is in terms of stream sequence. This is equal to `ae.pindex+1`, since `ae.pindex` is the value before it's stored in the stream.
- Start catchup from `n.commit`. We could have messages past our commit that have been invalidated by a switch between leaders and need to be truncated.
- Because we catch up from `n.commit`, we check whether our local AppendEntry matches terms with the incoming AppendEntry. We only need to truncate if the terms don't match.
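The sequence mapping and the term comparison described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the server's code: the `entry` type and the `walSeq` and `needsTruncate` helpers are hypothetical names, and the WAL is modeled as a plain map keyed by stream sequence.

```go
package main

import "fmt"

// entry is a hypothetical, simplified stand-in for an AppendEntry.
type entry struct {
	pindex uint64 // index of the entry that precedes this one
	term   uint64 // term in which this entry was created
}

// walSeq returns the stream sequence at which an entry is stored.
// An entry that follows pindex lands one position later.
func walSeq(ae entry) uint64 {
	return ae.pindex + 1
}

// needsTruncate reports whether the locally stored entry at the incoming
// entry's sequence conflicts with it, i.e. it exists but has another term.
func needsTruncate(wal map[uint64]entry, incoming entry) bool {
	local, ok := wal[walSeq(incoming)]
	return ok && local.term != incoming.term
}

func main() {
	// Local WAL: sequence 3 was written by an old leader in term 1
	// but never reached quorum.
	wal := map[uint64]entry{
		1: {pindex: 0, term: 1},
		2: {pindex: 1, term: 1},
		3: {pindex: 2, term: 1},
	}

	same := entry{pindex: 1, term: 1}     // matches what is stored at seq 2
	replaced := entry{pindex: 2, term: 2} // new leader's entry for seq 3

	fmt.Println(walSeq(same), needsTruncate(wal, same))
	fmt.Println(walSeq(replaced), needsTruncate(wal, replaced))
}
```

In this toy model, the entry that matches the local term is left alone, while the entry from the newer term marks sequence 3 as needing truncation.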

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison · 2024-10-11T18:56:07Z

LMK when ready for review.

neilalexander

I think this looks good, let's mark for review.

mprimi

Minor style comments on tests

mprimi · 2024-10-14T16:35:31Z

server/jetstream_cluster_4_test.go

+	followerServer2.WaitForShutdown()
+
+	// Although this request will time out, it will be added to the stream leader's WAL.
+	_, err = js.Publish("foo", []byte("first"))


Could we set a shorter timeout to make the test faster? (Not sure what the default is.)

MauriceVanVeen · 2024-10-14

Done. The default seemed to be 5 seconds, so I lowered it to 1s.

mprimi · 2024-10-14T16:36:33Z

server/jetstream_cluster_4_test.go

+	streamLeaderServer.WaitForShutdown()
+
+	// Only restart the (previous) followers.
+	followerServer1 = c.restartServer(followerServer1)


Why is one server variable reassigned and not the other?

MauriceVanVeen · 2024-10-14

It's used below to open a connection to that server:

nc, js = jsClientConnect(t, followerServer1)

The connection could be to either server, as long as it's not the (previous) leader, so only this one variable is used to set up the connection.

mprimi · 2024-10-14T16:43:09Z

server/jetstream_cluster_4_test.go

+	rs := c.randomNonStreamLeader(globalAccountName, "TEST")
+	ts := time.Now().UnixNano()
+
+	var scratch [1024]byte


This bit could use an explanation... maybe:

Manually add 3 append entries to each node's WAL, except for one node who is one behind

I'm actually not sure. Your inner loop goes to 3, but then you have a break at 1.

MauriceVanVeen · 2024-10-14

That description is correct. Two servers will have 3 uncommitted entries, and one server will have 2 uncommitted entries, so it needs to catch up for the third one.

I have moved the condition for that one server up, so it's clearer that it gets 2 iterations of the loop.
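As a rough illustration of the loop shape being discussed (not the test's actual code), the sketch below gives every node three entries except one lagging node, which gets two. The `buildWALs` helper, the node names, and the string entries are all hypothetical.

```go
package main

import "fmt"

// buildWALs returns the uncommitted entries each node would hold.
// Every node gets `total` entries, except `behind`, which gets one fewer.
func buildWALs(nodes []string, behind string, total int) map[string][]string {
	wals := make(map[string][]string, len(nodes))
	for _, node := range nodes {
		for i := 0; i < total; i++ {
			// Checked first, so it is obvious that the lagging node
			// skips only the final iteration.
			if node == behind && i == total-1 {
				break
			}
			wals[node] = append(wals[node], fmt.Sprintf("entry-%d", i+1))
		}
	}
	return wals
}

func main() {
	nodes := []string{"S-1", "S-2", "S-3"}
	wals := buildWALs(nodes, "S-3", 3)
	for _, node := range nodes {
		fmt.Println(node, len(wals[node]), wals[node])
	}
}
```

Putting the exit condition at the top of the loop body makes the number of iterations per node readable at a glance, which is the point of the change described in the reply.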

mprimi · 2024-10-14T16:45:22Z

server/jetstream_cluster_4_test.go

+	}
+
+	// Check that the first two published messages came from our WAL, and
+	// the last came from a catchup by another leader.


It seems to me you are doing the same check for all 3 entries you look at. Is this comment maybe outdated?

MauriceVanVeen · 2024-10-14

There are 3 checks: 2x require_Equal and 1x require_NotEqual. I have changed them to use require_True with == and != instead, which seems a bit clearer.
(It doesn't matter what the values being compared are, just whether they match or not.)

mprimi · 2024-10-14T16:51:16Z

server/raft_test.go

+					if len(expected) > 0 && int(state.LastSeq-state.FirstSeq+1) != len(expected) {
+						return fmt.Errorf("WAL is different: too many entries")
+					}
+					for index := state.FirstSeq; index <= state.LastSeq; index++ {


This looped check deserves a comment
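For readers unfamiliar with this kind of check, here is a hedged sketch of one way to compare WALs across nodes. It assumes the first WAL seen becomes the reference and every later WAL must match it entry by entry. The loop body in the PR is not shown above, so the `wal` type and `checkSameWALs` are hypothetical and only mirror the visible `len(expected) > 0` guard.

```go
package main

import (
	"errors"
	"fmt"
)

// wal is a hypothetical in-memory log: entries[i] is stored at
// sequence firstSeq+i.
type wal struct {
	firstSeq uint64
	entries  []string
}

// checkSameWALs verifies that all WALs hold the same entries.
// The first WAL seen populates `expected`; later WALs are compared to it.
func checkSameWALs(wals []wal) error {
	var expected []string
	for _, w := range wals {
		// Once a reference exists, a different entry count is a mismatch.
		if len(expected) > 0 && len(w.entries) != len(expected) {
			return errors.New("WAL is different: entry count mismatch")
		}
		for i, e := range w.entries {
			if len(expected) <= i {
				// No reference for this position yet: record it.
				expected = append(expected, e)
			} else if expected[i] != e {
				return fmt.Errorf("WAL is different at seq %d: %q != %q",
					w.firstSeq+uint64(i), e, expected[i])
			}
		}
	}
	return nil
}

func main() {
	a := wal{firstSeq: 1, entries: []string{"x", "y", "z"}}
	b := wal{firstSeq: 1, entries: []string{"x", "y", "z"}}
	c := wal{firstSeq: 1, entries: []string{"x", "y", "q"}}

	fmt.Println(checkSameWALs([]wal{a, b}))
	fmt.Println(checkSameWALs([]wal{a, c}))
}
```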

mprimi · 2024-10-14T16:54:07Z

server/raft_test.go

+	n, err := s.initRaftNode(globalAccountName, cfg, pprofLabels{})
+	require_NoError(t, err)
+
+	encode := func(ae *appendEntry) *appendEntry {


Comment this bit of arcane magic

neilalexander

LGTM

…6027) Reverts the changes made in #5987, but the tests are kept. Instead opting for a simpler approach:

- removing the `isNew` condition when `pterm` or `pindex` don't match, to ensure consistency even during catchup
- move the `ae.pindex == n.pindex` condition up so `pterm` can be corrected (otherwise it would not be executed)

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

Includes the following:

- #5661
- #5666
- #5671
- #5344
- #5684
- #5689
- #5691
- #5714
- #5717
- #5707
- #5792
- #5912
- #5957
- #5700
- #5975
- #5991
- #5987
- #6027
- #6038
- #6053
- #5848
- #6055
- #6056
- #6060
- #6061
- #6072
- #5832
- #6073
- #6107

Signed-off-by: Neil Twigg <neil@nats.io>

MauriceVanVeen force-pushed the maurice/truncate-entries-without-quorum branch from 848e1d2 to 5bd3d7e October 10, 2024 19:10

MauriceVanVeen added 6 commits October 14, 2024 17:47

NRG: Test desync after publishing to leader without quorum

05bff04

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

NRG: Fix test rn.term must stay the same

eb44111

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

NRG: Start catchup from n.commit & fix AppendEntry is stored at seq=a…

b8dd252

…e.pindex+1

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

NRG: only truncate if AppendEntry differs

2fc9d6f

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

NRG: catchup starts from commit

0012d42

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

NRG: Add single RAFT node test

b62c058

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen force-pushed the maurice/truncate-entries-without-quorum branch from 4fff9ec to b62c058 October 14, 2024 15:48

neilalexander approved these changes Oct 14, 2024

mprimi reviewed Oct 14, 2024

NRG: Feedback

18db9aa

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen marked this pull request as ready for review October 14, 2024 20:43

MauriceVanVeen requested a review from a team as a code owner October 14, 2024 20:43

neilalexander approved these changes Oct 15, 2024

derekcollison merged commit ac5ba12 into main Oct 15, 2024
5 checks passed

derekcollison deleted the maurice/truncate-entries-without-quorum branch October 15, 2024 14:02

MauriceVanVeen mentioned this pull request Oct 21, 2024

NRG (2.11): Truncate entries without quorum from different pterm #6027

Merged

MauriceVanVeen added a commit that referenced this pull request Oct 22, 2024

NRG: Revert implementation from #5987

80bd7b4

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander pushed a commit that referenced this pull request Nov 4, 2024

Updated NRG test helpers from #5987

580a39e

Signed-off-by: Neil Twigg <neil@nats.io>

neilalexander pushed a commit that referenced this pull request Nov 15, 2024

NRG: Revert implementation from #5987

b602dbe

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander pushed a commit that referenced this pull request Nov 19, 2024

NRG: Revert implementation from #5987

4c900ac

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander pushed a commit that referenced this pull request Nov 22, 2024

NRG: Revert implementation from #5987

c1cbbb5

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander pushed a commit that referenced this pull request Nov 25, 2024

NRG: Revert implementation from #5987

7ec99f3

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

neilalexander mentioned this pull request Nov 25, 2024

Cherry-picks for 2.10.23-RC.5 #6171

Merged
