[IMPROVED] Catchup improvements #3348
Conversation
Need to add a checkFor() in a test to avoid flapping, and I made some suggestions.
```diff
@@ -6637,11 +6678,21 @@ RETRY:
 } else {
 	s.Warnf("Catchup for stream '%s > %s' errored, will retry: %v", mset.account(), mset.name(), err)
 	msgsQ.recycle(&mrecs)
```
I was thinking that in any condition where we return an error or we goto RETRY, we should use the current mrec's reply to send the error back to the leader. The leader does not care about the content at the moment, so it would be backward compatible, but in runCatchup() we could have the sub check whether the FC reply's body is non-empty: if it is, we know that the remote has stopped the catchup and the body content is the error message.

We would need to coordinate that sub with the rest of the go routine loop, but that is perfectly doable. This would help by "releasing" resources in runCatchup, because otherwise that routine will wait up to 5 seconds without activity from the remote before returning (which then releases the outb that counts toward the server-wide limit, if I am not mistaken).
I like that idea; once I free up I will take a look.
LGTM for the current changes; not sure if you want to add the "return error to leader" as a new commit in this PR or not.

I say we do. I am tied up for a few hours; want me to merge so you can have a go at it? If not, I will circle back in a few hours.
We would send skip messages for a sync request that was completely below our current state, but this could be more traffic than we might want. Now we only send EOF, and the other side can detect the skip forward and adjust on a successful catchup. We still send skips if we can partially fill the sync request.

Signed-off-by: Derek Collison <derek@nats.io>
I noticed some contention on the server write lock while investigating a catchup bug. Medium term we could have a separate lock; longer term, formal client support in the server will alleviate this.

Signed-off-by: Derek Collison <derek@nats.io>
Force-pushed 710e987 to 06112d6
@derekcollison I rebased, resolved the conflicts, and force pushed. Will check that we have green and then I can merge.
When we're being asked to provide data within a range during catchup, we should not extend that range and provide more data. Especially since that range was defined by a snapshot, which also specifies which RAFT entries should be sent after processing that snapshot. Extending it would just result in duplicated work and possible desync for the follower, so these lines can safely be removed.

**Timeline of these lines:**

- Previously, when receiving a catchup request, the `FirstSeq` could be moved up to match what the state says: [[IMPROVED] Catchup improvements #3348](https://github.com/nats-io/nats-server/pull/3348/files#diff-5cb252c37caef12e7027803018861c82724b120ddb62cfedc2f36addf57f6970R7132-R7138) (August 2022)
- Afterward this was removed in favor of only extending the `LastSeq`, which was done to solve KeyValue not found issues: [[FIXED] KeyValue not found after server restarts #5054](https://github.com/nats-io/nats-server/pull/5054/files#diff-5cb252c37caef12e7027803018861c82724b120ddb62cfedc2f36addf57f6970R8579-R8588) (February 2024)
- However, this change would have also fixed that case: [[FIXED] Do not bump clfs on seq mismatch when before stream LastSeq #5821](https://github.com/nats-io/nats-server/pull/5821/files#diff-2f4991438bb868a8587303cde9107f83127e88ad70bd19d5c6a31c238a20c299R4694-R4699) (August 2024)

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Some general stability improvements to catchup logic.
Signed-off-by: Derek Collison derek@nats.io
/cc @nats-io/core