[IMPROVED] Catchup improvements #3348
Conversation
Need to add a checkFor() in a test to avoid flapping, and I made some suggestions.
```diff
@@ -6637,11 +6678,21 @@ RETRY:
 } else {
 	s.Warnf("Catchup for stream '%s > %s' errored, will retry: %v", mset.account(), mset.name(), err)
 	msgsQ.recycle(&mrecs)
```
I was thinking that in any condition where we return an error or we goto RETRY, we should use the current mrec's reply to send the error back to the leader. The leader does not care about the content at the moment, so it would be backward compatible, but in runCatchup() we could have the sub check whether the FC reply's body is non-empty: if it is, we know that the remote has stopped the catchup and the body content is the error message.

We would need to coordinate that sub with the rest of the go routine loop, but that is perfectly doable. This would help by "releasing" resources in runCatchup, because otherwise that routine will wait up to 5 seconds without activity from the remote before returning (which then releases the outb that counts toward the server-wide limit, if I am not mistaken).
I like that idea; once I free up I will take a look.
LGTM for the current changes; not sure if you want to add the "return error to leader" as a new commit in this PR or not.

I say we do. I am tied up for a few hours; want me to merge so you can have a go at it? If not, I will circle back in a few hours.
We would send skip messages for a sync request that was completely below our current state, but this could be more traffic than we might want. Now we only send EOF, and the other side can detect the skip forward and adjust on a successful catchup. We still send skips if we can partially fill the sync request.

Signed-off-by: Derek Collison <derek@nats.io>
I noticed some contention on the server write lock while investigating a catchup bug. Medium term we could have a separate lock; longer term, formal client support in the server will alleviate this.

Signed-off-by: Derek Collison <derek@nats.io>
Force-pushed 710e987 to 06112d6
@derekcollison I rebased, resolved the conflicts, and force pushed. Will check that we have green and then I can merge.
When we're being asked to provide data within a range during catchup, we should not extend that range and provide more data. Especially since that range was defined by a snapshot, which also specifies which RAFT entries should be sent after processing that snapshot. Extending it would just result in duplicated work and possible desync for the follower, so these lines can safely be removed.

**Timeline of these lines:**

- Previously, when receiving a catchup request, the `FirstSeq` could be moved up to match what the state says: [[IMPROVED] Catchup improvements #3348](https://github.com/nats-io/nats-server/pull/3348/files#diff-5cb252c37caef12e7027803018861c82724b120ddb62cfedc2f36addf57f6970R7132-R7138) (August 2022)
- Afterward this was removed in favor of only extending the `LastSeq`, which was done to solve KeyValue not found issues: [[FIXED] KeyValue not found after server restarts #5054](https://github.com/nats-io/nats-server/pull/5054/files#diff-5cb252c37caef12e7027803018861c82724b120ddb62cfedc2f36addf57f6970R8579-R8588) (February 2024)
- However, this change would have also fixed that case: [[FIXED] Do not bump clfs on seq mismatch when before stream LastSeq #5821](https://github.com/nats-io/nats-server/pull/5821/files#diff-2f4991438bb868a8587303cde9107f83127e88ad70bd19d5c6a31c238a20c299R4694-R4699) (August 2024)

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Some general stability improvements to catchup logic.
Signed-off-by: Derek Collison derek@nats.io
/cc @nats-io/core