[IMPROVED] Catchup improvements #3348

Merged 9 commits from catchup_improvements into main on Aug 8, 2022

Conversation

derekcollison (Member) commented:

Some general stability improvements to catchup logic.

Signed-off-by: Derek Collison derek@nats.io

/cc @nats-io/core

@kozlovic (Member) left a comment:

Need to add a checkFor() in a test to avoid flapping; I also made some suggestions.
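For reference, a minimal sketch of the kind of checkFor-style polling helper this refers to, assuming a helper that retries a condition until a deadline; the signature and test below are illustrative stand-ins, not the exact helper or test from the nats-server suite:

```go
// catchup_example_test.go (illustrative _test.go file, not from the repo).
package main

import (
	"fmt"
	"testing"
	"time"
)

// checkFor polls f every sleepDur until it returns nil or totalWait elapses.
// Polling instead of asserting once is what avoids flapping tests.
func checkFor(t *testing.T, totalWait, sleepDur time.Duration, f func() error) {
	t.Helper()
	deadline := time.Now().Add(totalWait)
	var err error
	for time.Now().Before(deadline) {
		if err = f(); err == nil {
			return
		}
		time.Sleep(sleepDur)
	}
	t.Fatalf("condition not met after %v: %v", totalWait, err)
}

// Example usage in a catchup test: wait for a follower to reach the
// leader's last sequence instead of checking it immediately.
func TestFollowerCaughtUp(t *testing.T) {
	followerLastSeq := func() uint64 { return 10 } // stand-in for real state
	const leaderLastSeq = 10

	checkFor(t, 5*time.Second, 100*time.Millisecond, func() error {
		if seq := followerLastSeq(); seq != leaderLastSeq {
			return fmt.Errorf("follower at %d, want %d", seq, leaderLastSeq)
		}
		return nil
	})
}
```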

@@ -6637,11 +6678,21 @@ RETRY:
} else {
s.Warnf("Catchup for stream '%s > %s' errored, will retry: %v", mset.account(), mset.name(), err)
msgsQ.recycle(&mrecs)

kozlovic (Member) commented on the lines above:

I was thinking that in any condition where we return an error or goto RETRY, we should use the current mrec's reply to send the error back to the leader. The leader does not care about the content at the moment, so it would be backward compatible, but in runCatchup() we could have the sub check whether the FC reply's body is non-empty; if it is, we know that the remote has stopped the catchup and the body content is the error message.
We would need to coordinate between that sub and the rest of the go routine loop, but that is perfectly doable. This would help by "releasing" resources in runCatchup, because otherwise that routine waits up to 5 seconds without activity from the remote before returning (which is what then releases the outb that counts toward the server-wide limit, if I am not mistaken).
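A minimal sketch of this idea, assuming invented stand-in types, names, and subject strings (flowControlMsg, abortCatchup, onFCReply, the $JS.EXAMPLE subject); this is not the actual nats-server code, only an illustration of reusing the mrec reply and treating a non-empty FC reply body as an abort:

```go
// Illustration only: all types, names and the subject string are invented.
package main

import (
	"errors"
	"fmt"
)

// flowControlMsg models a catchup/FC message that carries a reply subject.
type flowControlMsg struct {
	reply string // reply subject of the current mrec
	body  []byte // empty for a normal FC ack, non-empty carries an error text
}

// publish is a stand-in for sending a message on a subject.
func publish(subject string, body []byte) {
	fmt.Printf("publish %q: %q\n", subject, body)
}

// abortCatchup is what the follower could do before returning an error or
// jumping to RETRY: reuse the current mrec's reply to tell the leader why
// the catchup stopped, instead of just going silent.
func abortCatchup(mrec flowControlMsg, err error) {
	publish(mrec.reply, []byte(err.Error()))
}

// onFCReply is what the leader's runCatchup subscription could do: an empty
// body is a normal flow-control ack; a non-empty body means the remote has
// stopped the catchup and the body is the error message, so outstanding
// bytes (outb) can be released immediately instead of waiting for the ~5s
// inactivity timeout.
func onFCReply(body []byte) {
	if len(body) == 0 {
		return // normal FC ack, keep sending
	}
	fmt.Printf("remote aborted catchup: %s\n", body)
}

func main() {
	mrec := flowControlMsg{reply: "$JS.EXAMPLE.CATCHUP.REPLY.1"} // invented subject
	abortCatchup(mrec, errors.New("stream store failed"))
	onFCReply([]byte("stream store failed"))
}
```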

derekcollison (Member, Author) replied:

I like that idea; once I free up I will take a look.

@kozlovic (Member) left a comment:

LGTM for the current changes; not sure if you want to add the "return error to leader" change as a new commit in this PR or not.

derekcollison (Member, Author) replied:

I say we do. I am tied up for a few hours; want me to merge and then you have a go at it? If not, I will circle back in a few hours.

We would send skip messages for a sync request that was completely below our current state, but this could be more traffic than we might want.
Now we only send EOF, and the other side can detect the skip forward and adjust on a successful catchup.
We still send skips if we can partially fill the sync request (see the sketch after the commit list below).

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
While investigating a catchup bug I noticed some contention on the server write lock.
Medium term we could have a separate lock; longer term, formal client support in the server will alleviate this.

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
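Regarding the skip/EOF change in the first commit message above, here is a rough, hedged sketch of the decision; syncRequest, streamState and planCatchup are invented names used only to illustrate the behavior, not the server's actual types or wire protocol:

```go
// Hedged illustration of the commit message above, not nats-server code.
package main

import "fmt"

type syncRequest struct {
	FirstSeq, LastSeq uint64 // range the follower asked for
}

type streamState struct {
	FirstSeq, LastSeq uint64 // range this server can actually serve
}

// planCatchup returns a rough description of what the sender would do.
func planCatchup(req syncRequest, state streamState) string {
	switch {
	case req.LastSeq < state.FirstSeq:
		// Entire request is below our current state: sending one skip per
		// sequence would be wasted traffic, so just send EOF and let the
		// follower detect the skip forward on a successful catchup.
		return "send EOF only"
	case req.FirstSeq < state.FirstSeq:
		// Partial overlap: still send skips for the leading sequences we
		// no longer have, then the real messages up to req.LastSeq.
		return fmt.Sprintf("send skips for %d..%d, then msgs %d..%d",
			req.FirstSeq, state.FirstSeq-1, state.FirstSeq, req.LastSeq)
	default:
		return fmt.Sprintf("send msgs %d..%d", req.FirstSeq, req.LastSeq)
	}
}

func main() {
	state := streamState{FirstSeq: 100, LastSeq: 200}
	fmt.Println(planCatchup(syncRequest{FirstSeq: 10, LastSeq: 50}, state))   // EOF only
	fmt.Println(planCatchup(syncRequest{FirstSeq: 90, LastSeq: 150}, state))  // skips + msgs
	fmt.Println(planCatchup(syncRequest{FirstSeq: 120, LastSeq: 180}, state)) // msgs only
}
```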
kozlovic force-pushed the catchup_improvements branch from 710e987 to 06112d6 on August 8, 2022 at 17:16
kozlovic (Member) commented Aug 8, 2022:

@derekcollison I rebased, resolved the conflicts, and force-pushed. I will check that we are green and then I can merge.

kozlovic merged commit 5924cd6 into main on Aug 8, 2022
kozlovic deleted the catchup_improvements branch on August 8, 2022 at 17:28
derekcollison added a commit that referenced this pull request on Oct 24, 2024:
When we're being asked to provide data within a range during catchup, we should not extend that range and provide more data, especially since that range was defined by a snapshot, which also specifies which RAFT entries should be sent after processing that snapshot. Extending it would just result in duplicated work and a possible desync for the follower, so these lines can safely be removed.
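A hedged sketch of the behavior described in this commit message; syncRequest and catchupRange are invented names for illustration, not the server's actual implementation:

```go
// Illustration only: serve exactly the requested range, never extend it.
package main

import "fmt"

type syncRequest struct {
	FirstSeq, LastSeq uint64 // range defined by the snapshot the follower received
}

// catchupRange returns the sequences the sender should serve. The key point
// of the commit above: do NOT extend req.LastSeq to the stream's current last
// sequence; anything past req.LastSeq is delivered as RAFT entries after the
// snapshot, so sending it here would duplicate work and could desync the
// follower.
func catchupRange(req syncRequest, streamLastSeq uint64) (first, last uint64) {
	last = req.LastSeq
	if last > streamLastSeq {
		// We can still cap to what we actually have.
		last = streamLastSeq
	}
	return req.FirstSeq, last
}

func main() {
	first, last := catchupRange(syncRequest{FirstSeq: 100, LastSeq: 150}, 500)
	fmt.Printf("serve %d..%d (not up to 500)\n", first, last)
}
```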

**Timeline of these lines:**
Previously, when receiving a catchup request, the `FirstSeq` could be moved up to match what the state says: [[IMPROVED] Catchup improvements #3348](https://github.com/nats-io/nats-server/pull/3348/files#diff-5cb252c37caef12e7027803018861c82724b120ddb62cfedc2f36addf57f6970R7132-R7138) (August 2022)

Afterward this was removed in favor of only extending the `LastSeq`: [[FIXED] KeyValue not found after server restarts #5054](https://github.com/nats-io/nats-server/pull/5054/files#diff-5cb252c37caef12e7027803018861c82724b120ddb62cfedc2f36addf57f6970R8579-R8588) (February 2024)

This was done to solve KeyValue-not-found issues. However, this change would have also fixed that case: [[FIXED] Do not bump clfs on seq mismatch when before stream LastSeq #5821](https://github.com/nats-io/nats-server/pull/5821/files#diff-2f4991438bb868a8587303cde9107f83127e88ad70bd19d5c6a31c238a20c299R4694-R4699) (August 2024)

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>