Support Baseline Resync #596
If I understand correctly, this PR aims to update m_next_dsn on the snapshot receiver side, right?
For the receiver, after baseline resync completes, incremental resync starts, and new log entries are received and committed. According to the code here:
HomeStore/src/lib/replication/repl_dev/raft_repl_dev.cpp
Lines 933 to 934 in 6a2dfd8
m_next_dsn.compare_exchange_strong(cur_dsn, rreq->dsn() + 1);
}
m_next_dsn is updated by handle_commit (when the received log entry is committed), so m_next_dsn will eventually catch up with the leader. So I am not sure whether it is necessary to add a separate step to sync m_next_dsn.
Please correct me if I am wrong or missing something.
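To make the commit-time update discussed above concrete, here is a minimal sketch of raising an atomic counter to at least `dsn + 1` without ever lowering it. The helper name `advance_next_dsn` and the retry loop are assumptions for illustration; only the `compare_exchange` call on `m_next_dsn` appears in the quoted HomeStore snippet, and this is not the actual HomeStore code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a HomeStore API): make sure next_dsn is at least
// committed_dsn + 1. The counter is never moved backwards, so a stale or
// out-of-order commit cannot lower it.
inline uint64_t advance_next_dsn(std::atomic<uint64_t>& next_dsn, uint64_t committed_dsn) {
    assert(committed_dsn < UINT64_MAX);  // committed_dsn + 1 must not overflow
    const uint64_t wanted = committed_dsn + 1;
    uint64_t cur = next_dsn.load();
    // On failure compare_exchange_weak reloads cur with the latest value,
    // so the loop re-checks whether an update is still needed.
    while (cur < wanted && !next_dsn.compare_exchange_weak(cur, wanted)) {
    }
    return next_dsn.load();
}
```

With a counter starting at 5, `advance_next_dsn(counter, 9)` raises it to 10, and a later `advance_next_dsn(counter, 3)` leaves it at 10. The same helper could in principle be reused for a DSN carried in a snapshot, which is the case the rest of this thread is about.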
We discussed this case in the review. The case is:
In this sequence, incremental resync will not kick in to update m_next_dsn. Yes, now it only updates the
Yes, theoretically this can happen. Moreover, we need to trigger a cp_flush after baseline resync is done, to guarantee that m_next_dsn is persisted to the meta service so that this cannot happen even if F1 restarts. cc @yuwmao
OK. In our design the last obj is an empty message, so let's call cp_flush when handling the last message.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff             @@
##           master     #596       +/-   ##
===========================================
+ Coverage   56.51%   66.54%   +10.03%
===========================================
  Files         108      109        +1
  Lines       10300    10810      +510
  Branches     1402     1476       +74
===========================================
+ Hits         5821     7194     +1373
+ Misses       3894     2906      -988
- Partials      585      710      +125
I would suggest doing that in apply_snapshot(); we also update the LSN there.
OK, makes sense.
For NuRaft baseline resync, we separate the process into two layers: the HomeStore layer and the Application layer. We use the first bit of the obj_id to indicate the message type: 0 is for HomeStore (HS), 1 is for Application.
In the HomeStore layer, the leader needs to transmit the DSN to the follower. This is intended to handle the following case:
1. The leader sends a snapshot at LSN T1 to follower F1.
2. F1 fully receives the snapshot and is now at T1.
3. The leader yields its leadership, and F1 is elected leader.
In this sequence, incremental resync never kicks in to update m_next_dsn, and as a result, duplication may occur.
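The obj_id tagging described above can be sketched as follows. This is only an illustration under stated assumptions: it treats obj_id as a 64-bit unsigned integer and reads "first bit" as the most significant bit. The names `SnapshotMsgLayer`, `kLayerBit`, `make_obj_id`, `layer_of`, and `seq_of` are hypothetical and are not taken from HomeStore or NuRaft.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the two layers; values follow the PR description
// (0 = HomeStore, 1 = Application).
enum class SnapshotMsgLayer : uint8_t { HomeStore = 0, Application = 1 };

// Assumption: "first bit" means the most significant bit of a 64-bit obj_id.
constexpr uint64_t kLayerBit = 1ULL << 63;

// Combine a layer tag with a per-layer sequence number.
inline uint64_t make_obj_id(SnapshotMsgLayer layer, uint64_t seq) {
    assert((seq & kLayerBit) == 0);  // seq must fit in the low 63 bits
    return layer == SnapshotMsgLayer::Application ? (seq | kLayerBit) : seq;
}

// Recover the layer from an obj_id.
inline SnapshotMsgLayer layer_of(uint64_t obj_id) {
    return (obj_id & kLayerBit) ? SnapshotMsgLayer::Application : SnapshotMsgLayer::HomeStore;
}

// Recover the sequence number by clearing the layer bit.
inline uint64_t seq_of(uint64_t obj_id) { return obj_id & ~kLayerBit; }
```

A receiver could dispatch on `layer_of(obj_id)`: HomeStore-layer objects (such as one carrying the DSN) are handled internally, while Application-layer objects are passed up with `seq_of(obj_id)`.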
A general comment: is there a plan to add unit test cases (with flip) for the error paths, such as read_snapshot_obj, apply_snapshot_obj, save_snapshot_obj, etc., to verify that the retry logic is exercised in the baseline resync flow?
This is the adaptive change to homestore eBay/HomeStore#596