Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Baseline Resync #596

Merged
merged 4 commits into from
Dec 4, 2024
Merged

Support Baseline Resync #596

merged 4 commits into from
Dec 4, 2024

Conversation

yuwmao
Copy link
Contributor

@yuwmao yuwmao commented Nov 25, 2024

For Nuraft baseline resync, we separate the process into two layers: HomeStore layer and Application layer.
We use the first bit of the obj_id to indicate the message type: 0 is for HS, 1 is for Application.

In the HomeStore layer, leader needs to transmit the DSN to the follower, this is intended to handle the following case:

  • Leader sends snapshot at LSN T1 to follower F1.
  • F1 fully receives the snapshot and now at T1.
  • Leader yield its leadership, F1 elected as leader.
    In this sequence the incremental resync will not kicked in to update the m_next_dsn, and as result, duplication may occur.

@yuwmao yuwmao force-pushed the baseline_resync branch 3 times, most recently from 9d3b270 to 2b26867 Compare November 25, 2024 13:43
Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I understand correctly , this pr aims to update the m_next_dsn in the snapshot receiver side, right?

for the receiver , after baseline resync is completed, incremental resync will start, where new log entries will be recevied and committed. according to the code here

m_next_dsn.compare_exchange_strong(cur_dsn, rreq->dsn() + 1);
}

m_next_dsn will be updated by handle_commit (committing the received log entry), and thus m_next_dsn will eventually be up-to-date as leader. so , I am not sure is it necessary to add a separate step to sync m_next_dsn.

pls correct me if I am wrong or miss something

src/lib/replication/repl_dev/raft_repl_dev.cpp Outdated Show resolved Hide resolved
src/include/homestore/replication/repl_dev.h Show resolved Hide resolved
@xiaoxichen
Copy link
Collaborator

We discussed this case in the review,

The case is,

  • Leader sends snapshot at LSN T1 to follower F1.
  • F1 fully receives the snapshot and now at T1.
  • Leader yield its leadership, F1 elected as leader.

In this sequence the incremental resync will not kicked in to update the m_next_dsn.

Yes now it only update the m_next_dsn but more generically , it is a channel for baseline resync at HS level.

@JacksonYao287
Copy link
Contributor

Leader yield its leadership, F1 elected as leader.

yes , theoretically , this will happen. moreover , we need to trigger a cp_flush after baseline resync is done to guarantee m_next_dsn is persisted to metaservice to prevent this from happening even if F1 restarts. cc @yuwmao

@yuwmao
Copy link
Contributor Author

yuwmao commented Nov 28, 2024

Leader yield its leadership, F1 elected as leader.

yes , theoretically , this will happen. moreover , we need to trigger a cp_flush after baseline resync is done to guarantee m_next_dsn is persisted to metaservice to prevent this from happening even if F1 restarts. cc @yuwmao

Ok, in our design, the last obj is an empty message, let's call cp_flush in the last message.

@yuwmao yuwmao force-pushed the baseline_resync branch 2 times, most recently from 56b2b3b to a5f8054 Compare November 28, 2024 08:55
@yuwmao yuwmao requested review from yamingk and sanebay November 28, 2024 08:55
@codecov-commenter
Copy link

codecov-commenter commented Nov 28, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 34.88372% with 28 lines in your changes missing coverage. Please review.

Project coverage is 66.54%. Comparing base (1a0cef8) to head (0cb90b8).
Report is 97 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp 30.76% 18 Missing ⚠️
...rc/lib/replication/repl_dev/raft_state_machine.cpp 37.50% 10 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##           master     #596       +/-   ##
===========================================
+ Coverage   56.51%   66.54%   +10.03%     
===========================================
  Files         108      109        +1     
  Lines       10300    10810      +510     
  Branches     1402     1476       +74     
===========================================
+ Hits         5821     7194     +1373     
+ Misses       3894     2906      -988     
- Partials      585      710      +125     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xiaoxichen
Copy link
Collaborator

Leader yield its leadership, F1 elected as leader.

yes , theoretically , this will happen. moreover , we need to trigger a cp_flush after baseline resync is done to guarantee m_next_dsn is persisted to metaservice to prevent this from happening even if F1 restarts. cc @yuwmao

Ok, in our design, the last obj is an empty message, let's call cp_flush in the last message.

I would suggest do that in apply_snapshot(), we also update lsn there.

@yuwmao
Copy link
Contributor Author

yuwmao commented Nov 28, 2024

Leader yield its leadership, F1 elected as leader.

yes , theoretically , this will happen. moreover , we need to trigger a cp_flush after baseline resync is done to guarantee m_next_dsn is persisted to metaservice to prevent this from happening even if F1 restarts. cc @yuwmao

Ok, in our design, the last obj is an empty message, let's call cp_flush in the last message.

I would suggest do that in apply_snapshot(), we also update lsn there.

Ok, make sense.

For Nuraft baseline resync, we separate the process into two layers: HomeStore layer and Application layer.
We use the first bit of the obj_id to indicate the message type: 0 is for HS, 1 is for Application.

In the HomeStore layer, leader needs to transmit the DSN to the follower, this is intended to handle the following case:

1. Leader sends snapshot at LSN T1 to follower F1.
2. F1 fully receives the snapshot and now at T1.
3. Leader yield its leadership, F1 elected as leader.
In this sequence the incremental resync will not kicked in to update the m_next_dsn, and as result, duplication may occur.
@yamingk
Copy link
Contributor

yamingk commented Dec 3, 2024

A general comment: is there a plan to add unit test case (with flip) for the error path, such as read_snapshot_obj, apply_snapshot_obj, save_snapshot_obj, etc to verify the retry logic will be happening in the baseline resync flow?
It is okay to do it with another PR.

@xiaoxichen xiaoxichen merged commit 89b86ff into eBay:master Dec 4, 2024
21 checks passed
koujl pushed a commit to eBay/HomeObject that referenced this pull request Dec 9, 2024
This is the adaptive change to homestore eBay/HomeStore#596
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants