Implementation of Baseline Resync #248
base: main
Conversation
* Add snapshot-related message protocol/structs and the framework for snapshot resync: pg_blob_iterator is the snapshot resync context on the leader (read path); SnapshotReceiveHandler is a newly added structure serving as the snapshot context on the follower (write path); previous implementation code is commented out.
Implemented ReplicationStateMachine::write_snapshot_data() method and SnapshotReceiveHandler logic. Additionally, extracted local_create_pg, local_create_shard, and local_add_blob_info functions for follower data creation.
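For illustration only, the sketch below shows the division of labor described above: a leader-side iterator that yields snapshot objects in order, and a follower-side handler that applies each one. None of these types mirror the actual HomeObject classes; all names and fields are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <utility>
#include <vector>

enum class ObjType { PG_META, SHARD_META, BLOB_BATCH };

struct SnapshotObj {
    ObjType type;
    std::string payload; // stands in for the serialized snapshot message
};

// Leader-side read path: walks local PG state and emits snapshot objects in order.
class PgBlobIteratorSketch {
public:
    explicit PgBlobIteratorSketch(std::vector< SnapshotObj > objs) : objs_{std::move(objs)} {}
    std::optional< SnapshotObj > next() {
        if (idx_ >= objs_.size()) return std::nullopt;
        return objs_[idx_++];
    }

private:
    std::vector< SnapshotObj > objs_;
    std::size_t idx_{0};
};

// Follower-side write path: recreates pg/shard/blob state from each received object.
class SnapshotReceiveHandlerSketch {
public:
    void process(SnapshotObj const& obj) {
        switch (obj.type) {
        case ObjType::PG_META: std::cout << "create pg: " << obj.payload << '\n'; break;
        case ObjType::SHARD_META: std::cout << "create shard: " << obj.payload << '\n'; break;
        case ObjType::BLOB_BATCH: std::cout << "write blob batch: " << obj.payload << '\n'; break;
        }
    }
};

int main() {
    PgBlobIteratorSketch leader{{{ObjType::PG_META, "pg 1"},
                                 {ObjType::SHARD_META, "shard 100"},
                                 {ObjType::BLOB_BATCH, "blobs 0-63"}}};
    SnapshotReceiveHandlerSketch follower;
    while (auto obj = leader.next()) {
        follower.process(*obj);
    }
}
```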
* Add UT for PGBlobIterator
* Fix comments
Extracted SnapshotContext from SnapshotReceiveHandler and added UT for SnapshotReceiveHandler. Additionally, fixed several bugs in the SnapshotReceiveHandler logic.
Update baseline_resync branch with latest commits
Fix some bugs in baseline resync
* Enhance SnapshotReceiveHandler UTs
* Add test and flip for snapshot resync
* Blob duplication handling: this commit addresses blob duplication caused by resync, which mainly happens when a snapshot message is resent during baseline resync or when logs are applied after snapshot completion. Skipping the duplicates avoids unnecessary GC of duplicated data. The mechanism is effective only for duplicated blobs: since we do not record the blks of the `shardInfo` stored in the data service, we cannot skip data writes for duplicated shards. (A simplified sketch of the duplicate-blob guard follows after this list.)
* Reset PG after failures
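The sketch below illustrates the duplicate-blob guard mentioned in the blob duplication commit. It is a simplified, hypothetical model (the real check lives in the snapshot/blob write path and consults the index table); it shows only the idea of skipping the data write when the blob id is already present.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Stand-in for the physical block address (pbas) recorded per blob.
struct BlkId {
    uint64_t start_blk;
    uint32_t nblks;
};

using IndexTable = std::unordered_map< uint64_t /*blob_id*/, BlkId >;

// Returns true if the blob data was written, false if the blob already existed
// (e.g. a resent snapshot object or a log entry replayed after snapshot completion).
bool put_blob_if_absent(IndexTable& index, uint64_t blob_id, BlkId const& pbas) {
    if (index.find(blob_id) != index.end()) { return false; } // skip duplicate, avoid future GC
    // ... write blob data to the allocated blocks here ...
    index.emplace(blob_id, pbas);
    return true;
}

int main() {
    IndexTable index;
    std::cout << put_blob_if_absent(index, 42, {100, 4}) << '\n'; // 1: written
    std::cout << put_blob_if_absent(index, 42, {100, 4}) << '\n'; // 0: duplicate skipped
}
```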
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##             main     #248      +/-   ##
==========================================
+ Coverage   63.15%   64.14%   +0.98%
==========================================
  Files          32       33       +1
  Lines        1900     2418     +518
  Branches      204      281      +77
==========================================
+ Hits         1200     1551     +351
- Misses        600      720     +120
- Partials      100      147      +47
☔ View full report in Codecov by Sentry.
LGTM
bool success = local_add_blob_info(pg_id, blob_info);

if (ctx) {
    ctx->promise_.setValue(success ? BlobManager::Result< BlobInfo >(blob_info)
Could you please help me understand why the condition was changed from 'status != success && !exist_already' to 'status != success'? If add_to_index_table fails because the entry already exists, what is expected to happen? (Or will that happen at all?)
Previously, if the blob already existed in the index table, add_to_index_table failed but on_blob_put_commit still returned success. I think that was intended to handle the log replay scenario. But there are actually two cases: the pbas is the same as the existing one, or the pbas is different (perhaps due to some mistake). Currently, in the add_to_index_table function, if the pbas is the same as the old one it returns success, so we no longer need the && !exist_already condition.
And actually, it's not quite clear at this point in which cases we may encounter a conflicting value in the index table. Let's wait for further tests to uncover more corner cases and optimize our handling accordingly.
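For illustration, here is a minimal sketch of the idempotent behavior described above, using hypothetical simplified types (the real add_to_index_table operates on the homestore index table): inserting the same blob with the same pbas succeeds, while a differing pbas for an existing blob surfaces as a conflict.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct BlkId {
    uint64_t start_blk;
    uint32_t nblks;
    bool operator==(BlkId const& o) const { return start_blk == o.start_blk && nblks == o.nblks; }
};

enum class PutStatus { SUCCESS, CONFLICT };

// Idempotent insert: the same blob with the same pbas is a no-op success,
// a different pbas for an existing blob is reported as a conflict.
PutStatus add_to_index_table_sketch(std::unordered_map< uint64_t, BlkId >& index,
                                    uint64_t blob_id, BlkId const& pbas) {
    auto [it, inserted] = index.try_emplace(blob_id, pbas);
    if (inserted) return PutStatus::SUCCESS;           // fresh insert
    if (it->second == pbas) return PutStatus::SUCCESS; // replay/resync wrote the same pbas
    return PutStatus::CONFLICT;                        // same blob id, different pbas
}

int main() {
    std::unordered_map< uint64_t, BlkId > index;
    std::cout << static_cast< int >(add_to_index_table_sketch(index, 7, {10, 2})) << '\n'; // 0: success
    std::cout << static_cast< int >(add_to_index_table_sketch(index, 7, {10, 2})) << '\n'; // 0: same pbas
    std::cout << static_cast< int >(add_to_index_table_sketch(index, 7, {99, 2})) << '\n'; // 1: conflict
}
```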
}
// sort shard list by <v_chunk_num, lsn> to ensure open shards are positioned after sealed shards within each chunk
std::ranges::sort(shard_list_, [](ShardEntry& a, ShardEntry& b) {
    return a.v_chunk_num != b.v_chunk_num ? a.v_chunk_num < b.v_chunk_num : a.info.lsn < b.info.lsn;
If I understood correctly, this assumes that shards with a smaller v_chunk_num are more likely to be sealed. Could you please explain the reason? IIRC, the shard is selected by available size. (Besides, v_chunk_id may be easier to understand than v_chunk_num, as they are the same.)
We don't care about the shard ordering across different chunks. This is just to ensure that within each chunk the open shard is positioned last.
Let me edit the comment for clarity in the last push.
To keep the commit log history clean, all additional changes in this PR will be placed in the latest commit.
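A small standalone demo of the comparator's effect, using a hypothetical flattened ShardEntry: within each chunk the entry with the largest lsn (the open shard) ends up last, while the ordering across chunks carries no particular meaning.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical, flattened version of the shard entry used only for this demo.
struct ShardEntrySketch {
    uint16_t v_chunk_num;
    int64_t lsn; // creation lsn; the open shard has the largest lsn within its chunk
    bool open;
};

int main() {
    std::vector< ShardEntrySketch > shards{
        {2, 50, true}, {1, 30, false}, {2, 10, false}, {1, 40, true}};

    // Same ordering rule as the comparator under review: group by v_chunk_num,
    // then order by lsn so each chunk's open shard comes after its sealed shards.
    std::ranges::sort(shards, [](ShardEntrySketch const& a, ShardEntrySketch const& b) {
        return a.v_chunk_num != b.v_chunk_num ? a.v_chunk_num < b.v_chunk_num : a.lsn < b.lsn;
    });

    for (auto const& s : shards) {
        std::cout << "chunk " << s.v_chunk_num << " lsn " << s.lsn
                  << (s.open ? " (open)" : " (sealed)") << '\n';
    }
    // Prints chunk 1 sealed(30) then open(40), followed by chunk 2 sealed(10) then open(50).
}
```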
// Check if the pg exists; if yes, clean the stale pg resources (possibly left by a previous snapshot failure) so we resync on a pristine base
if (home_object_->pg_exists(pg_data->pg_id())) {
    LOGI("pg already exists, clean pg resources before snapshot, pg_id:{} {}", pg_data->pg_id(), log_suffix);
    home_object_->pg_destroy(pg_data->pg_id());
Is this the reason for src/lib/homestore_backend/hs_blob_manager.cpp:L363-369?
src/lib/homestore_backend/hs_blob_manager.cpp:L363-369 is more of a general defensive safety check: if the pg does not exist, it cannot proceed, similar to the existing shard check. In my understanding, resetting and creating the pg is the first message before any blob is created in the baseline resync (and if it fails, the same message is retried), so this line should not trip that check.
I have an additional question regarding the baseline resync: if the leader hits a timeout during the baseline resync, what happens on the leader and follower side? If the leader resends the snapshot, will the follower recreate a new snapshot? Or will the leader not resend, with raft re-reading and re-saving the snapshot obj until the obj is successfully written or fails?
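To illustrate the "pristine base" behavior discussed in this thread, here is a minimal sketch with a hypothetical HomeObject stand-in (not the real interface): when the PG metadata object is received, including a resend after an earlier failed attempt, any existing PG is destroyed and recreated so the resync restarts cleanly.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>

// Hypothetical stand-in for the HomeObject interface used by the receive handler.
struct HomeObjectSketch {
    std::unordered_set< uint16_t > pgs;
    bool pg_exists(uint16_t pg_id) const { return pgs.count(pg_id) > 0; }
    void pg_destroy(uint16_t pg_id) { pgs.erase(pg_id); }
    void pg_create(uint16_t pg_id) { pgs.insert(pg_id); }
};

// Called whenever the snapshot's PG metadata object is received (including resends):
// wipe any stale PG left by an earlier failed attempt, then recreate it.
void apply_pg_meta(HomeObjectSketch& ho, uint16_t pg_id) {
    if (ho.pg_exists(pg_id)) {
        std::cout << "pg " << pg_id << " already exists, cleaning stale resources before resync\n";
        ho.pg_destroy(pg_id);
    }
    ho.pg_create(pg_id);
}

int main() {
    HomeObjectSketch ho;
    apply_pg_meta(ho, 1); // first attempt
    apply_pg_meta(ho, 1); // snapshot resent after a failure: start again from a pristine base
}
```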
This PR contains all changes from the branch baseline_resync. The commit history has been reworked for simplicity and clarity.