replicate createpg message across raft group #136
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
- Coverage   72.43%   72.23%   -0.21%
==========================================
  Files          30       30
  Lines        1295     1390      +95
  Branches      131      143      +12
==========================================
+ Hits          938     1004      +66
- Misses        278      300      +22
- Partials       79       86       +7
if (v.hasError()) { return folly::makeUnexpected(toPgError(v.error())); }
// We will write a PGHeader to the raft group and apply it when the PGHeader is committed on the
// raft group, so that all raft members create the PGInfo and index table for this PG.
return do_create_pg(v.value(), std::move(pg_info));
What do we do if this fails? We need to handle that and either offload it somewhere to retry until it works; or unwind this group optimistically so we don't leak a repl_dev.
Good question!

I do not think we should retry: if we crash after create_repl_dev but before sending the createPG message, then on restart the pginfo is lost and the repl_dev does not know which PG it belongs to.

My proposal:
1. If do_create_pg fails, we just delete the repl_dev (unwind this group). create_repl_dev is infrequent and not on the IO path, so even if this is time-consuming it is acceptable.
2. On restart, after applying all the logs, get the repl_dev list from HomeStore and iterate over all the PGs to find any repl_dev that belongs to no PG, then delete that repl_dev.

If we do this, we need to add two APIs to HomeStore:
1. get_repl_dev_list(repl_dev_type t);
2. delete_repl_dev(uuid)

I think this can make sure we do not leak any repl_dev.
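A rough sketch of what that restart-time cleanup pass could look like, assuming the two proposed HomeStore APIs exist with roughly these shapes; all type and function names here are illustrative placeholders, not the real HomeStore/HomeObject API:

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical placeholder types; the real ones live in HomeStore/HomeObject.
using group_id_t = std::uint64_t;   // the real group id is a uuid
using pg_id_t = std::uint16_t;
struct ReplDevHandle { group_id_t group_id; };
struct PG { group_id_t group_id; };

// Proposed API 1: list all repl_devs known to HomeStore (signature is illustrative).
std::vector< ReplDevHandle > get_repl_dev_list();
// Proposed API 2: destroy a repl_dev by its group id (signature is illustrative).
void delete_repl_dev(group_id_t gid);

// Restart-time cleanup: after all raft logs have been applied, any repl_dev whose
// group is not referenced by a PG never had its createPG message committed, so drop it.
void cleanup_orphan_repl_devs(std::unordered_map< pg_id_t, PG > const& pg_map) {
    for (auto const& rdev : get_repl_dev_list()) {
        bool const owned = std::any_of(pg_map.begin(), pg_map.end(),
                                       [&](auto const& kv) { return kv.second.group_id == rdev.group_id; });
        if (!owned) { delete_repl_dev(rdev.group_id); }
    }
}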
cc @hkadayam
@xiaoxichen the group_id is a randomly generated uuid, and the pg_id is a uint16_t. But yes, we can generate a uuid from a uint16_t and get the uint16_t back out of that previously generated uuid, so we have a natural mapping between the uuid and the uint16_t.

The key point is that the pginfo originally lives in the create_pg_message, which needs to be replicated across the repl_dev (raft group). If this message is not replicated to a majority of the group, it is not committed; on restart, the uncommitted message will probably be dropped, and the repl_dev cannot recover the pginfo from that dropped message.
I am trying to say that we could avoid replicating pginfo at all, and so avoid the 2PC in this PR's proposal. If we created the repl_dev we are all set: followers can construct the pginfo from the repl_dev, and during recovery the pginfo can also be recovered from the superblock of the repl_dev/raft_config.

Of the 3 required fields we already have two: members can be derived from the raft cluster config, and replica_set_uuid is the group_id; the only thing missing is pg_id.

As you said, we can:
- encode pg_id into replica_set_uuid so we can get pg_id back out of replica_set_uuid
- put the pg_id into the _user_ctx field of libnuraft::cluster_config (c.f. the link)
- put the entire serialized_pg_info into _user_ctx

Options 2/3 can be implemented where the cluster_config is loaded, i.e. in RaftReplDev::load_config().
struct PGInfo {
explicit PGInfo(pg_id_t _id) : id(_id) {}
pg_id_t id;
mutable MemberSet members;
peer_id_t replica_set_uuid;
auto operator<=>(PGInfo const& rhs) const { return id <=> rhs.id; }
auto operator==(PGInfo const& rhs) const { return id == rhs.id; }
};
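A minimal sketch of option 1 under the assumption that the group id is a boost uuid (as peer_id_t suggests): stamp the 16-bit pg_id into two fixed bytes of an otherwise random uuid and read it back. The helper names are illustrative only.

#include <cstdint>
#include <boost/uuid/uuid.hpp>
#include <boost/uuid/random_generator.hpp>

using pg_id_t = std::uint16_t;

// Illustrative only: embed the pg_id into the first two bytes of a random uuid so
// the group_id itself carries the pg_id.
inline boost::uuids::uuid make_group_id(pg_id_t pg_id) {
    auto uuid = boost::uuids::random_generator()();
    uuid.data[0] = static_cast< std::uint8_t >(pg_id >> 8);
    uuid.data[1] = static_cast< std::uint8_t >(pg_id & 0xFF);
    return uuid;
}

// Recover the pg_id from a group_id produced by make_group_id().
inline pg_id_t pg_id_from_group_id(boost::uuids::uuid const& uuid) {
    return static_cast< pg_id_t >((uuid.data[0] << 8) | uuid.data[1]);
}

One trade-off: the group id is no longer fully random once two bytes are fixed, which is part of why options 2/3 (carrying the pg_id or the whole serialized pginfo in _user_ctx) may be preferable.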
As a summary, there are two problems:

1. HS side: how to handle create repl_dev failure (for example, failing to add any member to the raft group)? This is the responsibility of HS; we should not handle it on the HO side.
2. HO side: if we get a good repl_dev from HS but fail to create the PG (for example, the createPG message fails to replicate to a majority of the raft group and is never committed, or a crash happens at some point), how do we clean up to avoid leaking the repl_dev?

According to the meeting today, this will be done in phase 3, and by then we will have new features in HomeStore and HomeObject that should help solve this problem. So how about leaving the problem here for now (happy path only) and revisiting it in phase 3?
@xiaoxichen @szmyd @sanebay
Yes, let's create an issue and put a #FIXME in the code.
Sure, done.
@xiaoxichen yes, I know; I brought this up initially. Actually, there is one problem that is the generalized form of all of these problems:
- I (SM) belong to a raft group (repl_dev) that, according to the source of truth (CM), I am not managing a PG for.

All the issues around repl_dev failure, pg_creation failure, migration while a member is offline (orphaned PG)... they generalize to this. Whether a PG exists under that raft group is inconsequential. The solution is what I suggest; anything else is an optimization for a specialization of this issue, and we should not risk complicating the machinery just to correct something that is not immediately necessary to correct.
This comes from years of fighting raft membership issues with AM: we do try to clean up in the cases we can, but we don't worry about the rest, because the global cleanup routine will correct things. Many attempts to optimize resulted in out-of-quorum groups and deadlocks in the code.
We should just do what's easy here to start, so things are correct, and then assess whether there are timing issues that warrant an optimization beyond fire-and-forget cleanup.
@szmyd yes, what you said makes perfect sense. The raft membership churn in SM would be less than in AM, but reconciling based on CM's PGMap is still the right path. There are some timing issues we need to be careful about, though.
@@ -7,6 +7,10 @@ void ReplicationStateMachine::on_commit(int64_t lsn, const sisl::blob& header, c
    LOGI("applying raft log commit with lsn:{}", lsn);
    const ReplicationMessageHeader* msg_header = r_cast< const ReplicationMessageHeader* >(header.cbytes());
    switch (msg_header->msg_type) {
    case ReplicationMessageType::CREATE_PG_MSG: {
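For context, a rough sketch of what the body of this new branch might look like (the rest of the hunk is elided in this excerpt); deserialize_pg_info() and local_create_pg() are hypothetical helper names, not the PR's exact code:

    case ReplicationMessageType::CREATE_PG_MSG: {
        // The serialized PGInfo rides inside the replicated message, so every raft member
        // (leader and followers alike) can build the PG metadata and index table here.
        auto pg_info = deserialize_pg_info(header);   // hypothetical helper
        local_create_pg(std::move(pg_info));          // hypothetical helper
        // hs_ctx is only set on the proposer; it is null when re-applying logs on restart.
        if (hs_ctx) { /* complete the caller's create_pg future (omitted) */ }
        break;
    }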
Could you please add tests?
I think the existing UT already covers this case:

HomeObject/src/lib/tests/fixture_app.cpp, line 41 in c3a1459:

EXPECT_TRUE(homeobj_->pg_manager()->create_pg(std::move(info)).get());

Do we need to add a more specific test case?
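If a more targeted case is wanted, something along these lines could assert that every replica (not just the leader) ends up with the PG. This is a hypothetical sketch: the fixture name, follower_homeobjs_, and pg_manager()->get_pg() are placeholders that would need to be adapted to the real test fixture.

#include <gtest/gtest.h>

TEST_F(HomeObjectFixture, CreatePGReplicatedToAllMembers) {
    pg_id_t const pg_id{1u};
    PGInfo info{pg_id};  // members omitted; the real fixture populates them

    // Propose on the leader; create_pg() should resolve only once the message is committed.
    EXPECT_TRUE(homeobj_->pg_manager()->create_pg(std::move(info)).get());

    // Each follower instance should now also know about the PG (pginfo + index table).
    for (auto const& follower : follower_homeobjs_) {
        EXPECT_TRUE(follower->pg_manager()->get_pg(pg_id).hasValue());
    }
}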
repl_dev->async_read(blkids, value, value.size)
    .thenValue([this, lsn, msg_header = *header_ptr, blkids, value](auto&& err) mutable {
        if (err) {
            LOGW("failed to read data from homestore pba, lsn:{}", lsn);
Should we set the error in ctx?
No, if we get here it is a log re-application during recovery, so hs_ctx is null and we cannot get the ctx.
This is the code I referred to earlier: we don't need async_read() if we write as much information as needed into the header. This is true for the other replicated use cases as well (except shard create and blob put).
@hkadayam yes, what you said makes sense.
Does it also make sense to let small IO (< page size, or even < 16K, etc.) go through raft directly? Most of the metadata ops like createpg/createshard could go through this path.
// If the payload is small enough, propose it directly through raft
if (rreq->value.size > MIN_DATACHANNEL_VALUESIZE) {
    // large payload: go through the data channel
} else {
    // small payload: propose through raft directly
    RD_LOG(INFO, "Skipping data channel send since value size is below the threshold");
    rreq->state.fetch_or(uint32_cast(repl_req_state_t::DATA_WRITTEN));
    m_state_machine->propose_to_raft(std::move(rreq));
}
I have manually tested this PR with three replicas, and it works as expected! @szmyd
This PR can be built successfully after https://github.com/eBay/HomeStore/pull/284/files is merged.
We need to replicate the createPG message across the raft group so that the followers can create the pginfo and index table.