replicate createpg message across raft group #136
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
- Coverage   72.43%   72.23%   -0.21%
==========================================
  Files          30       30
  Lines        1295     1390      +95
  Branches      131      143      +12
==========================================
+ Hits          938     1004      +66
- Misses        278      300      +22
- Partials       79       86       +7
if (v.hasError()) { return folly::makeUnexpected(toPgError(v.error())); }
// We will write a PGHeader to the raft group and apply it when the PGHeader is committed on the
// raft group, so that all raft members create the PGInfo and index table for this PG.
return do_create_pg(v.value(), std::move(pg_info));
What do we do if this fails? We need to handle that and either offload it somewhere to retry until it works; or unwind this group optimistically so we don't leak a repl_dev.
Good question!

I do not think we should retry: if we crash after create_repl_dev but before sending the createPG message, then on restart the pginfo is lost and the repl_dev does not know which PG it belongs to.

My proposal:
1. If do_create_pg fails, we just delete the repl_dev (unwind this group). create_repl_dev is infrequent and not on the IO path, so even if this is time-consuming it is acceptable.
2. On restart, after applying all the logs, get the repl_dev list from HomeStore and iterate over all the PGs to find any repl_dev that belongs to no PG, then delete that repl_dev.

If we do this, we need to add two APIs to HomeStore:
1. get_repl_dev_list(repl_dev_type t);
2. delete_repl_dev(uuid)

I think this can make sure we do not leak any repl_dev.
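A rough sketch of what that restart-time cleanup pass could look like, assuming the two proposed HomeStore APIs exist with roughly these shapes; all type and function names here are illustrative placeholders, not the real HomeStore/HomeObject API:

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical placeholder types; the real ones live in HomeStore/HomeObject.
using group_id_t = std::uint64_t;   // the real group id is a uuid
using pg_id_t = std::uint16_t;
struct ReplDevHandle { group_id_t group_id; };
struct PG { group_id_t group_id; };

// Proposed API 1: list all repl_devs known to HomeStore (signature is illustrative).
std::vector< ReplDevHandle > get_repl_dev_list();
// Proposed API 2: destroy a repl_dev by its group id (signature is illustrative).
void delete_repl_dev(group_id_t gid);

// Restart-time cleanup: after all raft logs have been applied, any repl_dev whose
// group is not referenced by a PG never had its createPG message committed, so drop it.
void cleanup_orphan_repl_devs(std::unordered_map< pg_id_t, PG > const& pg_map) {
    for (auto const& rdev : get_repl_dev_list()) {
        bool const owned = std::any_of(pg_map.begin(), pg_map.end(),
                                       [&](auto const& kv) { return kv.second.group_id == rdev.group_id; });
        if (!owned) { delete_repl_dev(rdev.group_id); }
    }
}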
cc @hkadayam
@xiaoxichen the group_id is a randomly generated uuid, and the pg_id is a uint16_t. But yes, we can generate a uuid from a uint16_t and get the uint16_t back out of that previously generated uuid, so we have a natural mapping between the uuid and the uint16_t.

The key point is that the pginfo originally lives in the create_pg_message, which needs to be replicated across the repl_dev (raft group). If this message is not replicated to a majority of the group, it is not committed; on restart, the uncommitted message will probably be dropped, and the repl_dev cannot recover the pginfo from that dropped message.
I am trying to say that we could avoid replicating pginfo at all, and so avoid the 2PC in this PR's proposal. If we created the repl_dev we are all set: followers can construct the pginfo from the repl_dev, and during recovery the pginfo can also be recovered from the superblock of the repl_dev/raft_config.

Of the 3 required fields we already have two: members can be derived from the raft cluster config, and replica_set_uuid is the group_id; the only thing missing is pg_id.

As you said, we can:
- encode pg_id into replica_set_uuid so we can get pg_id back out of replica_set_uuid
- put the pg_id into the _user_ctx field of libnuraft::cluster_config (c.f. the link)
- put the entire serialized_pg_info into _user_ctx

Options 2/3 can be implemented where the cluster_config is loaded, i.e. in RaftReplDev::load_config().
struct PGInfo {
explicit PGInfo(pg_id_t _id) : id(_id) {}
pg_id_t id;
mutable MemberSet members;
peer_id_t replica_set_uuid;
auto operator<=>(PGInfo const& rhs) const { return id <=> rhs.id; }
auto operator==(PGInfo const& rhs) const { return id == rhs.id; }
};
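A minimal sketch of option 1 under the assumption that the group id is a boost uuid (as peer_id_t suggests): stamp the 16-bit pg_id into two fixed bytes of an otherwise random uuid and read it back. The helper names are illustrative only.

#include <cstdint>
#include <boost/uuid/uuid.hpp>
#include <boost/uuid/random_generator.hpp>

using pg_id_t = std::uint16_t;

// Illustrative only: embed the pg_id into the first two bytes of a random uuid so
// the group_id itself carries the pg_id.
inline boost::uuids::uuid make_group_id(pg_id_t pg_id) {
    auto uuid = boost::uuids::random_generator()();
    uuid.data[0] = static_cast< std::uint8_t >(pg_id >> 8);
    uuid.data[1] = static_cast< std::uint8_t >(pg_id & 0xFF);
    return uuid;
}

// Recover the pg_id from a group_id produced by make_group_id().
inline pg_id_t pg_id_from_group_id(boost::uuids::uuid const& uuid) {
    return static_cast< pg_id_t >((uuid.data[0] << 8) | uuid.data[1]);
}

One trade-off: the group id is no longer fully random once two bytes are fixed, which is part of why options 2/3 (carrying the pg_id or the whole serialized pginfo in _user_ctx) may be preferable.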
As a summary, there are two problems:

1. HS side: how to handle create repl_dev failure (for example, failing to add any member to the raft group)? This is the responsibility of HS; we should not handle it on the HO side.
2. HO side: if we get a good repl_dev from HS but fail to create the PG (for example, the createPG message fails to replicate to a majority of the raft group and is never committed, or a crash happens at some point), how do we clean up to avoid leaking the repl_dev?

According to the meeting today, this will be done in phase 3, and by then we will have new features in HomeStore and HomeObject that should help solve this problem. So how about leaving the problem here for now (happy path only) and revisiting it in phase 3?
@xiaoxichen @szmyd @sanebay
Yes, let's create an issue and put a #FIXME in the code.
Sure, done.
@xiaoxichen yes, I know; I brought this up initially. Actually, there is one problem that is the generalized form of all of these problems:
- I (SM) belong to a raft group (repl_dev) that, according to the source of truth (CM), I am not managing a PG for.

All the issues around repl_dev failure, pg_creation failure, migration while a member is offline (orphaned PG)... they generalize to this. Whether a PG exists under that raft group is inconsequential. The solution is what I suggest; anything else is an optimization for a specialization of this issue, and we should not risk complicating the machinery just to correct something that is not immediately necessary to correct.
This comes from years of fighting raft membership issues with AM: we do try to clean up in the cases we can, but we don't worry about the rest, because the global cleanup routine will correct things. Many attempts to optimize resulted in out-of-quorum groups and deadlocks in the code.
We should just do what's easy here to start, so things are correct, and then assess whether there are timing issues that warrant an optimization beyond fire-and-forget cleanup.
@szmyd yes, what you said makes perfect sense. The raft membership churn in SM would be less than in AM, but reconciling based on CM's PGMap is still the right path. There are some timing issues we need to be careful about, though.
@@ -7,6 +7,10 @@ void ReplicationStateMachine::on_commit(int64_t lsn, const sisl::blob& header, c
    LOGI("applying raft log commit with lsn:{}", lsn);
    const ReplicationMessageHeader* msg_header = r_cast< const ReplicationMessageHeader* >(header.cbytes());
    switch (msg_header->msg_type) {
    case ReplicationMessageType::CREATE_PG_MSG: {
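For context, a rough sketch of what the body of this new branch might look like (the rest of the hunk is elided in this excerpt); deserialize_pg_info() and local_create_pg() are hypothetical helper names, not the PR's exact code:

    case ReplicationMessageType::CREATE_PG_MSG: {
        // The serialized PGInfo rides inside the replicated message, so every raft member
        // (leader and followers alike) can build the PG metadata and index table here.
        auto pg_info = deserialize_pg_info(header);   // hypothetical helper
        local_create_pg(std::move(pg_info));          // hypothetical helper
        // hs_ctx is only set on the proposer; it is null when re-applying logs on restart.
        if (hs_ctx) { /* complete the caller's create_pg future (omitted) */ }
        break;
    }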
Could you please add tests?
I think the existing UT already covers this case:

HomeObject/src/lib/tests/fixture_app.cpp, line 41 in c3a1459:

EXPECT_TRUE(homeobj_->pg_manager()->create_pg(std::move(info)).get());

Do we need to add a more specific test case?
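If a more targeted case is wanted, something along these lines could assert that every replica (not just the leader) ends up with the PG. This is a hypothetical sketch: the fixture name, follower_homeobjs_, and pg_manager()->get_pg() are placeholders that would need to be adapted to the real test fixture.

#include <gtest/gtest.h>

TEST_F(HomeObjectFixture, CreatePGReplicatedToAllMembers) {
    pg_id_t const pg_id{1u};
    PGInfo info{pg_id};  // members omitted; the real fixture populates them

    // Propose on the leader; create_pg() should resolve only once the message is committed.
    EXPECT_TRUE(homeobj_->pg_manager()->create_pg(std::move(info)).get());

    // Each follower instance should now also know about the PG (pginfo + index table).
    for (auto const& follower : follower_homeobjs_) {
        EXPECT_TRUE(follower->pg_manager()->get_pg(pg_id).hasValue());
    }
}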
repl_dev->async_read(blkids, value, value.size)
    .thenValue([this, lsn, msg_header = *header_ptr, blkids, value](auto&& err) mutable {
        if (err) {
            LOGW("failed to read data from homestore pba, lsn:{}", lsn);
Should we set the error in ctx?
No, if we get here it is a log re-application during recovery, so hs_ctx is null and we cannot get the ctx.
This is the code I referred to earlier: we don't need async_read() if we write as much information as needed into the header. This is true for the other replicated use cases as well (except shard create and blob put).
@hkadayam yes, what you said makes sense.
Does it also make sense to let small IO (< page size, or even < 16K, etc.) go through raft directly? Most of the metadata ops like createpg/createshard could go through this path.
// If the payload is small enough, propose it directly through raft
if (rreq->value.size > MIN_DATACHANNEL_VALUESIZE) {
    // large payload: go through the data channel
} else {
    // small payload: propose through raft directly
    RD_LOG(INFO, "Skipping data channel send since value size is below the threshold");
    rreq->state.fetch_or(uint32_cast(repl_req_state_t::DATA_WRITTEN));
    m_state_machine->propose_to_raft(std::move(rreq));
}
I have manually tested this PR with three replicas, and it works as expected! @szmyd
This PR can be built successfully after https://github.com/eBay/HomeStore/pull/284/files is merged.
We need to replicate the createPG message across the raft group so that the followers can create the pginfo and index table.