replicate createpg message across raft group #136
@@ -7,6 +7,10 @@ void ReplicationStateMachine::on_commit(int64_t lsn, const sisl::blob& header, c
     LOGI("applying raft log commit with lsn:{}", lsn);
     const ReplicationMessageHeader* msg_header = r_cast< const ReplicationMessageHeader* >(header.cbytes());
     switch (msg_header->msg_type) {
+    case ReplicationMessageType::CREATE_PG_MSG: {
+        home_object_->on_create_pg_message_commit(lsn, header, repl_dev(), ctx);
+        break;
+    }
     case ReplicationMessageType::CREATE_SHARD_MSG:
     case ReplicationMessageType::SEAL_SHARD_MSG: {
         home_object_->on_shard_message_commit(lsn, header, pbas, repl_dev(), ctx);

Review comments on the new CREATE_PG_MSG case:

Could you please add tests?

I think the existing UT already covers this case (HomeObject/src/lib/tests/fixture_app.cpp, line 41 in c3a1459). Do we need to add a more specific test case?
@@ -40,6 +44,12 @@ void ReplicationStateMachine::on_rollback(int64_t lsn, sisl::blob const&, sisl::
     LOGI("on_rollback with lsn:{}", lsn);
 }
 
+void ReplicationStateMachine::on_error(ReplServiceError error, const sisl::blob& header, const sisl::blob& key,
+                                       cintrusive< repl_req_ctx >& ctx) {
+    LOGE("on_error with :{}, lsn {}", error, ctx->lsn);
+    // TODO:: block is already freed at homestore side, handle error if necessary.
+}
+
 homestore::blk_alloc_hints ReplicationStateMachine::get_blk_alloc_hints(sisl::blob const& header, uint32_t data_size) {
     const ReplicationMessageHeader* msg_header = r_cast< const ReplicationMessageHeader* >(header.cbytes());
     switch (msg_header->msg_type) {
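The new on_error() above is only a logging stub. The review thread that follows debates what should happen when a CREATE_PG_MSG fails to replicate; purely as an illustrative sketch (not code from this PR), the handler could dispatch on the message type the same way on_commit() does and hand a failed createPG off to a HomeObject-side hook. The hook name on_create_pg_message_error() is hypothetical, and the thread ultimately settles on a #FIXME plus a later cleanup sweep instead.

```cpp
// Illustrative only: one possible shape for the error path discussed below.
// on_create_pg_message_error() is a hypothetical hook, not part of this PR.
void ReplicationStateMachine::on_error(ReplServiceError error, const sisl::blob& header, const sisl::blob& key,
                                       cintrusive< repl_req_ctx >& ctx) {
    LOGE("on_error with :{}, lsn {}", error, ctx->lsn);
    const ReplicationMessageHeader* msg_header = r_cast< const ReplicationMessageHeader* >(header.cbytes());
    switch (msg_header->msg_type) {
    case ReplicationMessageType::CREATE_PG_MSG:
        // The createPG message never committed on a majority: let HomeObject unwind the
        // freshly created repl_dev here, or leave it for a restart-time orphan sweep.
        home_object_->on_create_pg_message_error(error, header, repl_dev(), ctx);
        break;
    default:
        // Other message types: the block is already freed on the homestore side (see TODO above).
        break;
    }
}
```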
What do we do if this fails? We need to handle that and either offload it somewhere to retry until it works; or unwind this group optimistically so we don't leak a repl_dev.
Good question!
I do not think we should retry: if we crash after create_repl_dev and before sending the createPG message, then on restart the pginfo is lost and the repl_dev does not know which PG it belongs to. My solution is:
1. If do_create_pg fails, we just delete the repl_dev (unwind this group). create_repl_dev is not frequent and not in the IO path, so even if this is time-consuming, it is acceptable.
2. When restarting, after applying all the logs, get the repl_dev list from Homestore and iterate over all the PGs to find any repl_dev that belongs to no PG, and delete that repl_dev.
If we do this, we need to add two APIs to Homestore:
1. get_repl_dev_list(repl_dev_type t);
2. delete_repl_dev(uuid)
I think this can make sure we do not leak any repl_dev; a sketch of the sweep in point 2 is below.
cc @hkadayam
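A minimal sketch of that restart-time sweep. The two proposed Homestore APIs (get_repl_dev_list/delete_repl_dev) do not exist yet, so they are passed in here as callables, and the set of group ids already claimed by PGs is assumed to come from the PG map:

```cpp
#include <functional>
#include <set>
#include <vector>
#include <boost/uuid/uuid.hpp>

using group_uuid = boost::uuids::uuid;

// After all raft logs are replayed on restart, drop every repl_dev that no PG claims.
// list_repl_devs / delete_repl_dev stand in for the proposed Homestore APIs;
// pg_owned_groups is whatever set of group ids the PG map reports.
void sweep_orphan_repl_devs(std::function< std::vector< group_uuid >() > const& list_repl_devs,
                            std::function< void(group_uuid const&) > const& delete_repl_dev,
                            std::set< group_uuid > const& pg_owned_groups) {
    for (auto const& gid : list_repl_devs()) {
        if (pg_owned_groups.count(gid) == 0) {
            // The createPG message never committed for this group, so no PG owns it: unwind it.
            delete_repl_dev(gid);
        }
    }
}
```

Since create_repl_dev is off the IO path, running this once after log replay should be cheap enough.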
@xiaoxichen
The group_id is a randomly generated uuid, and the pg_id is a uint16. But yes, we can generate the uuid from the uint16 and get the uint16 back out of the previously generated uuid, so that we have a natural map between uuid and uint16 (a sketch of such a mapping is below).
The key point is that the pginfo originally exists only in the create_pg_message, which needs to be replicated across the repl_dev (raft group). If this message is not replicated to the majority of the group, it is not committed; on restart the uncommitted message will probably be dropped, and the repl_dev cannot get the pginfo from that dropped message.
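For illustration only, one way such a uuid/uint16 mapping could look, using boost::uuids. The function names and the choice of which two bytes carry the pg_id are arbitrary assumptions of this sketch, not something this PR defines:

```cpp
#include <cstdint>
#include <boost/uuid/uuid.hpp>
#include <boost/uuid/random_generator.hpp>

using pg_id_t = uint16_t;

// Stamp the 16-bit pg_id into two fixed bytes of an otherwise random group uuid.
boost::uuids::uuid make_group_uuid(pg_id_t pg_id) {
    auto uuid = boost::uuids::random_generator()();
    auto bytes = uuid.begin();                        // iterator over the 16 raw bytes
    bytes[0] = static_cast< uint8_t >(pg_id >> 8);    // high byte of pg_id
    bytes[1] = static_cast< uint8_t >(pg_id & 0xFF);  // low byte of pg_id
    return uuid;
}

// Recover the pg_id from a uuid produced by make_group_uuid().
pg_id_t pg_id_from_group_uuid(boost::uuids::uuid const& uuid) {
    auto bytes = uuid.begin();
    return static_cast< pg_id_t >((bytes[0] << 8) | bytes[1]);
}
```

With group_id doubling as replica_set_uuid, this would let a member recover the pg_id from the raft config alone, which is what the next comment builds on.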
I am trying to say that we could avoid replicating pginfo at all, to avoid the 2PC in this PR's proposal. If we created the repl_dev we are all set: followers can construct pginfo from the repl_dev. During recovery the pginfo can also be recovered from the sb (superblock) of the repl_dev/raft_config. Of the 3 required fields we already have members, which can be obtained from the raft cluster config, and replica_set_uuid is group_id; the only thing needed is pg_id.
As you said, we can:
1. encode pg_id into replica_set_uuid, so we get pg_id back out from replica_set_uuid;
2. use the _user_ctx field of libnuraft::cluster_config (c.f. link);
3. put serialized_pg_info into _user_ctx.
2/3 can be implemented when the cluster_config is loaded by calling RaftReplDev::load_config() (a sketch is below).
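A rough sketch of options 2/3, assuming nuraft::cluster_config's user-context accessors (set_user_ctx/get_user_ctx) and a trivial string encoding that merely stands in for whatever serialization of pg_id/pg_info HomeObject would actually choose; stash_pg_id/recover_pg_id are illustrative names:

```cpp
#include <cstdint>
#include <string>
#include <libnuraft/nuraft.hxx>

// Option 2: stash the pg_id in the raft cluster_config's opaque user context, which is
// persisted and replicated along with the membership itself. Option 3 would put a fully
// serialized pg_info blob here instead of just the id.
void stash_pg_id(nuraft::cluster_config& cfg, uint16_t pg_id) {
    cfg.set_user_ctx(std::to_string(pg_id));
}

// On a follower or on restart, recover the pg_id once the persisted cluster_config has
// been reloaded (i.e. after RaftReplDev::load_config() has run).
uint16_t recover_pg_id(nuraft::cluster_config& cfg) {
    return static_cast< uint16_t >(std::stoul(cfg.get_user_ctx()));
}
```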
As a summary, there are two problems:
1. HS side: how to handle create repl_dev failure (for example, failing to add any member to the raft group)? This is the responsibility of HS; we should not consider it on the HO side.
2. HO side: if we get a good repl_dev from HS but fail to create the PG (for example, the createPG message fails to be replicated to the majority of the raft group and is not committed, or a crash happens at some moment), how do we clean up to avoid a repl_dev leak?
According to the meeting today, this will be done in phase three, and by then we will have new features in homestore and homeobject which will probably be very helpful for solving this problem. So what about just leaving the problem here for now (happy path only) and revisiting it in phase 3?
@xiaoxichen @szmyd @sanebay
Yes, let's create an issue and put a #FIXME in the code.
Sure, done.
@xiaoxichen yes, I know; I brought this up initially. Actually, there is one problem that is the generalized form of all the problems:
All issues regarding repl_dev failure, pg_creation failure, or migration while a member is offline (orphaned PG) generalize to this one. The existence of a PG under that raft group is inconsequential. The solution is what I suggest; anything else is an optimization for a specialization of this issue, and we need not take the risk of complicating the machinery just to correct something that is not immediately necessary to correct.
This comes from years of fighting raft membership issues with AM: we do try to clean up in the cases that we can, but we don't worry ourselves about it, since the global cleanup routine will correct things. Many attempts to optimize resulted in out-of-quorum groups and deadlocks in the code.
We should just do what's easy here to start, so things are correct, and then assess whether there are timing issues that warrant an optimization beyond fire-and-forget cleanup.
@szmyd yes, what you said makes perfect sense. Raft membership changes in SM will be less frequent than in AM, but reconciling based on CM's PGMap is still the right path. There are some timing issues we need to be careful about, though.