-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-6970 iv: avoid pool connect/disconnect failure #4903
Conversation
src/container/srv_container.c
Outdated
if (rc == -DER_NONEXIST) { | ||
/* If one of rank is being stopped, it may | ||
* return -DER_NONEXIST, which can be ignored | ||
* during capablity invalidate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* during capablity invalidate. | |
* during capability invalidate. |
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4903/1/execution/node/72/log |
Test stage Functional_Hardware_Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4903/1/execution/node/1389/log |
src/cart/crt_corpc.c
Outdated
@@ -110,7 +110,7 @@ crt_corpc_initiate(struct crt_rpc_priv *rpc_priv) | |||
} else { | |||
D_ERROR("crt_grp_lookup_grpid: %s failed.\n", | |||
co_hdr->coh_grpid); | |||
D_GOTO(out, rc = -DER_INVAL); | |||
D_GOTO(out, rc = -DER_NONEXIST); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-DER_NONEXIST is probably an error that is too common, especially outside of crt. It'd be safer to find or define a new crt error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DER_OOG? any other suggestion? @frostedcmos @liw ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest using a new -DER_GRP or -DER_GRPNOTFOUND, with a very specific error comment, so that people will unlikely use it for other purposes. For instance, write "crt group not found", instead of "group not found", so that it will unlikely be used for "redundancy groups", "placement groups", etc.
src/container/srv_container.c
Outdated
@@ -1635,6 +1635,16 @@ cont_close_recs(crt_context_t ctx, struct cont_svc *svc, | |||
rc = cont_iv_capability_invalidate( | |||
svc->cs_pool->sp_iv_ns, | |||
recs[i].tcr_hdl, CRT_IV_SYNC_EAGER); | |||
if (rc == -DER_NONEXIST) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the underlying CoRPC encountered two errors,
- one on rank X, who had the pool started, due to some network trouble, and
- one on rank Y, who didn't have the pool started,
then will the "group doesn't exist" error from rank Y hide the "network issue" error from rank X? If it will, when will rank X invalidate the capa in its memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I am not sure I understand these changes.
If some node in the spanning tree happens to return DER_NONEXIST as it does not have the group then invalidate won't propagate past it to other nodes that will have the group. How is this condition being handled?
Isn't there some retry mechanism that should be kicking in to ensure invalidate request reaches all nodes eventually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If rank stopping will cause the IV not being broadcast to all nodes, then it will need retry indeed. Hmm, return -DER_OOG will cause iv to retry internally.
@liw if one rank is not part of the pool, then iv requests or corpc will not send rpc to this rank? Not sure I follow the case here. The intention here is to not fail for stopping node, but I am not sure which is the better choice, ignore or retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alex's example is clearer.
Generally, we should retry until this rank is excluded. The gap is the rank may be running but not have the pool started (i.e., not have the pool's SG created). In this case, we don't have a mechanism to trigger a pool map change for excluding this rank. I think we can introduce in-line failure detection (i.e., if a rank does not respond to an RPC in certain ways over certain numbers of retries, we ask the PS to decide whether this rank shall be excluded).
Specifically, the pool handle IV update case is okay, while the invalidation case is problematic. We would have to address the specific capa protocol if we choose not to follow the general rule.
Thoughts?
src/container/srv_container.c
Outdated
if (rc == -DER_SHUTDOWN) { | ||
/* If one of rank is being stopped, it may | ||
* return -DER_SHUTDOWN, which can be ignored | ||
* during capablity invalidate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* during capablity invalidate. | |
* during capability invalidate. |
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4903/2/execution/node/72/log |
*/ | ||
D_ERROR("crt_grp_lookup_grpid: %s failed: %d\n", | ||
co_hdr->coh_grpid, -DER_GRPVER); | ||
D_GOTO(out, rc = -DER_GRPVER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious how can it happen? when pool created the sub-group should be created already, why in bcast the crt_grp_lookup_grpid() possibly fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the rank is being stopped at the same time, then crt_grp_lookup_grpid() on this rank will fail. I posted the log on the ticket you can check, thanks.
src/pool/srv_pool.c
Outdated
if (rc) | ||
if (rc) { | ||
if (rc == -DER_SHUTDOWN) { | ||
D_DEBUG(DF_DSMS, DF_UUID":"DF_UUID" some rank stop.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ran stop/ranks stopped/ :)
D_DEBUG(DB_MD, DF_UUID" stopping\n", | ||
DP_UUID(ns->iv_pool_uuid)); | ||
return NULL; | ||
return -DER_SHUTDOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sending -DER_SHUTDOWN from one rank to another sounds strange, if not potentially confusing. It is used to indicate that, locally, a lower layer has encountered an error (e.g., ENOMEM, EIO), and upper layers of the service (e.g., Pool Service) should trigger a shutdown, so that a different service replica could take over. In this PR's case, "the IV NS is shutting down" should not imply that "a different rank should shut down", right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I do not see much of confusion to use SHUTDOWN outside of service change. To me shutdown just means the service is being shutdown somewhere, either switch/ignore/failure depend on the caller.
I am not sure we need invite a new error code here, what error do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liw any error code suggestion?
1. If the rank is being stopped, then iv operation might return DER_SHUTDOWN, let's ignore such failure for pool disconnect/connect. 2. Return GRPVER if the group can not be found in local SG, so let's the caller retry until the rank is excluded from the pool map or the pool service is stopped. Signed-off-by: Di Wang <di.wang@intel.com>
only updated the patch to fix the spelling error. Please recheck. Thanks. @liw @liuxuezhao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
1. If the rank is being stopped, then iv operation might return DER_SHUTDOWN, let's ignore such failure for pool disconnect/connect. 2. Return GRPVER if the group can not be found in local SG, so let's the caller retry until the rank is excluded from the pool map or the pool service is stopped. Signed-off-by: Di Wang <di.wang@intel.com>
1. If the rank is being stopped, then iv operation might return DER_SHUTDOWN, let's ignore such failure for pool disconnect/connect. 2. Return GRPVER if the group can not be found in local SG, so let's the caller retry until the rank is excluded from the pool map or the pool service is stopped. Signed-off-by: Di Wang <di.wang@intel.com>
Ignore -DER_NONEXIST failure for IV during pool connect and
disconnect, in case some ranks are being stopped at that
time.
Signed-off-by: Di Wang di.wang@intel.com