Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-6970 iv: avoid pool connect/disconnect failure #4903

Merged
merged 1 commit into from
Mar 21, 2021
Merged

Conversation

wangdi1
Copy link
Contributor

@wangdi1 wangdi1 commented Mar 7, 2021

Ignore -DER_NONEXIST failure for IV during pool connect and
disconnect, in case some ranks are being stopped at that
time.

Signed-off-by: Di Wang di.wang@intel.com

if (rc == -DER_NONEXIST) {
/* If one of rank is being stopped, it may
* return -DER_NONEXIST, which can be ignored
* during capablity invalidate.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* during capablity invalidate.
* during capability invalidate.

@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Functional_Hardware_Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4903/1/execution/node/1389/log

@@ -110,7 +110,7 @@ crt_corpc_initiate(struct crt_rpc_priv *rpc_priv)
} else {
D_ERROR("crt_grp_lookup_grpid: %s failed.\n",
co_hdr->coh_grpid);
D_GOTO(out, rc = -DER_INVAL);
D_GOTO(out, rc = -DER_NONEXIST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-DER_NONEXIST is probably an error that is too common, especially outside of crt. It'd be safer to find or define a new crt error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DER_OOG? any other suggestion? @frostedcmos @liw ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using a new -DER_GRP or -DER_GRPNOTFOUND, with a very specific error comment, so that people will unlikely use it for other purposes. For instance, write "crt group not found", instead of "group not found", so that it will unlikely be used for "redundancy groups", "placement groups", etc.

@@ -1635,6 +1635,16 @@ cont_close_recs(crt_context_t ctx, struct cont_svc *svc,
rc = cont_iv_capability_invalidate(
svc->cs_pool->sp_iv_ns,
recs[i].tcr_hdl, CRT_IV_SYNC_EAGER);
if (rc == -DER_NONEXIST) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the underlying CoRPC encountered two errors,

  • one on rank X, who had the pool started, due to some network trouble, and
  • one on rank Y, who didn't have the pool started,

then will the "group doesn't exist" error from rank Y hide the "network issue" error from rank X? If it will, when will rank X invalidate the capa in its memory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I am not sure I understand these changes.

If some node in the spanning tree happens to return DER_NONEXIST as it does not have the group then invalidate won't propagate past it to other nodes that will have the group. How is this condition being handled?

Isn't there some retry mechanism that should be kicking in to ensure invalidate request reaches all nodes eventually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If rank stopping will cause the IV not being broadcast to all nodes, then it will need retry indeed. Hmm, return -DER_OOG will cause iv to retry internally.

@liw if one rank is not part of the pool, then iv requests or corpc will not send rpc to this rank? Not sure I follow the case here. The intention here is to not fail for stopping node, but I am not sure which is the better choice, ignore or retry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alex's example is clearer.

Generally, we should retry until this rank is excluded. The gap is the rank may be running but not have the pool started (i.e., not have the pool's SG created). In this case, we don't have a mechanism to trigger a pool map change for excluding this rank. I think we can introduce in-line failure detection (i.e., if a rank does not respond to an RPC in certain ways over certain numbers of retries, we ask the PS to decide whether this rank shall be excluded).

Specifically, the pool handle IV update case is okay, while the invalidation case is problematic. We would have to address the specific capa protocol if we choose not to follow the general rule.

Thoughts?

@wangdi1 wangdi1 requested a review from liw March 9, 2021 20:22
@daosbuild1 daosbuild1 dismissed their stale review March 9, 2021 20:23

Updated patch

if (rc == -DER_SHUTDOWN) {
/* If one of rank is being stopped, it may
* return -DER_SHUTDOWN, which can be ignored
* during capablity invalidate.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* during capablity invalidate.
* during capability invalidate.

@daosbuild1
Copy link
Collaborator

liuxuezhao
liuxuezhao previously approved these changes Mar 11, 2021
*/
D_ERROR("crt_grp_lookup_grpid: %s failed: %d\n",
co_hdr->coh_grpid, -DER_GRPVER);
D_GOTO(out, rc = -DER_GRPVER);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious how can it happen? when pool created the sub-group should be created already, why in bcast the crt_grp_lookup_grpid() possibly fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the rank is being stopped at the same time, then crt_grp_lookup_grpid() on this rank will fail. I posted the log on the ticket you can check, thanks.

if (rc)
if (rc) {
if (rc == -DER_SHUTDOWN) {
D_DEBUG(DF_DSMS, DF_UUID":"DF_UUID" some rank stop.\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ran stop/ranks stopped/ :)

D_DEBUG(DB_MD, DF_UUID" stopping\n",
DP_UUID(ns->iv_pool_uuid));
return NULL;
return -DER_SHUTDOWN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sending -DER_SHUTDOWN from one rank to another sounds strange, if not potentially confusing. It is used to indicate that, locally, a lower layer has encountered an error (e.g., ENOMEM, EIO), and upper layers of the service (e.g., Pool Service) should trigger a shutdown, so that a different service replica could take over. In this PR's case, "the IV NS is shutting down" should not imply that "a different rank should shut down", right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I do not see much of confusion to use SHUTDOWN outside of service change. To me shutdown just means the service is being shutdown somewhere, either switch/ignore/failure depend on the caller.

I am not sure we need invite a new error code here, what error do you suggest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liw any error code suggestion?

@wangdi1 wangdi1 requested a review from liw March 17, 2021 00:56
liw
liw previously approved these changes Mar 17, 2021
1. If the rank is being stopped, then iv operation might
return DER_SHUTDOWN, let's ignore such failure for pool
disconnect/connect.

2. Return GRPVER if the group can not be found in local
SG, so let's the caller retry until the rank is excluded
from the pool map or the pool service is stopped.

Signed-off-by: Di Wang <di.wang@intel.com>
@wangdi1 wangdi1 dismissed stale reviews from liw and liuxuezhao via 2141969 March 17, 2021 17:34
@wangdi1 wangdi1 requested review from liw and liuxuezhao March 17, 2021 17:35
@wangdi1
Copy link
Contributor Author

wangdi1 commented Mar 17, 2021

only updated the patch to fix the spelling error. Please recheck. Thanks. @liw @liuxuezhao

@daosbuild1 daosbuild1 dismissed their stale review March 17, 2021 17:36

Updated patch

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@jolivier23 jolivier23 merged commit 52e4828 into master Mar 21, 2021
@jolivier23 jolivier23 deleted the daos_6970 branch March 21, 2021 16:44
wangdi1 pushed a commit that referenced this pull request Mar 22, 2021
1. If the rank is being stopped, then iv operation might
return DER_SHUTDOWN, let's ignore such failure for pool
disconnect/connect.

2. Return GRPVER if the group can not be found in local
SG, so let's the caller retry until the rank is excluded
from the pool map or the pool service is stopped.

Signed-off-by: Di Wang <di.wang@intel.com>
johannlombardi pushed a commit that referenced this pull request Mar 25, 2021
1. If the rank is being stopped, then iv operation might
return DER_SHUTDOWN, let's ignore such failure for pool
disconnect/connect.

2. Return GRPVER if the group can not be found in local
SG, so let's the caller retry until the rank is excluded
from the pool map or the pool service is stopped.

Signed-off-by: Di Wang <di.wang@intel.com>
@ashleypittman ashleypittman mentioned this pull request Apr 28, 2021
@ashleypittman ashleypittman mentioned this pull request May 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants