Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-4120 md: pool create error cleanup on corpc send failure #2189

Merged
merged 2 commits into from
Mar 31, 2020

Conversation

kccain
Copy link
Contributor

@kccain kccain commented Mar 25, 2020

With this change, when the management service while performing a pool
create fails to send the MGMT_TGT_CREATE corpc to the pool storage
target servers, the ds_mgmt_tgt_pool_destroy() function is called in
the error handling path. It prevents the following problem.

In rare cases when the dss_rpc_send() call fails (for example with
-DER_TIMEDOUT because the target servers are under heavy load), the
client will react to this timeout by retrying the pool create RPC with
the same UUID. Because target resources have not been destroyed in the
first attempt, this leads to a -DER_EXIST failure.

Signed-off-by: Ken Cain kenneth.c.cain@intel.com

With this change, when the management service while performing a pool
create fails to send the MGMT_TGT_CREATE corpc to the pool storage
target servers, the ds_mgmt_tgt_pool_destroy() function is called in
the error handling path. It prevents the following problem.

In rare cases when the dss_rpc_send() call fails (for example with
-DER_TIMEDOUT because the target servers are under heavy load), the
client will react to this timeout by retrying the pool create RPC with
the same UUID. Because target resources have not been destroyed in the
first attempt, this leads to a -DER_EXIST failure.

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
@kccain kccain force-pushed the kccain/daos_4120 branch from dfb21fd to a1faa8f Compare March 25, 2020 14:19
@kccain kccain marked this pull request as ready for review March 25, 2020 14:22
@kccain kccain requested review from liw and wangdi1 March 25, 2020 14:22
kccain added a commit that referenced this pull request Mar 25, 2020
Cherry picked PR #2189 from daos master to release/0.9 branch.

With this change, when the management service while performing a pool
create fails to send the MGMT_TGT_CREATE corpc to the pool storage
target servers, the ds_mgmt_tgt_pool_destroy() function is called in
the error handling path. It prevents the following problem.

In rare cases when the dss_rpc_send() call fails (for example with
-DER_TIMEDOUT because the target servers are under heavy load), the
client will react to this timeout by retrying the pool create RPC with
the same UUID. Because target resources have not been destroyed in the
first attempt, this leads to a -DER_EXIST failure.

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
@daosbuild1
Copy link
Collaborator

Test stage run_test.sh completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-2189/2/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage run_test.sh completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-2189/3/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage run_test.sh completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-2189/4/display/redirect

@kccain
Copy link
Contributor Author

kccain commented Mar 26, 2020

The two test failures from build #6 appear to be instances of an already-bugged issue DAOS-4295 Random FIO test failures in master/PR runs.

Test / Functional_Hardware_Medium / 1-./io/fio_small.py:FioSmall.test_fio_small;dfuse-bs_256B-sequential-test-hosts-pool-0-1-a417 – FioSmall
2m 20s
Test / Functional_Hardware_Medium / 2-./io/fio_small.py:FioSmall.test_fio_small;dfuse-bs_256B-random-test-hosts-pool-0-1-e2ea – FioSmall

wangdi1
wangdi1 previously approved these changes Mar 26, 2020
Copy link
Contributor

@wangdi1 wangdi1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need more complete way to make the whole process atomic. But this patch makes sense to me.

@@ -407,8 +407,11 @@ ds_mgmt_create_pool(uuid_t pool_uuid, const char *group, char *tgt_dev,
if (rc == 0 && DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC))
rc = -DER_TIMEDOUT;
if (rc != 0) {
if (!DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not print the error when a fault has been injected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should print the error message unconditionally (perhaps in a future commit). I thought it would be confusing if dss_rpc_send() actually succeeded and then print out an error message only because the code is injecting the fault after the fact.

If we print unconditinoally, do we care if for example there is a fault injection scenario (test expects to see -DER_TIMEDOUT), but the dss_rpc_send() also happens to actually fail with -DER_TIMEDOUT? No harm, or further confusion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should print the error message unconditionally (perhaps in a future commit). I thought it would be confusing if dss_rpc_send() actually succeeded and then print out an error message only because the code is injecting the fault after the fact.

If we inject a fault, I think we usually expect error messages. (In one extreme, if the D_ERROR line would segfault due to a bug, the fault injection might have caught it.) Hence, I feel the error in this case is acceptable.

If we print unconditinoally, do we care if for example there is a fault injection scenario (test expects to see -DER_TIMEDOUT), but the dss_rpc_send() also happens to actually fail with -DER_TIMEDOUT? No harm, or further confusion?

I recall I had the same question when adding the fault injection. I first placed it before the dss_rpc_send call, but then realized that by placing it after the dss_rpc_send, I could simulate a CoRPC that succeeded on some targets before hitting a timeout. That might outweigh the confusion. Any better options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit nit-picking though, I admit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the patch to make the error print unconditional

@kccain kccain requested a review from a team March 27, 2020 11:46
@kccain
Copy link
Contributor Author

kccain commented Mar 27, 2020

@daos-stack/daos-gatekeeper - as a convenience there is a companion PR for this change that is for the release/0.9 branch. Both can be landed at the same time if that helps reduce overhead.

#2195

liw
liw previously approved these changes Mar 27, 2020
@@ -407,8 +407,11 @@ ds_mgmt_create_pool(uuid_t pool_uuid, const char *group, char *tgt_dev,
if (rc == 0 && DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC))
rc = -DER_TIMEDOUT;
if (rc != 0) {
if (!DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should print the error message unconditionally (perhaps in a future commit). I thought it would be confusing if dss_rpc_send() actually succeeded and then print out an error message only because the code is injecting the fault after the fact.

If we inject a fault, I think we usually expect error messages. (In one extreme, if the D_ERROR line would segfault due to a bug, the fault injection might have caught it.) Hence, I feel the error in this case is acceptable.

If we print unconditinoally, do we care if for example there is a fault injection scenario (test expects to see -DER_TIMEDOUT), but the dss_rpc_send() also happens to actually fail with -DER_TIMEDOUT? No harm, or further confusion?

I recall I had the same question when adding the fault injection. I first placed it before the dss_rpc_send call, but then realized that by placing it after the dss_rpc_send, I could simulate a CoRPC that succeeded on some targets before hitting a timeout. That might outweigh the confusion. Any better options?

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
@kccain kccain dismissed stale reviews from liw and wangdi1 via facda5d March 28, 2020 12:31
@kccain kccain removed the request for review from a team March 28, 2020 12:31
@daosbuild1
Copy link
Collaborator

Test stage Functional_Hardware_Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-2189/7/testReport/(root)/

@kccain kccain requested review from liw and wangdi1 March 28, 2020 21:10
Copy link
Contributor

@liw liw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@kccain kccain requested a review from a team March 30, 2020 17:12
@kccain
Copy link
Contributor Author

kccain commented Mar 30, 2020

Build 7 test failures:

  • 2 fio_small failures like seen in prior builds, DAOS-4295.
  • unrelated to this patch, currently triaging: ior -a DAOS gets dfs_array_write error -1014 -DER_PROTO. Seen in different scenarios in both previous tickets and some newer ones such as DAOS-4394. Examining further to see if this instance needs its own new bug or not.

Also, for release/0.9 branch PR-2195 is ready to land for this change. No test failures found in that PR's most recent build.

@jolivier23 jolivier23 merged commit 4a8c7cc into master Mar 31, 2020
@jolivier23 jolivier23 deleted the kccain/daos_4120 branch March 31, 2020 19:18
jolivier23 pushed a commit that referenced this pull request Mar 31, 2020
Cherry picked PR #2189 from daos master to release/0.9 branch.

With this change, when the management service while performing a pool
create fails to send the MGMT_TGT_CREATE corpc to the pool storage
target servers, the ds_mgmt_tgt_pool_destroy() function is called in
the error handling path. It prevents the following problem.

In rare cases when the dss_rpc_send() call fails (for example with
-DER_TIMEDOUT because the target servers are under heavy load), the
client will react to this timeout by retrying the pool create RPC with
the same UUID. Because target resources have not been destroyed in the
first attempt, this leads to a -DER_EXIST failure.

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants