DAOS-4120 md: pool create error cleanup on corpc send failure #2189

kccain · 2020-03-25T02:44:54Z

With this change, when the management service fails to send the
MGMT_TGT_CREATE corpc to the pool storage target servers during a pool
create, the error handling path calls ds_mgmt_tgt_pool_destroy().
This prevents the following problem.

In rare cases the dss_rpc_send() call fails (for example with
-DER_TIMEDOUT because the target servers are under heavy load), and the
client reacts to the timeout by retrying the pool create RPC with the
same UUID. Because the target resources from the first attempt have not
been destroyed, the retry fails with -DER_EXIST.

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>
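
The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the DAOS code: every `sk_*` name, the error values, and the four-target in-memory "cluster" are hypothetical stand-ins for dss_rpc_send(), ds_mgmt_tgt_pool_destroy(), and the real DER_* codes. It assumes the first send creates resources on some targets before timing out, and that the client then retries with the same UUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative error values only; NOT the real DAOS DER_* numbers. */
#define SK_TIMEDOUT 1
#define SK_EXIST    2
#define SK_NTARGETS 4

/* Hypothetical in-memory model of the targets for one pool UUID. */
struct sk_cluster {
	bool tgt_has_pool[SK_NTARGETS]; /* target i holds pool resources */
	int  fail_after;                /* time out at this target; < 0 = never */
};

/* Stand-in for sending the target-create request to every target. */
static int sk_tgt_create_send(struct sk_cluster *c)
{
	assert(c != NULL);
	for (int i = 0; i < SK_NTARGETS; i++) {
		if (c->fail_after >= 0 && i == c->fail_after)
			return -SK_TIMEDOUT; /* earlier targets keep their resources */
		if (c->tgt_has_pool[i])
			return -SK_EXIST;    /* leftovers from a previous attempt */
		c->tgt_has_pool[i] = true;
	}
	return 0;
}

/* Stand-in for the cleanup call: release resources on all targets. */
static void sk_tgt_pool_destroy(struct sk_cluster *c)
{
	assert(c != NULL);
	for (int i = 0; i < SK_NTARGETS; i++)
		c->tgt_has_pool[i] = false;
}

/* Pool create with optional cleanup in the error handling path. */
static int sk_pool_create(struct sk_cluster *c, bool cleanup_on_error)
{
	int rc = sk_tgt_create_send(c);

	if (rc != 0) {
		fprintf(stderr, "target create send failed: rc=%d\n", rc);
		if (cleanup_on_error)
			sk_tgt_pool_destroy(c);
	}
	return rc;
}

/* First attempt times out partway; the retry reuses the same UUID. */
static int sk_create_then_retry(bool cleanup_on_error)
{
	struct sk_cluster c = { .fail_after = 2 };
	int rc = sk_pool_create(&c, cleanup_on_error);

	if (rc != -SK_TIMEDOUT)
		return rc;
	c.fail_after = -1; /* the load has cleared for the retry */
	return sk_pool_create(&c, cleanup_on_error);
}
```

In this model, `sk_create_then_retry(false)` ends with the "already exists" error on the retry, while `sk_create_then_retry(true)` lets the retry succeed because the first attempt released what it had created.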

Companion PR #2195: cherry picked PR #2189 from daos master to the release/0.9 branch.

daosbuild1 · 2020-03-25T15:04:00Z

Test stage run_test.sh completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-2189/2/display/redirect

daosbuild1 · 2020-03-25T17:24:11Z

Test stage run_test.sh completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-2189/3/display/redirect

daosbuild1 · 2020-03-25T18:32:58Z

Test stage run_test.sh completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-2189/4/display/redirect

kccain · 2020-03-26T12:02:46Z

The two test failures from build #6 appear to be instances of an already-filed issue, DAOS-4295 (Random FIO test failures in master/PR runs):

- Test / Functional_Hardware_Medium / 1-./io/fio_small.py:FioSmall.test_fio_small;dfuse-bs_256B-sequential-test-hosts-pool-0-1-a417 – FioSmall (2m 20s)
- Test / Functional_Hardware_Medium / 2-./io/fio_small.py:FioSmall.test_fio_small;dfuse-bs_256B-random-test-hosts-pool-0-1-e2ea – FioSmall

wangdi1

We probably need a more complete way to make the whole process atomic, but this patch makes sense to me.

liw · 2020-03-27T02:54:02Z

src/mgmt/srv_pool.c

@@ -407,8 +407,11 @@ ds_mgmt_create_pool(uuid_t pool_uuid, const char *group, char *tgt_dev,
 	if (rc == 0 && DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC))
 		rc = -DER_TIMEDOUT;
 	if (rc != 0) {
+		if (!DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC))


Why do we not print the error when a fault has been injected?

kccain · 2020-03-27

Maybe we should print the error message unconditionally (perhaps in a future commit). I thought it would be confusing if dss_rpc_send() actually succeeded and the code then printed an error message only because the fault is injected after the fact.

If we print unconditionally, do we care if, for example, there is a fault injection scenario (the test expects to see -DER_TIMEDOUT) but dss_rpc_send() also happens to actually fail with -DER_TIMEDOUT? No harm, or further confusion?

liw · 2020-03-27T12:55:03Z

> Maybe we should print the error message unconditionally (perhaps in a future commit). I thought it would be confusing if dss_rpc_send() actually succeeded and the code then printed an error message only because the fault is injected after the fact.

If we inject a fault, I think we usually expect error messages. (In one extreme, if the D_ERROR line would segfault due to a bug, the fault injection might have caught it.) Hence, I feel the error in this case is acceptable.

> If we print unconditionally, do we care if, for example, there is a fault injection scenario (the test expects to see -DER_TIMEDOUT) but dss_rpc_send() also happens to actually fail with -DER_TIMEDOUT? No harm, or further confusion?

I recall I had the same question when adding the fault injection. I first placed it before the dss_rpc_send() call, but then realized that by placing it after dss_rpc_send(), I could simulate a CoRPC that succeeded on some targets before hitting a timeout. That might outweigh the confusion. Any better options?

liw · 2020-03-27

It's a bit nit-picking though, I admit.

kccain · 2020-03-28

I've updated the patch to make the error print unconditional.
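
The placement discussed in this thread (inject the fault after the send, then log and clean up unconditionally) can be sketched as follows. This is a hedged illustration rather than the code in src/mgmt/srv_pool.c: the `fi_*` names, the `inject_fault` flag standing in for DAOS_FAIL_CHECK(DAOS_POOL_CREATE_FAIL_CORPC), the counters, and the error value are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative error value only; NOT the real DAOS DER_TIMEDOUT number. */
#define FI_TIMEDOUT 1

/* Hypothetical bookkeeping so the behaviour can be observed. */
struct fi_state {
	bool inject_fault;    /* stand-in for the fault injection check */
	int  targets_created; /* targets currently holding pool resources */
	int  errors_logged;   /* number of error messages printed */
	int  cleanups;        /* number of times cleanup ran */
};

/* Stand-in for the send: here it always succeeds on three targets. */
static int fi_send(struct fi_state *s)
{
	assert(s != NULL);
	s->targets_created = 3;
	return 0;
}

static int fi_pool_create(struct fi_state *s)
{
	int rc = fi_send(s);

	/*
	 * Inject AFTER the send, so the simulated timeout happens with
	 * resources already created on the targets.
	 */
	if (rc == 0 && s->inject_fault)
		rc = -FI_TIMEDOUT;

	if (rc != 0) {
		/* Printed unconditionally, injected fault or not. */
		fprintf(stderr, "send failed: rc=%d\n", rc);
		s->errors_logged++;
		/* Cleanup in the error handling path. */
		s->targets_created = 0;
		s->cleanups++;
	}
	return rc;
}
```

With `inject_fault` set, the sketch reports the timeout, logs one error, and leaves no target resources behind; with it cleared, the create succeeds and nothing is logged or cleaned up.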

kccain · 2020-03-27T11:49:03Z

@daos-stack/daos-gatekeeper - as a convenience there is a companion PR for this change that is for the release/0.9 branch. Both can be landed at the same time if that helps reduce overhead.

#2195

daosbuild1 · 2020-03-28T14:03:06Z

Test stage Functional_Hardware_Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-2189/7/testReport/(root)/

liw

Thanks.

kccain · 2020-03-30T17:16:23Z

Build 7 test failures:

- 2 fio_small failures like those seen in prior builds (DAOS-4295).
- Unrelated to this patch and currently being triaged: ior -a DAOS gets dfs_array_write error -1014 (-DER_PROTO). This has been seen in different scenarios in both previous tickets and some newer ones such as DAOS-4394. Examining further to see whether this instance needs its own new bug.

Also, for the release/0.9 branch, PR-2195 is ready to land for this change. No test failures were found in that PR's most recent build.

kccain force-pushed the kccain/daos_4120 branch from dfb21fd to a1faa8f March 25, 2020 14:19

kccain marked this pull request as ready for review March 25, 2020 14:22

kccain requested review from liw and wangdi1 March 25, 2020 14:22

kccain mentioned this pull request Mar 25, 2020

DAOS-4120 md: pool create error cleanup on corpc send failure #2195

Merged

wangdi1 previously approved these changes Mar 26, 2020

liw approved these changes Mar 27, 2020

kccain requested a review from a team March 27, 2020 11:46

liw previously approved these changes Mar 27, 2020

Always print error message including fault injection cases.

facda5d

Signed-off-by: Ken Cain <kenneth.c.cain@intel.com>

kccain dismissed stale reviews from liw and wangdi1 via facda5d March 28, 2020 12:31

kccain removed the request for review from a team March 28, 2020 12:31

kccain requested review from liw and wangdi1 March 28, 2020 21:10

liw approved these changes Mar 30, 2020

wangdi1 approved these changes Mar 30, 2020

kccain requested a review from a team March 30, 2020 17:12

jolivier23 merged commit 4a8c7cc into master Mar 31, 2020

jolivier23 deleted the kccain/daos_4120 branch March 31, 2020 19:18
