DAOS-7056 object: do not retry internally for migration #5106

wangdi1 · 2021-03-21T02:13:23Z

Do not retry internally for migration. During system
shutdown, if the migration is inside a retry loop, for
example repeatedly refreshing the pool map from the pool
leader, there is no easy way to stop the migration process
inside the client stack. So return all failures to the
migration; if a failure happens, migration (rebuild) will
requeue the job anyway.

Add a schedule delay time to the rebuild task instead of
sleeping directly in rebuild_task_ult(), since the sleep
might block the current rebuild from finishing.

Signed-off-by: Di Wang <di.wang@intel.com>

daosbuild1

LGTM. No errors found by checkpatch.

NiuYawei · 2021-03-22T03:31:15Z

src/object/cli_obj.c

@@ -3618,7 +3618,7 @@ obj_comp_cb(tse_task_t *task, void *data)
 	    DAOS_FAIL_CHECK(DAOS_DTX_NO_RETRY))
 		obj_auxi->io_retry = 0;

-	if (pm_stale || obj_auxi->io_retry)
+	if (!obj_auxi->no_retry && (pm_stale || obj_auxi->io_retry))


I don't quite see why we need a special flag here. I would expect both the client and server stacks to retry only on recoverable errors; on server shutdown, shouldn't a non-recoverable error be returned?

With the current change, will the rebuild fail when the pool leader changes? Is that acceptable?

Yes, a leader change or pool map change will cause the rebuild to fail, but the rebuild will be retried anyway.

The original implementation retries inside the client object stack, which causes trouble if we want to abort the rebuild while it is in a retry loop there. There is no easy way to notify the client stack on the server side to stop retrying.

So I chose to export all failures to migration and let migration check and retry.

NiuYawei · 2021-03-22T03:34:54Z

src/rebuild/srv.c

+			rc = daos_gettime_coarse(&cur_ts);
+			D_ASSERT(rc == 0);
+
+			if (cur_ts < task->dst_schedule_time ||


Could you explain the purpose of this delay, and what the problem was with the original dss_sleep(1000)?

Oh, the original implementation blocks the current rebuild from finishing; see the code that follows dss_sleep() in rebuild_task_ult(). So I moved the wait into the rebuild task queue.

wangdi1 · 2021-03-24T23:24:05Z

@NiuYawei @liuxuezhao please check the patch.

wangdi1 requested review from liuxuezhao and NiuYawei March 21, 2021 02:13

daosbuild1 reviewed Mar 21, 2021

NiuYawei reviewed Mar 22, 2021

wangdi1 requested a review from NiuYawei March 24, 2021 23:15

liuxuezhao approved these changes Mar 25, 2021

NiuYawei approved these changes Mar 25, 2021

johannlombardi merged commit 90d1efe into master Mar 25, 2021

johannlombardi deleted the daos_sys_cleanup branch March 25, 2021 16:10

ashleypittman mentioned this pull request Apr 28, 2021

1.2 dfuse single #5582

Closed

ashleypittman mentioned this pull request May 20, 2021

1.2 proc #5753

Closed
