Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-7056 object: do not retry internally for migration #5106

Merged
merged 1 commit into from
Mar 25, 2021

Conversation

wangdi1
Copy link
Contributor

@wangdi1 wangdi1 commented Mar 21, 2021

  1. Do not retry internally for migration, because during
    system shutdown, if the migration is inside the loop of
    retry, for example keeping refreshing the pool map from
    the pool leader, then there is no easy way to stop the
    migration process inside the client stack. So let's return
    all failure to the migration. If there is failure happens,
    migration(rebuild) will requeue the job anyway.

  2. Add schedule delay time to rebuild task, instead of sleeping
    directly in rebuild_task_ult(), since it might blocking the
    current rebuild to finish.

Signed-off-by: Di Wang di.wang@intel.com

1. Do not retry internally for migration, because during
system shutdown, if the migration is inside the loop of
retry, for example keeping refreshing the pool map from
the pool leader, then there is no easy way to stop the
migration process inside the client stack. So let's return
all failure to the migration. If there is failure happens,
migration(rebuild) will requeue the job anyway.

2. Add schedule delay time to rebuild task, instead of sleeping
directly in rebuild_task_ult(), since it might blocking the
current rebuild to finish.

Signed-off-by: Di Wang <di.wang@intel.com>
@wangdi1 wangdi1 requested review from liuxuezhao and NiuYawei March 21, 2021 02:13
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@@ -3618,7 +3618,7 @@ obj_comp_cb(tse_task_t *task, void *data)
DAOS_FAIL_CHECK(DAOS_DTX_NO_RETRY))
obj_auxi->io_retry = 0;

if (pm_stale || obj_auxi->io_retry)
if (!obj_auxi->no_retry && (pm_stale || obj_auxi->io_retry))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite see why we need a special flag here, I suppose for both client and server stack, it should retry only for recoverable errors, for server shutdown, a non-recoverable error should be returned?

With current change, the rebuild will fail when pool leader changed? Is it acceptable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, leader change or pool map change will cause rebuild fail, but it will retry the rebuild anyway.

The original implementation will retry inside the client object stack, which might cause trouble if it want to abort the rebuild when it is in retry loop inside the object client stack. There is no easy to notify the client stack to stop retry on server side.

So I choose to export all failure to migration, and let migration to check and retry.

rc = daos_gettime_coarse(&cur_ts);
D_ASSERT(rc == 0);

if (cur_ts < task->dst_schedule_time ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain a bit the purpose of this delay and what's the problem of original dss_sleep(1000)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, the original implementation is blocking the current rebuild finish, see those following part after dss_sleep() in rebuild_task_ult(). So I move the wait into the rebuild task queue.

@wangdi1 wangdi1 requested a review from NiuYawei March 24, 2021 23:15
@wangdi1
Copy link
Contributor Author

wangdi1 commented Mar 24, 2021

@NiuYawei @liuxuezhao please check the patch.

@johannlombardi johannlombardi merged commit 90d1efe into master Mar 25, 2021
@johannlombardi johannlombardi deleted the daos_sys_cleanup branch March 25, 2021 16:10
wangdi1 pushed a commit that referenced this pull request Apr 6, 2021
1. Do not retry internally for migration, because during
system shutdown, if the migration is inside the loop of
retry, for example keeping refreshing the pool map from
the pool leader, then there is no easy way to stop the
migration process inside the client stack. So let's return
all failure to the migration. If there is failure happens,
migration(rebuild) will requeue the job anyway.

2. Add schedule delay time to rebuild task, instead of sleeping
directly in rebuild_task_ult(), since it might blocking the
current rebuild to finish.

Signed-off-by: Di Wang <di.wang@intel.com>
wangdi1 pushed a commit that referenced this pull request Apr 7, 2021
1. Do not retry internally for migration, because during
system shutdown, if the migration is inside the loop of
retry, for example keeping refreshing the pool map from
the pool leader, then there is no easy way to stop the
migration process inside the client stack. So let's return
all failure to the migration. If there is failure happens,
migration(rebuild) will requeue the job anyway.

2. Add schedule delay time to rebuild task, instead of sleeping
directly in rebuild_task_ult(), since it might blocking the
current rebuild to finish.

Signed-off-by: Di Wang <di.wang@intel.com>
jolivier23 pushed a commit that referenced this pull request Apr 8, 2021
1. Do not retry internally for migration, because during
system shutdown, if the migration is inside the loop of
retry, for example keeping refreshing the pool map from
the pool leader, then there is no easy way to stop the
migration process inside the client stack. So let's return
all failure to the migration. If there is failure happens,
migration(rebuild) will requeue the job anyway.

2. Add schedule delay time to rebuild task, instead of sleeping
directly in rebuild_task_ult(), since it might blocking the
current rebuild to finish.

Signed-off-by: Di Wang <di.wang@intel.com>
@ashleypittman ashleypittman mentioned this pull request Apr 28, 2021
@ashleypittman ashleypittman mentioned this pull request May 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants