-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-7056 object: do not retry internally for migration #5106
Conversation
1. Do not retry internally for migration, because during system shutdown, if the migration is inside the loop of retry, for example keeping refreshing the pool map from the pool leader, then there is no easy way to stop the migration process inside the client stack. So let's return all failure to the migration. If there is failure happens, migration(rebuild) will requeue the job anyway. 2. Add schedule delay time to rebuild task, instead of sleeping directly in rebuild_task_ult(), since it might blocking the current rebuild to finish. Signed-off-by: Di Wang <di.wang@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
@@ -3618,7 +3618,7 @@ obj_comp_cb(tse_task_t *task, void *data) | |||
DAOS_FAIL_CHECK(DAOS_DTX_NO_RETRY)) | |||
obj_auxi->io_retry = 0; | |||
|
|||
if (pm_stale || obj_auxi->io_retry) | |||
if (!obj_auxi->no_retry && (pm_stale || obj_auxi->io_retry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite see why we need a special flag here, I suppose for both client and server stack, it should retry only for recoverable errors, for server shutdown, a non-recoverable error should be returned?
With current change, the rebuild will fail when pool leader changed? Is it acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, leader change or pool map change will cause rebuild fail, but it will retry the rebuild anyway.
The original implementation will retry inside the client object stack, which might cause trouble if it want to abort the rebuild when it is in retry loop inside the object client stack. There is no easy to notify the client stack to stop retry on server side.
So I choose to export all failure to migration, and let migration to check and retry.
rc = daos_gettime_coarse(&cur_ts); | ||
D_ASSERT(rc == 0); | ||
|
||
if (cur_ts < task->dst_schedule_time || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain a bit the purpose of this delay and what's the problem of original dss_sleep(1000)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, the original implementation is blocking the current rebuild finish, see those following part after dss_sleep() in rebuild_task_ult(). So I move the wait into the rebuild task queue.
@NiuYawei @liuxuezhao please check the patch. |
1. Do not retry internally for migration, because during system shutdown, if the migration is inside the loop of retry, for example keeping refreshing the pool map from the pool leader, then there is no easy way to stop the migration process inside the client stack. So let's return all failure to the migration. If there is failure happens, migration(rebuild) will requeue the job anyway. 2. Add schedule delay time to rebuild task, instead of sleeping directly in rebuild_task_ult(), since it might blocking the current rebuild to finish. Signed-off-by: Di Wang <di.wang@intel.com>
1. Do not retry internally for migration, because during system shutdown, if the migration is inside the loop of retry, for example keeping refreshing the pool map from the pool leader, then there is no easy way to stop the migration process inside the client stack. So let's return all failure to the migration. If there is failure happens, migration(rebuild) will requeue the job anyway. 2. Add schedule delay time to rebuild task, instead of sleeping directly in rebuild_task_ult(), since it might blocking the current rebuild to finish. Signed-off-by: Di Wang <di.wang@intel.com>
1. Do not retry internally for migration, because during system shutdown, if the migration is inside the loop of retry, for example keeping refreshing the pool map from the pool leader, then there is no easy way to stop the migration process inside the client stack. So let's return all failure to the migration. If there is failure happens, migration(rebuild) will requeue the job anyway. 2. Add schedule delay time to rebuild task, instead of sleeping directly in rebuild_task_ult(), since it might blocking the current rebuild to finish. Signed-off-by: Di Wang <di.wang@intel.com>
Do not retry internally for migration, because during
system shutdown, if the migration is inside the loop of
retry, for example keeping refreshing the pool map from
the pool leader, then there is no easy way to stop the
migration process inside the client stack. So let's return
all failure to the migration. If there is failure happens,
migration(rebuild) will requeue the job anyway.
Add schedule delay time to rebuild task, instead of sleeping
directly in rebuild_task_ult(), since it might blocking the
current rebuild to finish.
Signed-off-by: Di Wang di.wang@intel.com