runner:fix the ret when reopen dev failed in cmd stpg #537

lmh10144360 · 2019-03-02T03:16:35Z

If the device is closed successfully but the subsequent
open() fails in tcmu_acquire_dev_lock, the STPG command
fails and the device is never reopened. Change the return
value from TCMU_STS_FENCED to TCMU_STS_TIMEOUT so that the
device is put on the recovery list.

Signed-off-by: 李明辉10144360 <li.minghui7@zte.com.cn>


mikechristie · 2019-03-02T03:34:40Z

I didn't understand the problem description. Did this patch:

commit 08e3a0e
Author: Mike Christie <mchristi@redhat.com>
Date: Tue Sep 11 18:48:13 2018 -0500

runner: don't drop iscsi connection on lock fence errors

cause a bug for you?

We can't go into the if block that calls tcmu_notify_conn_lost, because it causes path bouncing, which can end up with the command retries being used up and paths not being added.

In your case, is the following happening:

1. TCMU_STS_FENCED is returned to tcmu_explicit_transition and that gets translated to a SCSI BUSY status.
2. The initiator gets BUSY and retries.
3. tcmu-runner gets the STPG and calls tcmu_acquire_dev_lock again. It sees that TCMUR_DEV_FLAG_IS_OPEN is not set, so it calls tcmu_reopen_dev.
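
To make that sequence concrete, here is a rough, self-contained model of it in C. This is only a sketch and not the tcmu-runner code: every `sketch_*` name, both enums, and the `reopen_failures` counter are hypothetical stand-ins for tcmu_reopen_dev, tcmu_acquire_dev_lock, the STPG handler, and the initiator's retry loop.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the real tcmu-runner definitions. */
enum sketch_sts { STS_OK, STS_FENCED };
enum scsi_status { SCSI_GOOD, SCSI_BUSY };

struct sketch_dev {
    bool is_open;         /* models TCMUR_DEV_FLAG_IS_OPEN */
    int reopen_failures;  /* assumption: how many more reopen attempts fail */
};

/* Models tcmu_reopen_dev: fails while reopen_failures > 0. */
static bool sketch_reopen(struct sketch_dev *dev)
{
    if (dev->reopen_failures > 0) {
        dev->reopen_failures--;
        return false;
    }
    dev->is_open = true;
    return true;
}

/* Models tcmu_acquire_dev_lock: try a reopen first if the device is closed. */
static enum sketch_sts sketch_acquire_lock(struct sketch_dev *dev)
{
    if (!dev->is_open && !sketch_reopen(dev))
        return STS_FENCED;
    return STS_OK;
}

/* Models the STPG path: FENCED is reported to the initiator as BUSY. */
static enum scsi_status sketch_handle_stpg(struct sketch_dev *dev)
{
    return sketch_acquire_lock(dev) == STS_FENCED ? SCSI_BUSY : SCSI_GOOD;
}

/*
 * Models the initiator: resend the STPG on BUSY, up to max_retries times.
 * Returns the number of STPGs sent, or -1 if the retries ran out.
 */
static int sketch_initiator_send_stpg(struct sketch_dev *dev, int max_retries)
{
    for (int sent = 1; sent <= max_retries + 1; sent++) {
        if (sketch_handle_stpg(dev) == SCSI_GOOD)
            return sent;
    }
    return -1;
}
```

In this model, a device whose reopen fails twice succeeds on the third STPG, while a device whose reopen keeps failing returns BUSY until the initiator gives up.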

In your case, is the initiator not retrying the BUSY status? Is it because the initiator does not retry for that SCSI status code, or is it being retried so many times that the cmd has run out of retries? If either of those, what is the initiator OS?

lmh10144360 · 2019-03-02T05:57:23Z

Thanks for the reply!
Yes, it is being retried, but when the command timeout is reached at the SCSI level,
the command is completed as failed.
With CentOS 7.4 as the initiator OS, the timeout is set to 360 s. If the STPG does not
return OK within 360 s, the command fails and is not retried anymore.

mikechristie · 2019-03-13T03:28:46Z

Hey, sorry for the late reply.

It looks like you have the same problem with TCMU_STS_TIMEOUT. Won't you have limited retries for that case too?

What is your pg_init_retries set to on the initiator side in /etc/multipath.conf?

lxbsz changed the base branch from master to main August 10, 2022 00:22
