
Check "failmode wait" documentation and behavior #9395

Closed
arturpzol opened this issue Oct 2, 2019 · 7 comments
Labels
Type: Defect (incorrect behavior, e.g. crash, hang), Type: Documentation (indicates a requested change to the documentation)

Comments

@arturpzol

System information

Type Version/Name
Distribution Name Debian
Distribution Version Jessie
Linux Kernel 4.4
Architecture x86_64
ZFS Version 0.8.1
SPL Version 0.8.1

Describe the problem you're observing

I am trying to configure a zpool on an s3backer filesystem using a loopback device, and I have one ambiguity with the failmode property.

According to the manual, with failmode=wait ZFS blocks all I/O access until device connectivity is recovered and the errors are cleared. Unfortunately, this does not happen.

When the connection with S3 is broken (a rather common occurrence with S3 connectivity), the zpool goes into the suspended state after a few minutes.

WARNING: Pool 'S3' has encountered an uncorrectable I/O failure and has been suspended.

I have also tried configuring all of the zio_deadman parameters with different values, but the effect is the same.

The only way out of the suspended state known to me is zpool clear after the S3 connectivity is restored, but that is only possible when multihost is disabled.

Are there any other parameters I can set so that failmode=wait waits indefinitely without suspending the zpool?
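For reference, the suspension can be detected from the pool state kstat that ZFS 0.8+ exposes under /proc/spl/kstat/zfs/<pool>/state. This is only a minimal sketch; the ZFS_KSTAT_DIR override is purely hypothetical (not a real ZFS knob), added so the helper can be exercised without a live pool:

```shell
# Minimal sketch: detect pool suspension by reading the state kstat
# (ZFS >= 0.8). ZFS_KSTAT_DIR is a hypothetical override for dry-running;
# real use leaves it unset so the default /proc path is read.
pool_state() {
    # $1 = pool name; prints ONLINE / UNAVAIL / SUSPENDED, or UNKNOWN
    local f="${ZFS_KSTAT_DIR:-/proc/spl/kstat/zfs}/$1/state"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo UNKNOWN
    fi
}

is_suspended() {
    # true only when the pool state is exactly SUSPENDED
    [ "$(pool_state "$1")" = "SUSPENDED" ]
}
```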

@arturpzol
Author

@kpande are you able to recommend any other project which exposes S3 as a virtual block device and is flexible for ZFS?

@ahmgithubahm

ahmgithubahm commented Oct 15, 2019

@kpande there's only one open bug on the s3backer repo. What are the (incredible number of) other bugs? It looks useful, but it would be good to know what problems you've had, using it with ZFS.

@behlendorf added the Type: Question label Oct 24, 2019
@arturpzol
Author

I think that s3backer is not the cause of the pool suspension.

If we remove all disks from a pool with failmode=wait, ZFS should block all I/O access until device connectivity is recovered and the errors are cleared, but unfortunately this does not happen:

```
zpool status -L
  pool: Pool-0
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Pool-0      ONLINE       0     0     0
          sda       ONLINE       0     0     0

errors: No known data errors
```

```
echo 1 > /sys/block/sda/device/delete

cat /proc/spl/kstat/zfs/Pool-0/state
UNAVAIL
```

and after a few seconds:

```
cat /proc/spl/kstat/zfs/Pool-0/state
SUSPENDED
```

```
dmesg
[ 1259.945756] scst: Detached from scsi1, channel 0, id 0, lun 0, type 0
[ 1259.946707] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 1259.946840] sd 1:0:0:0: [sda] Stopping disk
[ 1260.527080] zio pool=Pool-0 vdev=/dev/disk/by-id/wwn-0x5000cca3a8d2fa22-part1 error=5 type=2 offset=1000194437120 size=4096 flags=180ac0
[ 1260.527130] zio pool=Pool-0 vdev=/dev/disk/by-id/wwn-0x5000cca3a8d2fa22-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[ 1260.527134] zio pool=Pool-0 vdev=/dev/disk/by-id/wwn-0x5000cca3a8d2fa22-part1 error=5 type=1 offset=1000194187264 size=8192 flags=b08c1
[ 1260.527138] zio pool=Pool-0 vdev=/dev/disk/by-id/wwn-0x5000cca3a8d2fa22-part1 error=5 type=1 offset=1000194449408 size=8192 flags=b08c1
[ 1260.667996] ata2.00: disabled
[ 1290.818430] WARNING: Pool 'Pool-0' has encountered an uncorrectable I/O failure and has been suspended.
```

```
zpool reopen
cannot reopen 'Pool-0': pool I/O is currently suspended
```

and even if device connectivity is recovered, the pool is still suspended:

```
echo "- - -" > /sys/class/scsi_host/host1/scan
```

```
dmesg
[ 1707.719391] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1707.720538] ata2.00: ATA-8: Hitachi HUA722010CLA330, JP4OA3EA, max UDMA/133
[ 1707.720541] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[ 1707.721878] ata2.00: configured for UDMA/133
[ 1707.722002] scsi 1:0:0:0: Direct-Access     ATA      Hitachi HUA72201 A3EA PQ: 0 ANSI: 5
[ 1707.722173] sd 1:0:0:0: Attached scsi generic sg0 type 0
[ 1707.722185] scst: Attached to scsi1, channel 0, id 0, lun 0, type 0
[ 1707.722291] sd 1:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[ 1707.722339] sd 1:0:0:0: [sda] Write Protect is off
[ 1707.722340] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 1707.722406] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
```

```
zpool reopen
cannot reopen 'Pool-0': pool I/O is currently suspended

cat /proc/spl/kstat/zfs/Pool-0/state
SUSPENDED
```

so I think failmode=wait does not behave according to the zpool manual.
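As a possible workaround (not a fix for the documented behavior), the state kstat shown above could be polled and zpool clear issued once the pool reports SUSPENDED and the device is reachable again. This is only a sketch; KSTAT_DIR and ZPOOL are hypothetical overrides for dry-running, not real ZFS knobs:

```shell
# Workaround sketch: if the pool state kstat reads SUSPENDED, attempt
# `zpool clear` to resume I/O. KSTAT_DIR and ZPOOL are hypothetical
# overrides so the logic can be tested without a real pool.
clear_if_suspended() {
    local pool="$1"
    local state
    state=$(cat "${KSTAT_DIR:-/proc/spl/kstat/zfs}/$pool/state" 2>/dev/null)
    if [ "$state" = "SUSPENDED" ]; then
        ${ZPOOL:-zpool} clear "$pool"
        return 0
    fi
    return 1
}
```

In a real deployment this would run in a loop with a sleep, and only after the underlying device is reachable again; note that, as mentioned earlier in the thread, zpool clear is only possible when multihost is disabled.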

@arturpzol
Author

I have also tried to simulate SCSI delay using scsi_debug module:

```
modprobe scsi_debug dev_size_mb=5000
echo 0 > /sys/module/scsi_debug/parameters/delay

zpool create Pool-0-scsi_debug /dev/sdc

zpool get failmode Pool-0-scsi_debug
NAME               PROPERTY  VALUE     SOURCE
Pool-0-scsi_debug  failmode  wait      default

cat /proc/spl/kstat/zfs/Pool-0-scsi_debug/state
ONLINE

echo 30000 > /sys/module/scsi_debug/parameters/delay
```

and after some time the pool is suspended, so failmode=wait does not wait forever while blocking all I/O access:

```
cat /proc/spl/kstat/zfs/Pool-0-scsi_debug/state
SUSPENDED

WARNING: Pool 'Pool-0-scsi_debug' has encountered an uncorrectable I/O failure and has been suspended.
```

@tabulon

tabulon commented Apr 21, 2020

Based on previous comments from @arturpzol on this issue, perhaps a more suitable title would be "failmode=wait does not behave as documented", since the issue appears to be reproducible with or without s3backer being involved.

@behlendorf changed the title from "Zpool on s3backer filesystem using a loopback device and wait failmode" to "Check "failmode wait" documentation and behavior" Dec 21, 2020
@behlendorf added the Type: Defect and Type: Documentation labels and removed the Type: Question label Dec 21, 2020
@stale

stale bot commented Dec 22, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Dec 22, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 23, 2021
Nowhere in the description of the failmode property does it
clearly state how to bring a suspended pool back online.
Add a few words to property description and the zpool-clear(8)
man page.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#9395
@behlendorf
Contributor

@arturpzol I believe this is a matter of documentation. It is possible to resume the pool in your scsi_debug example above, but there are two additional steps.

1. The scsi_debug device needs to be brought back online.
```
$ echo 0 > /sys/module/scsi_debug/parameters/delay
$ cat /sys/block/sda/device/state
offline
$ echo running > /sys/block/sda/device/state
```
2. The zpool clear command is what should be used to resume a suspended pool.
```
$ zpool clear Pool-0-scsi_debug
```

I've opened #12907 to try and clarify the documentation regarding using zpool clear to bring the pool back online.
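The two steps above could be wrapped in a small helper. This is only a sketch of the sequence described in this comment; SYS_ROOT and ZPOOL are hypothetical overrides (not real kernel or ZFS interfaces) so the sequence can be dry-run, and real use would leave them unset:

```shell
# Sketch of the recovery sequence above: bring the block device back
# online, then clear pool errors. SYS_ROOT and ZPOOL are hypothetical
# overrides for dry-running only.
resume_suspended_pool() {
    local pool="$1" dev="$2"
    # Step 1: mark the SCSI device running again (for the scsi_debug
    # case, the module delay would also be reset to 0 first).
    echo running > "${SYS_ROOT:-}/sys/block/$dev/device/state"
    # Step 2: `zpool clear` is what resumes I/O on a suspended pool.
    ${ZPOOL:-zpool} clear "$pool"
}
```

For example, `resume_suspended_pool Pool-0-scsi_debug sda` would perform both steps from the comment above in order.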

@stale stale bot removed the Status: Stale No recent activity for issue label Dec 23, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Feb 10, 2022
Nowhere in the description of the failmode property does it
clearly state how to bring a suspended pool back online.
Add a few words to property description and the zpool-clear(8)
man page.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#12907
Closes openzfs#9395
nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
Nowhere in the description of the failmode property does it
clearly state how to bring a suspended pool back online.
Add a few words to property description and the zpool-clear(8)
man page.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#12907
Closes openzfs#9395