
Introduce module parameters for handling IO timeouts (WIP) #2387

Closed · wants to merge 1 commit

Conversation

@ryao (Contributor) commented Jun 11, 2014

The low-level formatting of magnetic drives contains ECC that permits
error correction. When a drive reads a bad sector, it will keep retrying
the read until an internal timeout is reached, in the hope that a
different attempt will allow the ECC to reconstruct the bad data. This
is useful on filesystems that cannot recover from drive errors, but it
is unnecessary for ZFS, which can easily self-heal when errors occur.

We introduce a tunable, zfs_vdev_disk_timeout_ms, that tells the block
layer to return errors to ZFS earlier than it normally would. We also
introduce zfs_vdev_disk_timeout_ticks, the maximum number of
consecutive timeouts before a disk is marked as bad.

Signed-off-by: Richard Yao <ryao@gentoo.org>
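If merged, these tunables would presumably be adjusted like any other ZFS module parameter. A hedged sketch — the parameter names come from this (unmerged) patch and do not exist in stock ZFS builds, and the values are purely illustrative:

```shell
# Ask the block layer to fail hung I/O after 5 seconds instead of its
# default (typically ~30s), and fault the vdev after 10 consecutive
# timeouts. Both names are from this patch, not upstream ZFS.
echo 5000 > /sys/module/zfs/parameters/zfs_vdev_disk_timeout_ms
echo 10   > /sys/module/zfs/parameters/zfs_vdev_disk_timeout_ticks

# Or persistently at module load time, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_vdev_disk_timeout_ms=5000 zfs_vdev_disk_timeout_ticks=10
```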

@FransUrbo (Contributor)

@ryao This sounds like something that could be useful in all cases - have ZFS deal with errors/problems instead of the disk?

@DeHackEd (Contributor)

I disagree with calling the number of failures "ticks" as that implies something time related and is often associated with the system clock or the HZ value.

@behlendorf (Contributor)

@ryao I like the idea of failing drives more aggressively and using the block layer error handlers to do it. However, I agree with the above comments about using "ticks" in the name of the failure counter. That typically implies scheduler timing ticks which is misleading in this context.

I also think we could make the vdev_disk_time_out_handler a bit smarter since we have access to the vdev and therefore know a little bit about the pool topology. Failing the disk more aggressively is definitely a good idea as long as it is part of either a mirror or raidz device. You wouldn't want to do this for a non-redundant disk. You also wouldn't want to do this if the mirror or raidz device is already degraded. Those are both things we should be able to determine in the error handler. There may be other interesting cases as well.

To test your change you may want to look at the scsi_debug module. It provides a handy way to introduce delays into a simulated SCSI device. That should allow you to trigger this code at will and make sure it works properly.

@behlendorf behlendorf added this to the 0.6.4 milestone Jun 11, 2014
@behlendorf behlendorf added Bug and removed Bug labels Jun 11, 2014
@behlendorf behlendorf changed the title Introduce module parameters for handling IO timeouts Introduce module parameters for handling IO timeouts (WIP) Jun 18, 2014
@ZNikke commented Jun 26, 2014

This is a really good idea. I'm actually amazed that this wasn't implemented in zfs from the start.

I suspect that the assumption from the beginning was that everyone would use raid-edition storage devices, which have a rather short timeout (~7s IIRC).

However, after having had to debug I/O-hang issues on several systems with expensive raid-edition HDDs, I can say for sure that you can't rely on the timeout to actually work when shit hits the fan. In most cases there was a flaky HDD with buggy firmware, so instead of failing the request after ~7s it kept doing recovery and eventually succeeded.

So, I fully approve of this idea. It would make the world a better place, etc :-)

However, to be perfect you would also have to handle the case of a storage device repeatedly doing error recovery that returns success rather quickly. Such a device would cause an IO slowdown, but not severe enough for the system to appear hung. Failing a drive because of this might be a bit harsh, but perhaps a pre/soft-fail state or just a separate slow-io-counter that increases just like the read/write/checksum error counters? Detection is also a bit of an issue, instead of hard-coded timeouts you might want to do some light statistics instead...

@ryao (Contributor, Author) commented Jul 2, 2014

@ZNikke If the error recovery is being done on a root pool, the system will have terrible responsiveness issues. In that case, the drive should be considered defective.

That being said, we need a reliable way to set CCTL/TLER/ERC on drives that support it when ZFS has the whole disk. That is something that I would like to explicitly add to Linux's block layer when I find time.
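For reference, on drives that support SCT Error Recovery Control this can already be set manually with smartmontools; a sketch (the device path is illustrative, many consumer drives don't support it, and some reset the setting on power cycle):

```shell
# Query current SCT ERC (error recovery control) settings, if supported.
smartctl -l scterc /dev/sda

# Limit internal read/write error recovery to 7.0 seconds each
# (the arguments are in units of 100 ms). Reapply at boot as needed.
smartctl -l scterc,70,70 /dev/sda
```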

@DeHackEd (Contributor)

It seems the bd_queue field of struct block_device was first introduced in kernel v3.3-rc which puts it out of the running for CentOS 6 users. :/

@DeHackEd (Contributor)

A few days ago @ryao and I were discussing how to actually test this. I suggested an iSCSI target followed by yanking the network cable (figuratively). ryao suggested a device-mapper target which could be suspended with dmsetup suspend $mappertarget. I ended up trying both, and then trying them together using an iSCSI target backed by the device mapper, which I could suspend. The idea being that it could keep the target itself responsive while blocking the I/O.

And then this happened: (zfs not present in backtrace)

 connection8:0: detected conn error (1021)
 connection8:0: detected conn error (1002)
------------[ cut here ]------------
kernel BUG at block/blk-softirq.c:112!
invalid opcode: 0000 [#1] SMP 
Modules linked in: sd_mod crc_t10dif autofs4 ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) zlib_deflate ppdev microcode pcspkr parport_pc parport virtio_net i2c_piix4 i2c_core virtio_blk(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F) [last unloaded: floppy]
CPU: 1 PID: 822 Comm: iscsid Tainted: PF          O 3.10.53 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88003cea90c0 ti: ffff88003daf6000 task.ti: ffff88003daf6000
RIP: 0010:[<ffffffff812882fc>]  [<ffffffff812882fc>] __blk_complete_request+0x11c/0x130
RSP: 0018:ffff88003daf7888  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880039d7cba8 RCX: 0000000000000000
RDX: ffff88003c8dc7f8 RSI: ffff88003daf7908 RDI: ffff88003c8ea1d0
RBP: ffff88003daf78a8 R08: ffff880020408dc0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff880020408dc0
R13: ffff88003c8ea1d0 R14: 0000000000000005 R15: 000000000000000e
FS:  00007ffa6b473700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004e8001 CR3: 000000003dae9000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff880039d7cba8 ffff880020408dc0 ffff88003d5182d0 0000000000000005
 ffff88003daf78b8 ffffffff81288335 ffff88003daf78d8 ffffffff813b0a76
 ffff880039d7cba8 ffff880020408dc0 ffff88003daf7928 ffffffffa02c6052
Call Trace:
 [<ffffffff81288335>] blk_complete_request+0x25/0x30
 [<ffffffff813b0a76>] scsi_done+0x26/0x60
 [<ffffffffa02c6052>] __iscsi_put_task+0xb2/0x110 [libiscsi]
 [<ffffffff8105cb14>] ? wake_up_worker+0x24/0x30
 [<ffffffffa02c614c>] iscsi_complete_task+0x9c/0x140 [libiscsi]
 [<ffffffff8105ed60>] ? __queue_work+0x160/0x360
 [<ffffffffa02c626c>] fail_scsi_task+0x7c/0xb0 [libiscsi]
 [<ffffffffa02c632e>] fail_scsi_tasks+0x8e/0xe0 [libiscsi]
 [<ffffffffa02c7bac>] iscsi_conn_stop+0x11c/0x2c0 [libiscsi]
 [<ffffffffa02dcc18>] iscsi_sw_tcp_conn_stop+0x78/0x90 [iscsi_tcp]
 [<ffffffffa02b1a12>] iscsi_if_rx+0x342/0x11e0 [scsi_transport_iscsi]
 [<ffffffff8114e96c>] ? __kmalloc_node_track_caller+0x3c/0x50
 [<ffffffff8146089b>] ? __kmalloc_reserve+0x3b/0xa0
 [<ffffffff8149b4d9>] netlink_unicast+0x169/0x1e0
 [<ffffffff8149c33b>] netlink_sendmsg+0x23b/0x340
 [<ffffffff814587e6>] sock_sendmsg+0xa6/0xd0
 [<ffffffff8145667e>] ? move_addr_to_kernel+0x4e/0x90
 [<ffffffff814655fd>] ? verify_iovec+0x8d/0x110
 [<ffffffff81459b41>] ___sys_sendmsg+0x421/0x440
 [<ffffffff8117b8d9>] ? mntput_no_expire+0x49/0x150
 [<ffffffff8117ba03>] ? mntput+0x23/0x40
 [<ffffffff8115f001>] ? __fput+0x191/0x270
 [<ffffffff8102fd3f>] ? kvm_clock_read+0x1f/0x30
 [<ffffffff8102fd59>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff81459d79>] __sys_sendmsg+0x49/0x90
 [<ffffffff81459dd9>] SyS_sendmsg+0x19/0x20
 [<ffffffff815475d9>] system_call_fastpath+0x16/0x1b
Code: af 66 0f 1f 44 00 00 89 de 44 89 f7 e8 5e 6c de ff e9 65 ff ff ff 66 0f 1f 84 00 00 00 00 00 bf 04 00 00 00 e8 f6 10 dc ff eb 85 <0f> 0b 66 90 eb fc 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 
RIP  [<ffffffff812882fc>] __blk_complete_request+0x11c/0x130
 RSP <ffff88003daf7888>
---[ end trace 0fa9bfdd11a09fb5 ]---
Kernel panic - not syncing: Fatal exception in interrupt

I don't know if my testing method is flawed or if the patch isn't working. Regardless I am having trouble validating that this works.

@DeHackEd (Contributor)

Sure, now it's working...

# cp randomdata2 /dev/null
cp: overwrite `/dev/null'? y
vdev_disk: timeout tick
vdev_disk: timeout tick
vdev_disk: timeout tick
vdev_disk: timeout tick
vdev_disk: too many ticks
vdev_disk: too many ticks
vdev_disk: too many ticks
vdev_disk: too many ticks
 connection1:0: detected conn error (1021)

The vdev_disk errors are just printk() calls I added to the patch in the vdev_disk_time_out_handler function. The I/O doesn't fail exactly on schedule according to what you would expect from the _ms and _tick variable values, but low numbers at least make the system unwedge itself without waiting TOO long.

I've had more stack dumps but it looks like it's iSCSI bugs. :/

@behlendorf (Contributor)

@DeHackEd I've used the scsi_debug kernel module in the past for this sort of testing. It allows you to easily simulate a misbehaving device by doing things like acknowledging reads/writes as good and then throwing out the data (fake_rw). Or in this case you could use the every_nth option to force every Nth SCSI command to result in a timeout.

Thanks for taking the time to look into testing this. If @ryao gets a chance to refresh the patch we'll be able to get the build results for various kernels and see how portable this is. A quick look at the kernel source suggests the core functionality goes all the way back to 2.6.28.

http://sg.danny.cz/sg/sdebug26.html
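A sketch of what that setup might look like — sizes and option values are illustrative, so check the scsi_debug documentation linked above before relying on them:

```shell
# Create a simulated SCSI disk. If I read the docs right, opts=4 sets the
# "timeout" behaviour, so every every_nth-th command is silently dropped
# and left to time out in the block layer.
modprobe scsi_debug dev_size_mb=256 every_nth=100 opts=4

# Alternatively, acknowledge writes but discard the data (a lying drive):
#   modprobe scsi_debug dev_size_mb=256 fake_rw=1

# The device appears as an ordinary /dev/sdX and can be put in a test pool.
modprobe -r scsi_debug	# tear down when finished
```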

@DeHackEd (Contributor)

# zpool import
   pool: mailstore00
     id: 13039248741391054330
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:
    mailstore00  ONLINE
      sda       ONLINE

This is a real (bare metal) machine and this disk is a piece of crap and hangs when it feels like it. It happened while I was scrubbing.

# zpool events
TIME                           CLASS
Aug 27 2014 15:08:44.116844970 resource.fs.zfs.statechange
Aug 27 2014 15:08:44.478534312 ereport.fs.zfs.config.sync
Aug 27 2014 15:08:44.621353165 ereport.fs.zfs.config.sync
Aug 27 2014 15:50:44.353613618 ereport.fs.zfs.scrub.start
Aug 27 2014 15:51:56.539063380 ereport.fs.zfs.scrub.finish
Aug 29 2014 08:11:53.960893686 ereport.fs.zfs.scrub.start
Aug 29 2014 08:14:14.630557396 ereport.fs.zfs.io
Aug 29 2014 08:14:14.630588655 ereport.fs.zfs.io
Aug 29 2014 08:14:14.630633113 ereport.fs.zfs.io
Aug 29 2014 08:14:14.630659189 ereport.fs.zfs.io
Aug 29 2014 08:30:51.875099140 ereport.fs.zfs.delay
Aug 29 2014 08:47:35.395208849 ereport.fs.zfs.delay

Other info

# echo w > /proc/sysrq-trigger
# dmesg
...
...
sd 0:0:20:0: [sda]  
Sense Key : No Sense [current] 
sd 0:0:20:0: [sda]  
Add. Sense: No additional sense information
sd 0:0:20:0: [sda]  
Sense Key : No Sense [current] 
sd 0:0:20:0: [sda]  
Add. Sense: No additional sense information
sd 0:0:20:0: [sda]  
Sense Key : No Sense [current] 
sd 0:0:20:0: [sda]  
Add. Sense: No additional sense information
sd 0:0:20:0: [sda]  
Sense Key : No Sense [current] 
sd 0:0:20:0: [sda]  
Add. Sense: No additional sense information
SysRq : Show Blocked State
  task                        PC stack   pid father
txg_sync        D ffff88047fc31940     0  6749      2 0x00000080
 ffff8804695d4ad0 0000000000000046 ffff8804374bc280 0000000000011940
 0000000000011940 ffff8800736b2010 0000000000000202 ffff8800735f8790
 ffff8800736b3c18 ffff8800735f8760 0000000000000002 0000000000000000
Call Trace:
 [] ? cv_wait_common+0x125/0x1b0 [spl]
 [] ? bit_waitqueue+0xc0/0xc0
 [] ? ktime_get_ts+0x3d/0xe0
 [] ? dsl_scan_sync+0x5d3/0x760 [zfs]
 [] ? spa_sync+0x398/0x840 [zfs]
 [] ? __wake_up+0x43/0x70
 [] ? ktime_get_ts+0x3d/0xe0
 [] ? txg_sync_thread+0x35b/0x4c0 [zfs]
 [] ? set_user_nice+0xfb/0x190
 [] ? txg_dispatch_callbacks+0x150/0x150 [zfs]
 [] ? __thread_create+0x360/0x360 [spl]
 [] ? thread_generic_wrapper+0x75/0x90 [spl]
 [] ? kthread+0xce/0xf0
 [] ? kthread_freezable_should_stop+0x60/0x60
 [] ? ret_from_fork+0x7c/0xb0
 [] ? kthread_freezable_should_stop+0x60/0x60
# echo b > /proc/sysrq-trigger

Is scrub/resilver I/O somehow exempt from this? I've added printk() calls around the vdev_disk_time_out_handler function but they don't seem to be firing. Default module parameters (1000 ms alerts, 30 alerts = failure).

Who needs debugging software when you have real defective hardware? 👎

@behlendorf (Contributor)

@DeHackEd In this case it looks like vdev_disk_time_out_handler didn't need to run because the zio's did return errors reasonably quickly. What is strange is that the txg_sync thread is blocked waiting for the scrub IO. It should be woken even on error in dsl_scan_scrub_done.

@FransUrbo (Contributor)

@ryao what state is this pull request in?

@FransUrbo (Contributor)

@ryao Just tried compiling this on a 3.2.0-4-amd64 kernel and I get:

/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c: In function 'vdev_disk_open':
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c:318:26: error: 'struct block_device' has no member named 'bd_queue'
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c:319:6: error: 'struct block_device' has no member named 'bd_queue'
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c:321:31: error: 'struct block_device' has no member named 'bd_queue'
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c:322:6: error: 'struct block_device' has no member named 'bd_queue'
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c: In function 'vdev_disk_close':
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c:358:7: error: 'struct block_device' has no member named 'bd_queue'
/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.c:359:7: error: 'struct block_device' has no member named 'bd_queue'
make[7]: *** [/usr/src/zfs/module/zfs/../../module/zfs/vdev_disk.o] Error 1

Works just fine on my production 3.17.0-rc5...

@DeHackEd (Contributor)

git blame on the kernel suggests this feature isn't available until v3.3-rc1. This should be tested for in the configure script.

torvalds/linux@87192a2

@DeHackEd (Contributor) commented Oct 9, 2014

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffffa02452d8>] vdev_disk_time_out_handler+0x8/0x60 [zfs]
PGD 0 
Oops: 0000 [#1] SMP 
Modules linked in: netconsole bridge stp llc configfs cpufreq_ondemand nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi binfmt_misc dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support microcode serio_raw zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) sr_mod cdrom i2c_i801 lpc_ich mfd_core sg tpm_tis tpm ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core hwmon acpi_cpufreq usb_storage ext3 jbd megaraid_sas [last unloaded: netconsole]
CPU: 7 PID: 0 Comm: swapper/7 Tainted: P           O 3.14.19 #1
Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.0a       09/28/10  
task: ffff88042953f710 ti: ffff880429576000 task.ti: ffff880429576000
RIP: 0010:[<ffffffffa02452d8>]  [<ffffffffa02452d8>] vdev_disk_time_out_handler+0x8/0x60 [zfs]
RSP: 0018:ffff88043fdc3e28  EFLAGS: 00010096
RAX: 0000000000000000 RBX: ffff8803f688cc00 RCX: ffff880427fa9f70
RDX: ffff880427fa9f70 RSI: ffff88043fdc3e58 RDI: ffff8803f688cc00
RBP: ffff880427fa9a80 R08: 0000000000000000 R09: 0000000000000100
R10: ffffffff814b0365 R11: 0000000000000010 R12: ffff880427fa9e20
R13: ffff880427fa9f70 R14: ffff8804295f1428 R15: ffff88043fdc3e64
FS:  0000000000000000(0000) GS:ffff88043fdc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000001619000 CR4: 00000000000007e0
Stack:
 0000000000000007 ffffffff81234cd3 ffff88043fdd1040 ffff880427fa9a80
 0000000000000292 ffffffff81234ece 0000000000000000 000000002f00ab10
 ffff88043fdc3e68 0000000000000100 ffffffff81234e60 00000000000006b0
Call Trace:
 <IRQ> 
 [<ffffffff81234cd3>] ? blk_rq_timed_out+0x23/0xe0
 [<ffffffff81234ece>] ? blk_rq_timed_out_timer+0x6e/0xd0
 [<ffffffff81234e60>] ? blk_rq_check_expired+0x80/0x80
 [<ffffffff81062644>] ? call_timer_fn+0x24/0x90
 [<ffffffff81062881>] ? run_timer_softirq+0x1d1/0x1e0
 [<ffffffff8105cb9d>] ? __do_softirq+0xed/0x210
 [<ffffffff8105cdbd>] ? irq_exit+0xbd/0xd0
 [<ffffffff8102e22b>] ? smp_apic_timer_interrupt+0x3b/0x50
 [<ffffffff814a104a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI> 
 [<ffffffff813b2ec8>] ? cpuidle_enter_state+0x48/0xd0
 [<ffffffff813b2ec4>] ? cpuidle_enter_state+0x44/0xd0
 [<ffffffff813b2fff>] ? cpuidle_idle_call+0xaf/0x130
 [<ffffffff8100bec9>] ? arch_cpu_idle+0x9/0x30
 [<ffffffff81098492>] ? cpu_idle_loop+0x92/0x170
Code: 89 ee 4c 89 e7 e8 89 f8 ff ff 48 83 c4 08 4c 89 ef 5b 5d 41 5c 41 5d e9 f7 83 25 e1 90 90 90 90 90 90 90 48 83 ec 08 48 8b 47 70 <48> 8b 40 50 48 8b 50 28 48 8b 92 10 02 00 00 48 8b 52 60 c7 40 
RIP  [<ffffffffa02452d8>] vdev_disk_time_out_handler+0x8/0x60 [zfs]
 RSP <ffff88043fdc3e28>
CR2: 0000000000000050
---[ end trace 2c2b459c672893c1 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

My tree: https://github.com/DeHackEd/zfs/commits/dehacked
Contains a pretty up-to-date ZFS base, ABD patches, and a small collection of patches including this one at the top.

Events leading up to the crash:

# zpool status
  pool: testsan
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h16m with 0 errors on Fri Aug  1 14:21:16 2014
config:
    NAME                                        STATE     READ WRITE CKSUM
    testsan                                     DEGRADED     0     0     0
      mirror-0                                  DEGRADED     0     0     0
        scsi-36003048000848ed01b17aafb0a2e1708  ONLINE       0     0     0
        scsi-1IET_00010001                      UNAVAIL      0     0     0
      mirror-1                                  DEGRADED     0     0     0
        scsi-36003048000848ed01b17aafb0a346114  ONLINE       0     0     0
        scsi-1IET_00010002                      UNAVAIL      0     0     0
      mirror-2                                  DEGRADED     0     0     0
        scsi-36003048000848ed01b17aafc0a3b56b3  ONLINE       0     0     0
        scsi-1IET_00010003                      UNAVAIL      0     0     0
      mirror-3                                  DEGRADED     0     0     0
        scsi-36003048000848ed01b17aafc0a429e92  ONLINE       0     0     0
        scsi-1IET_00010004                      UNAVAIL      0     0     0
      mirror-4                                  DEGRADED     0     0     0
        scsi-36003048000848ed01b17aafd0a4a8100  ONLINE       0     0     0
        scsi-1IET_00010005                      UNAVAIL      0     0     0
      mirror-5                                  DEGRADED     0     0     0
        scsi-36003048000848ed01b17aafd0a530c4c  ONLINE       0     0     0
        scsi-1IET_00010006                      UNAVAIL      0     0     0
errors: No known data errors
# zpool online testsan scsi-1IET_00010001
# zpool online testsan scsi-1IET_00010002
[crash]

@DeHackEd (Contributor) commented Oct 9, 2014

I added some VERIFY() clauses at the top of vdev_disk_time_out_handler and hit this:

# zpool import testsan
VERIFY(req->bio != NULL) failed
SPLError: 0:0:(vdev_disk.c:115:vdev_disk_time_out_handler()) SPL PANIC
Kernel panic - not syncing: SPL PANIC in interrupt.
CPU: 7 PID: 0 Comm: swapper/7 Tainted: P           O 3.14.19 #1
Hardware name: Supermicro X8DTU/X8DTU, BIOS 2.0a       09/28/10  
 ffffffffa05fa370 ffffffffa060ee78 ffffffff8149b943 ffffffffa0147c32
 ffffffff8149b6f0 0000000000000008 ffff88043fdc3c68 ffff88043fdc3c08
 0000000000000073 ffffffffa05fa370 0000000000000082 0000000000000007
Call Trace:
 <IRQ>  [<ffffffff8149b943>] ? dump_stack+0x41/0x56
 [<ffffffff8149b6f0>] ? panic+0xbc/0x1c4
 [<ffffffffa0133f55>] ? spl_debug_bug+0xb5/0xd0 [spl]
 [<ffffffffa0140481>] ? spl_PANIC+0xb1/0xe0 [spl]
 [<ffffffff8108bf51>] ? enqueue_entity+0x1c1/0x8c0
 [<ffffffff81088624>] ? update_rq_runnable_avg+0xd4/0x1d0
 [<ffffffff8108c993>] ? enqueue_task_fair+0x343/0x450
 [<ffffffff8106dedc>] ? wq_worker_waking_up+0xc/0x60
 [<ffffffff8107d7f5>] ? check_preempt_curr+0x65/0x90
 [<ffffffff8107d831>] ? ttwu_do_wakeup+0x11/0x80
 [<ffffffffa05a14e6>] ? vdev_disk_time_out_handler+0xf6/0x130 [zfs]
 [<ffffffff81234cd3>] ? blk_rq_timed_out+0x23/0xe0
 [<ffffffff81234ece>] ? blk_rq_timed_out_timer+0x6e/0xd0
 [<ffffffff81234e60>] ? blk_rq_check_expired+0x80/0x80
 [<ffffffff81062644>] ? call_timer_fn+0x24/0x90
 [<ffffffff81062881>] ? run_timer_softirq+0x1d1/0x1e0
 [<ffffffff8105cb9d>] ? __do_softirq+0xed/0x210
 [<ffffffff8105cdbd>] ? irq_exit+0xbd/0xd0
 [<ffffffff8102e22b>] ? smp_apic_timer_interrupt+0x3b/0x50
 [<ffffffff814a104a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI>  [<ffffffff813b2ec8>] ? cpuidle_enter_state+0x48/0xd0
 [<ffffffff813b2ec4>] ? cpuidle_enter_state+0x44/0xd0
 [<ffffffff813b2fff>] ? cpuidle_idle_call+0xaf/0x130
 [<ffffffff8100bec9>] ? arch_cpu_idle+0x9/0x30
 [<ffffffff81098492>] ? cpu_idle_loop+0x92/0x170
 [<ffffffff81098360>] ? SyS_syslog+0x20/0x20
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)

@behlendorf (Contributor)

Yes, that makes sense. It's entirely possible for requests not to have bios associated with them. The patch will need to be updated to handle this case.

@DeHackEd (Contributor)

Do you know how that could be done?

I'm kinda pressed to get this feature added, even if it's in a private repository. There doesn't seem to be an obvious workaround since I don't have access to the associated dio_request_t. Would making a hard time limit of 10 seconds and eliminating the "tick" system work?

I'll be playing with this tomorrow but any insight would be appreciated before I start.

@behlendorf (Contributor)

@DeHackEd my suggestion would be to move this state tracking into the private dio_request_t structure. This is already managed by the vdev_disk layer so we can guarantee it will exist. Store the private state you need there.

@behlendorf (Contributor)

@dweeezil it occurs to me that if you're chasing unexplained disk timeouts this patch may be of use to you. It needs a little bit of work still but it does allow quickly timing out otherwise hung IO.

@dweeezil (Contributor) commented May 8, 2015

@behlendorf Thanks for the heads-up. I had forgotten about this.

@ryao (Contributor, Author) commented May 8, 2015

I had as well. Thanks for commenting here.

@behlendorf (Contributor)

While hijacking the block device's timeout callback to notify ZFS of a problem looks desirable on the surface, in practice implementing this in a safe, generic, reliable manner has proven to be problematic. The reality is we have minimal control over the SCSI layer's error handling. If that changes we can revisit doing something better. In practice, production deployments should be using TLER drives anyway, so this is less of an issue.

@behlendorf behlendorf closed this Feb 11, 2016
Labels: Type: Feature

6 participants