Unloading zfs stack causes a BUG #828
Thanks for filing the bug. Can you include the exact SPL/ZFS versions, kernel version, and architecture you're seeing the failures on?
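(For reference, a minimal sketch of how that information is usually gathered; these are standard commands, not ones quoted from this thread:)

```sh
# Collect kernel, architecture, and module version details for the report.
uname -a                              # kernel version and architecture
modinfo spl zfs | grep -i '^version'  # installed SPL/ZFS module versions
dmesg | grep -E 'SPL:|ZFS:'           # versions printed at module load/unload time
```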
Whoops, sorry about forgetting that. Here it comes:

```
Linux storagetest 3.2.13-damicon-scst #2 SMP Mon Mar 26 16:11:22 EEST 2012 x86_64 GNU/Linux
```

Vanilla sources + scst (http://scst.sourceforge.net/) patches. spl at openzfs/spl@33f507c
Got another one after stressing ZoL with more 8k testing, this time with reads. 4 hours seems sufficient to manifest this bug around 50% of the time. Oh, both SPL and ZFS are now at latest git master as of 17.7.

```
[4865699.739735] Pid: 2267, comm: modprobe Tainted: P O 3.2.13-damicon-scst #2 Supermicro X8DT6/X8DT6
[4865699.739788] RIP: 0010:[] [] l2arc_evict+0x11c/0x310 [zfs]
[4865699.739865] RSP: 0018:ffff8804bca0fd88 EFLAGS: 00000202
[4865699.739891] RAX: 0000000000000220 RBX: ffffffffa02684cb RCX: 7f29d8ef13f8b391
[4865699.739934] RDX: 0000000000000064 RSI: 00000000001c8217 RDI: ffffffffa061f290
[4865699.739977] RBP: ffff8804bca0fdd8 R08: ffff88007b9b7640 R09: 0000000000000001
[4865699.740020] R10: ffff88013c826400 R11: ffff88003bf0e8c8 R12: ffffffffa061fed0
[4865699.740062] R13: 0000000000000020 R14: ffff8800488b0260 R15: ffff8803544b9410
[4865699.740105] FS: 00007f07145ec700(0000) GS:ffff880627c60000(0000) knlGS:0000000000000000
[4865699.740150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4865699.740177] CR2: 00007fd9d030f000 CR3: 0000000ab0d76000 CR4: 00000000000006e0
[4865699.740220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4865699.740263] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[4865699.740306] Process modprobe (pid: 2267, threadinfo ffff8804bca0e000, task ffff880614a2ada0)
[4865699.740350] Stack:
[4865699.740370]  0000000000000064 0000000c2d139200 ffff8804cf4a6ba0 0000000100000009
[4865699.740423]  ffff8809717db000 ffff8804cf4a6ba0 ffff8805f540f000 ffff8805f540f000
[4865699.740474]  ffff8804bca0fe10 00000000015e0170 ffff8804bca0fdf8 ffffffffa0553d67
[4865699.740525] Call Trace:
[4865699.740558]  [] l2arc_remove_vdev+0xc7/0x140 [zfs]
[4865699.740608]  [] spa_l2cache_drop+0x78/0xa0 [zfs]
[4865699.740655]  [] spa_unload+0x9f/0x210 [zfs]
[4865699.740701]  [] spa_evict_all+0x6f/0xb0 [zfs]
[4865699.740749]  [] spa_fini+0x13/0x130 [zfs]
[4865699.740797]  [] _fini+0x73/0x100 [zfs]
[4865699.740843]  [] spl__fini+0x13/0x40 [zfs]
[4865699.740873]  [] sys_delete_module+0x19e/0x270
[4865699.740903]  [] ? do_munmap+0x2ae/0x370
[4865699.740932]  [] system_call_fastpath+0x16/0x1b
[4865699.740959] Code: 21 fa 48 89 d0 48 89 55 b0 48 c1 e0 06 4c 8d a8 90 d9 61 a0 4c 89 ef e8 33 23 00 e1 48 8b 55 b0 85 c0 0f 84 b9 01 00 00 8b 43 60 00 00 04 00 0f 85 f9 00 00 00 44 8b 4d cc 45 85 c9 75 28 48
[4865699.741163] Call Trace:
[4865699.741194]  [] l2arc_remove_vdev+0xc7/0x140 [zfs]
[4865699.741241]  [] spa_l2cache_drop+0x78/0xa0 [zfs]
[4865699.741287]  [] spa_unload+0x9f/0x210 [zfs]
[4865699.741332]  [] spa_evict_all+0x6f/0xb0 [zfs]
[4865699.741379]  [] spa_fini+0x13/0x130 [zfs]
[4865699.741425]  [] _fini+0x73/0x100 [zfs]
[4865699.741471]  [] spl__fini+0x13/0x40 [zfs]
[4865699.741498]  [] sys_delete_module+0x19e/0x270
[4865699.741527]  [] ? do_munmap+0x2ae/0x370
[4865699.741554]  [] system_call_fastpath+0x16/0x1b
[4865713.462927] ZFS: Unloaded module v0.6.0-rc9
[4865714.430720] SPL: Unloaded module v0.6.0-rc9
```

This time the pool was configured with 2x 120GB SSDs. Did a test earlier with just 1 SSD, and unloading the ZoL stack that time didn't bug out. Maybe one needs more than 1 SSD in the pool to bug it out like this, I dunno.
It seems like the ARC hdr_cache is somehow being destroyed before everything gets evicted from the L2ARC. How exactly this can happen isn't clear.
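If that is the case, the SPL slab and ARC statistics taken right before unloading might show header objects still outstanding. A hedged sketch of what one could check (these are the standard SPL/ZFS proc files, not commands taken from this thread):

```sh
# Look for arc_buf_hdr_t objects still allocated, and for a non-zero
# L2ARC header footprint, just before attempting the module unload.
grep arc_buf_hdr_t /proc/spl/kmem/slab
grep -E '^(hdr_size|l2_hdr_size|l2_size)' /proc/spl/kstat/zfs/arcstats
```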
Another one, this time after stressing SLOG writes (sync). 2x 120GB SSDs in the pool as cache, spl+zfs from 19.7 git master. So far I tried to reproduce this bug today and yesterday with other-blocksize zvols, but only an 8k-volblocksize zvol with 8k writes/reads causes enough stress to reliably reproduce it. @behlendorf - anything you want me to try to narrow this down? So far it hasn't outright crashed the machine, but it's annoying. Also, 8k zvols are horribly slow. Looks like 8k zvols (and 8k-and-below reads/writes) really stress some part of the ZoL implementation. Literally a simple

```
[4955958.710255] Pid: 30347, comm: modprobe Tainted: P O 3.2.13-damicon-scst #2 Supermicro X8DT6/X8DT6
[4955958.710313] RIP: 0010:[] [] l2arc_evict+0x11c/0x310 [zfs]
[4955958.710394] RSP: 0018:ffff880a87dffd88 EFLAGS: 00000202
[4955958.710422] RAX: 0000000000000220 RBX: ffffffffa02684cb RCX: 34cf6a2b93ddb7a3
[4955958.710468] RDX: 00000000000000ee RSI: 00000000001d7b75 RDI: ffffffffa06214f0
[4955958.710514] RBP: ffff880a87dffdd8 R08: ffff8802704c23f8 R09: 0000000000000000
[4955958.710559] R10: ffff880956468400 R11: ffff8805b3946178 R12: ffffffffa061e5b0
[4955958.710605] R13: 0000000000000020 R14: ffff880258c8d7e0 R15: ffff8807ec904000
[4955958.710652] FS: 00007f6f0da63700(0000) GS:ffff880c3fc20000(0000) knlGS:0000000000000000
[4955958.710700] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4955958.710729] CR2: 00007f84f32a5730 CR3: 00000004c7a51000 CR4: 00000000000006e0
[4955958.710776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4955958.710822] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[4955958.710868] Process modprobe (pid: 30347, threadinfo ffff880a87dfe000, task ffff8807effb16d0)
[4955958.710916] Stack:
[4955958.710938]  00000000000000ee 00000006e9172400 ffff8801796a3de0 0000000100000009
[4955958.710996]  ffff8807eb8bb000 ffff8801796a3de0 ffff8803f7103000 ffff8803f7103000
[4955958.711052]  ffff880a87dffe10 000000000218d170 ffff880a87dffdf8 ffffffffa0553d67
[4955958.711109] Call Trace:
[4955958.711151]  [] l2arc_remove_vdev+0xc7/0x140 [zfs]
[4955958.711216]  [] spa_l2cache_drop+0x78/0xa0 [zfs]
[4955958.711281]  [] spa_unload+0x9f/0x210 [zfs]
[4955958.711344]  [] spa_evict_all+0x6f/0xb0 [zfs]
[4955958.711409]  [] spa_fini+0x13/0x130 [zfs]
[4955958.711473]  [] _fini+0x73/0x100 [zfs]
[4955958.711537]  [] spl__fini+0x13/0x40 [zfs]
[4955958.711571]  [] sys_delete_module+0x19e/0x270
[4955958.711605]  [] ? do_munmap+0x2ae/0x370
[4955958.711640]  [] system_call_fastpath+0x16/0x1b
[4955958.711669] Code: 21 fa 48 89 d0 48 89 55 b0 48 c1 e0 06 4c 8d a8 70 d9 61 a0 4c 89 ef e8 33 23 00 e1 48 8b 55 b0 85 c0 0f 84 b9 01 00 00 8b 43 60 00 00 04 00 0f 85 f9 00 00 00 44 8b 4d cc 45 85 c9 75 28 48
[4955958.711904] Call Trace:
[4955958.711945]  [] l2arc_remove_vdev+0xc7/0x140 [zfs]
[4955958.712009]  [] spa_l2cache_drop+0x78/0xa0 [zfs]
[4955958.712073]  [] spa_unload+0x9f/0x210 [zfs]
[4955958.712136]  [] spa_evict_all+0x6f/0xb0 [zfs]
[4955958.712200]  [] spa_fini+0x13/0x130 [zfs]
[4955958.712265]  [] _fini+0x73/0x100 [zfs]
[4955958.712329]  [] spl__fini+0x13/0x40 [zfs]
[4955958.712360]  [] sys_delete_module+0x19e/0x270
[4955958.712392]  [] ? do_munmap+0x2ae/0x370
[4955958.712422]  [] system_call_fastpath+0x16/0x1b
[4955969.733825] ZFS: Unloaded module v0.6.0-rc9
[4955970.541101] SPL: Unloaded module v0.6.0-rc9
```
As a temporary workaround you might try removing the L2ARC device from the pool before unloading the modules.
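For example (a rough sketch of that workaround; `tank`, `sdx`, and `sdy` are placeholder pool/device names, not ones from this report):

```sh
# Drop the cache (L2ARC) SSDs from the pool first, then unload the stack.
zpool remove tank sdx sdy
modprobe -r zfs zcommon znvpair zavl zunicode spl
```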
@behlendorf - got another one, this time while removing the L2ARC device from the pool after a stress test. Here's the stack trace:

```
[5608739.177785] Pid: 30117, comm: zpool Tainted: P O 3.2.13-damicon-scst #2 Supermicro X8DT6/X8DT6
[5608739.177843] RIP: 0010:[] [] arc_change_state+0xee/0x2f0 [zfs]
[5608739.177930] RSP: 0018:ffff880177affc68 EFLAGS: 00000207
[5608739.177958] RAX: 0000000000000000 RBX: ffffffffa018f4cb RCX: ffffffffffffe000
[5608739.178004] RDX: 0000000000000001 RSI: dead000000200200 RDI: dead000000100100
[5608739.178049] RBP: ffff880177affca8 R08: ffff880b56ce5c58 R09: 0000000000000001
[5608739.178095] R10: ffff8803ab505400 R11: ffff880be54bd8a0 R12: ffff8803b3199600
[5608739.178140] R13: ffff880177affc38 R14: ffff8803ab505400 R15: ffff880177affc78
[5608739.178186] FS: 00007fd5e6fc3be0(0000) GS:ffff880627c00000(0000) knlGS:0000000000000000
[5608739.178234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5608739.178263] CR2: 00007f022b9a24f8 CR3: 0000000af7344000 CR4: 00000000000006e0
[5608739.178309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5608739.178354] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[5608739.178400] Process zpool (pid: 30117, threadinfo ffff880177afe000, task ffff880616ed0000)
[5608739.178448] Stack:
[5608739.178469]  ffffffff00000001 0000000000000000 ffff880177affca8 ffff880b56ce5c48
[5608739.178527]  ffff8808fc9308c8 ffffffffa0726570 ffff8801f4e97a70 ffff8801f4e97a60
[5608739.178583]  ffff880177affd08 ffffffffa065ac43 0000000000000070 000000184c7de600
[5608739.178639] Call Trace:
[5608739.178682]  [] l2arc_evict+0x2b3/0x310 [zfs]
[5608739.178730]  [] l2arc_remove_vdev+0xc7/0x140 [zfs]
[5608739.178797]  [] spa_load_l2cache+0x1c3/0x4d0 [zfs]
[5608739.178863]  [] spa_vdev_remove+0x335/0x3a0 [zfs]
[5608739.178932]  [] zfs_ioc_vdev_remove+0x37/0x60 [zfs]
[5608739.179000]  [] ? pool_status_check+0x62/0xa0 [zfs]
[5608739.179068]  [] zfsdev_ioctl+0xfd/0x1c0 [zfs]
[5608739.179101]  [] do_vfs_ioctl+0x9b/0x4f0
[5608739.179132]  [] ? call_rcu_sched+0x15/0x20
[5608739.179162]  [] sys_ioctl+0xa1/0xb0
[5608739.179194]  [] system_call_fastpath+0x16/0x1b
[5608739.179223] Code: 00 00 4c 3b 2d 1c b1 0c 00 0f 84 66 01 00 00 4c 3b 2d 17 b1 0c 00 0f 84 59 01 00 00 4d 89 f7 4c 89 f9 48 f7 d9 f0 49 01 4c c5 40 <85> d2 0f 85 c2 01 00 00 48 8b 05 cb b0 0c 00 48 39 c3 0f 84 f2
[5608739.179456] Call Trace:
[5608739.179498]  [] l2arc_evict+0x2b3/0x310 [zfs]
[5608739.179546]  [] l2arc_remove_vdev+0xc7/0x140 [zfs]
[5608739.179610]  [] spa_load_l2cache+0x1c3/0x4d0 [zfs]
[5608739.179675]  [] spa_vdev_remove+0x335/0x3a0 [zfs]
[5608739.179743]  [] zfs_ioc_vdev_remove+0x37/0x60 [zfs]
[5608739.179811]  [] ? pool_status_check+0x62/0xa0 [zfs]
[5608739.179878]  [] zfsdev_ioctl+0xfd/0x1c0 [zfs]
[5608739.179908]  [] do_vfs_ioctl+0x9b/0x4f0
[5608739.179938]  [] ? call_rcu_sched+0x15/0x20
[5608739.179967]  [] sys_ioctl+0xa1/0xb0
[5608739.179997]  [] system_call_fastpath+0x16/0x1b
```
I haven't been able to replicate this issue anymore, so I guess this can be closed as well.
While testing the slow zvol issue #824 I often encounter this bug while doing a `modprobe -r zfs zcommon znvpair zavl zunicode spl`. Stack trace follows. However, the system stays responsive and loading the ZFS stack again doesn't cause issues.

This bug only manifests itself after stressing `volblocksize=8k` zvols with 8k iosize writes for a few hours. After that stress test, every zfs/zpool command is slow (slab fragmentation?), taking several seconds to complete with just 10 ZFS filesystems. `echo 3 > /proc/sys/vm/drop_caches` doesn't restore responsiveness; only `modprobe -r` followed by reloading the ZFS stack returns things to normal.
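For completeness, a sketch of the kind of stress/unload cycle described above; the pool name, zvol size, and fio parameters are illustrative assumptions, not the reporter's exact setup:

```sh
# Create an 8k-volblocksize zvol and hammer it with 8k writes for several
# hours, then unload the module stack -- roughly the sequence that triggers
# the BUG described in this issue.  "tank" and the sizes are placeholders.
zfs create -V 50G -o volblocksize=8k tank/stress
fio --name=stress --filename=/dev/zvol/tank/stress --rw=randwrite \
    --bs=8k --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=14400 --time_based    # 14400 s = 4 hours

# Afterwards, dropping caches alone did not help in this report; only a
# full unload/reload of the stack restored normal responsiveness.
echo 3 > /proc/sys/vm/drop_caches
modprobe -r zfs zcommon znvpair zavl zunicode spl
modprobe zfs
```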