Consider backporting zio_data_buf_alloc kernel panic fix to 2.0.x #12494

Closed
danderson opened this issue Aug 20, 2021 · 2 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@danderson

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | NixOS |
| Distribution Version | 21.05 |
| Kernel Version | 5.10.50 |
| Architecture | amd64 |
| OpenZFS Version | 2.0.5-1 |

Describe the problem you're observing

Issue #11531 identified a kernel panic triggered by the refactor in 13fac09, from Feb 2020. It was subsequently fixed in a81b812, in June 2021.

The fix was applied only to master and, subsequently, to the 2.1 branch. From git log spelunking, the bug appears to be present in every release on the 2.0.x track, which, given how recently 2.1 arrived, is likely the version used by many downstream distros that don't ship bleeding-edge rolling releases.

a81b812 states that the revert wasn't trivially clean, so I'd like to request that OpenZFS maintainers backport the fix to the 2.0 branch and make a 2.0.6 release, rather than gamble on downstream distro maintainers trying to apply this patch themselves.

I'm unsure if OpenZFS overlaps maintenance of the latest and previous stable release tracks for a while, or if the project's view is that downstream consumers should upgrade to 2.1 (which is a much larger delta) as soon as it releases, because 2.0.x becomes EOL. I looked around for a backporting or "support lifetime" policy, but failed to find one.

Describe how to reproduce the problem

Various reproduction steps are described in #11531. In my case, the reproduction steps were: run a NAS on NixOS 21.05, run a bunch of software on it that does reasonably heavy I/O, and get a kernel panic within a few days.

Include any warning/errors/backtraces from the system logs

Stack trace from my panicked server, which matches traces from #11531.

VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
PANIC at zio.c:341:zio_data_buf_alloc()
Showing stack for process 2235123
CPU: 11 PID: 2235123 Comm: transmission-da Tainted: P           O      5.10.50 #1-NixOS
Hardware name: Supermicro SSG-5028R-E1CR12LA-CE010/X10SRH-CLN4F, BIOS 3.2 11/22/2019
Call Trace:
 dump_stack+0x6b/0x83
 spl_panic+0xd4/0xfc [spl]
 ? spl_kmem_cache_alloc+0x75/0x790 [spl]
 ? kmem_cache_alloc+0xda/0x1d0
 ? spl_kmem_cache_alloc+0x98/0x790 [spl]
 ? aggsum_add+0x175/0x190 [zfs]
 ? mutex_lock+0xe/0x30
 ? aggsum_add+0x175/0x190 [zfs]
 zio_data_buf_alloc+0x55/0x60 [zfs]
 abd_alloc_linear+0x8a/0xc0 [zfs]
 arc_hdr_alloc_abd+0xdf/0x200 [zfs]
 arc_hdr_alloc+0x104/0x170 [zfs]
 arc_alloc_buf+0x46/0x150 [zfs]
 dbuf_hold_copy.constprop.0+0x31/0xa0 [zfs]
 dbuf_hold_impl+0x476/0x660 [zfs]
 dbuf_hold+0x2c/0x60 [zfs]
 dmu_buf_hold_array_by_dnode+0xdd/0x570 [zfs]
 dmu_read_uio_dnode+0x49/0x140 [zfs]
 ? zfs_rangelock_enter_impl+0x269/0x650 [zfs]
 dmu_read_uio_dbuf+0x42/0x60 [zfs]
 zfs_read+0x130/0x3a0 [zfs]
 zpl_iter_read+0xe4/0x190 [zfs]
 new_sync_read+0x115/0x1a0
 vfs_read+0x14b/0x1a0
 __x64_sys_pread64+0x8d/0xc0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f494c8b5fcf
f 77 35 44 89 c7 48 89 44 24 08 e8 7c f4 ff ff 48
RSP: 002b:00007f494b35c8b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000000018 RCX: 00007f494c8b5fcf
RDX: 0000000000004000 RSI: 00007f494574c000 RDI: 0000000000000018
RBP: 00007f494574c000 R08: 0000000000000000 R09: 00007f494b35c990
R10: 00000000ceff0000 R11: 0000000000000293 R12: 00000000ceff0000
R13: 0000000000004000 R14: 00007f49043d6000 R15: 0000000000000000
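
For anyone comparing traces: the odd-looking value in the failed VERIFY3 is exactly 2^55 - 1, which is what a zero-byte request would produce if the cache index is computed as `(size - 1) >> SPA_MINBLOCKSHIFT` on a 64-bit unsigned `size_t`. The sketch below only reproduces that arithmetic. The constants (`SPA_MINBLOCKSHIFT = 9`, `SPA_MAXBLOCKSIZE = 16 MiB`) and the index formula are my assumptions from reading the OpenZFS source, not something taken from this trace, and `buf_cache_index` is a hypothetical helper name. Python is used just for the numbers; the real code is C.

```python
# Assumed constants (believed to match the OpenZFS headers; check your own tree).
SPA_MINBLOCKSHIFT = 9          # 512-byte minimum block size
SPA_MAXBLOCKSIZE = 1 << 24     # 16 MiB maximum block size
SIZE_T_MASK = (1 << 64) - 1    # emulate a 64-bit unsigned size_t


def buf_cache_index(size: int) -> int:
    """Hypothetical helper: emulate (size - 1) >> SPA_MINBLOCKSHIFT
    with 64-bit unsigned wraparound, as C would do for size_t."""
    return ((size - 1) & SIZE_T_MASK) >> SPA_MINBLOCKSHIFT


# Right-hand side of the failed check in the panic message.
limit = SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT
print(limit)                   # 32768

# A zero-byte request wraps around to a huge index.
print(buf_cache_index(0))      # 36028797018963967

# A normal request (here 512 bytes) lands inside the valid range.
print(buf_cache_index(512))    # 0
```

Both numbers match the panic message, so the trace is at least consistent with `zio_data_buf_alloc()` being asked for a zero-sized buffer. This does not by itself prove that is what happened.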
@danderson danderson added the Type: Defect Incorrect behavior (e.g. crash, hang) label Aug 20, 2021
@aerusso
Contributor

aerusso commented Aug 21, 2021

FYI: I've got this in #12346, which is queued for review

@rincebrain
Contributor

rincebrain commented Mar 23, 2022

Closing; this made it into 2.0.6.

(If it still repros anyway, please let us know!)
