Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enclosure Management Integration #10

Closed
behlendorf opened this issue May 14, 2010 · 3 comments
Closed

Enclosure Management Integration #10

behlendorf opened this issue May 14, 2010 · 3 comments
Labels
Type: Feature Feature request or new feature
Milestone

Comments

@behlendorf
Copy link
Contributor

When ZFS fails a drive we need to be able to automatically do some minimal enclosure management. Luckily modern kernels provide a ses and enclosure module here which we may be able to leverage. Minimally in this case we want to set the fault led for this device in the enclosure. This could probably be done in kernel but that assumes that these modules provide a reasonable API. A simpler and perhaps more flexible solution would be to have zeventd set the fault led through the standard /sys/ interface. Regardless this needs to be thought through.

echo 1 >/sys/class/enclosure/6:0:9:0/000/fault

@behlendorf
Copy link
Contributor Author

It was determined the best solution to the issue it to use sg3utils and bypass the ses module. It has been observed to drastically increase module load time for complicated configurations and thus may be blacklisted. We should not depend on it being available.

@FransUrbo
Copy link
Contributor

@behlendorf With ZED now a reality, is this still an issue?

@behlendorf
Copy link
Contributor Author

There's still a legitimate issue here. I've opened #2375 to clearly the required features now that we have the ZED. I'll close this one out.

ryao added a commit to ryao/zfs that referenced this issue Dec 3, 2015
(gdb) bt
\#0  0x00007f31952a35d7 in raise () from /lib64/libc.so.6
\#1  0x00007f31952a4cc8 in abort () from /lib64/libc.so.6
\#2  0x00007f319529c546 in __assert_fail_base () from /lib64/libc.so.6
\#3  0x00007f319529c5f2 in __assert_fail () from /lib64/libc.so.6
\#4  0x00007f319529c66b in __assert () from /lib64/libc.so.6
\#5  0x00007f319659e024 in send_iterate_prop (zhp=zhp@entry=0x19cf500, nv=0x19da110) at libzfs_sendrecv.c:724
\openzfs#6  0x00007f319659e35d in send_iterate_fs (zhp=zhp@entry=0x19cf500, arg=arg@entry=0x7ffc515d6380) at libzfs_sendrecv.c:762
\openzfs#7  0x00007f319659f3ca in gather_nvlist (hdl=<optimized out>, fsname=fsname@entry=0x19cf250 "tpool/test", fromsnap=fromsnap@entry=0x7ffc515d7531 "snap1", tosnap=tosnap@entry=0x7ffc515d7542 "snap2", recursive=B_FALSE,
    nvlp=nvlp@entry=0x7ffc515d6470, avlp=avlp@entry=0x7ffc515d6478) at libzfs_sendrecv.c:809
\openzfs#8  0x00007f31965a408f in zfs_send (zhp=zhp@entry=0x19cf240, fromsnap=fromsnap@entry=0x7ffc515d7531 "snap1", tosnap=tosnap@entry=0x7ffc515d7542 "snap2", flags=flags@entry=0x7ffc515d6d30, outfd=outfd@entry=1,
    filter_func=filter_func@entry=0x0, cb_arg=cb_arg@entry=0x0, debugnvp=debugnvp@entry=0x0) at libzfs_sendrecv.c:1461
\openzfs#9  0x000000000040a981 in zfs_do_send (argc=<optimized out>, argv=0x7ffc515d6ff0) at zfs_main.c:3841
\openzfs#10 0x0000000000404d10 in main (argc=6, argv=0x7ffc515d6fc8) at zfs_main.c:6724
(gdb) fr 5
\#5  0x00007f319659e024 in send_iterate_prop (zhp=zhp@entry=0x19cf500, nv=0x19da110) at libzfs_sendrecv.c:724
724                             verify(nvlist_lookup_uint64(propnv,
(gdb) list
719                             verify(nvlist_lookup_string(propnv,
720                                 ZPROP_VALUE, &value) == 0);
721                             VERIFY(0 == nvlist_add_string(nv, propname, value));
722                     } else {
723                             uint64_t value;
724                             verify(nvlist_lookup_uint64(propnv,
725                                 ZPROP_VALUE, &value) == 0);
726                             VERIFY(0 == nvlist_add_uint64(nv, propname, value));
727                     }
728             }
(gdb) p prop
$1 = ZFS_PROP_RELATIME
akatrevorjay added a commit to akatrevorjay/zfs that referenced this issue Dec 16, 2017
# This is the 1st commit message:
Merge branch 'master' of https://github.com/zfsonlinux/zfs

* 'master' of https://github.com/zfsonlinux/zfs:
  Enable QAT support in zfs-dkms RPM

# This is the commit message openzfs#2:

Import 0.6.5.7-0ubuntu3

# This is the commit message openzfs#3:

gbp changes

# This is the commit message openzfs#4:

Bump ver

# This is the commit message openzfs#5:

-j9 baby

# This is the commit message openzfs#6:

Up

# This is the commit message openzfs#7:

Yup

# This is the commit message openzfs#8:

Add new module

# This is the commit message openzfs#9:

Up

# This is the commit message openzfs#10:

Up

# This is the commit message openzfs#11:

Bump

# This is the commit message openzfs#12:

Grr

# This is the commit message openzfs#13:

Yay

# This is the commit message openzfs#14:

Yay

# This is the commit message openzfs#15:

Yay

# This is the commit message openzfs#16:

Yay

# This is the commit message openzfs#17:

Yay

# This is the commit message openzfs#18:

Yay

# This is the commit message openzfs#19:

yay

# This is the commit message openzfs#20:

yay

# This is the commit message openzfs#21:

yay

# This is the commit message openzfs#22:

Update ppa script

# This is the commit message openzfs#23:

Update gbp conf with br changes

# This is the commit message openzfs#24:

Update gbp conf with br changes

# This is the commit message openzfs#25:

Bump

# This is the commit message openzfs#26:

No pristine

# This is the commit message openzfs#27:

Bump

# This is the commit message openzfs#28:

Lol whoops

# This is the commit message openzfs#29:

Fix name

# This is the commit message openzfs#30:

Fix name

# This is the commit message openzfs#31:

rebase

# This is the commit message openzfs#32:

Bump

# This is the commit message openzfs#33:

Bump

# This is the commit message openzfs#34:

Bump

# This is the commit message openzfs#35:

Bump

# This is the commit message openzfs#36:

ntrim

# This is the commit message openzfs#37:

Bump

# This is the commit message openzfs#38:

9

# This is the commit message openzfs#39:

Bump

# This is the commit message openzfs#40:

Bump

# This is the commit message openzfs#41:

Bump

# This is the commit message openzfs#42:

Revert "9"

This reverts commit de488f1.

# This is the commit message openzfs#43:

Bump

# This is the commit message openzfs#44:

Account for zconfig.sh being removed

# This is the commit message openzfs#45:

Bump

# This is the commit message openzfs#46:

Add artful

# This is the commit message openzfs#47:

Add in zed.d and zpool.d scripts

# This is the commit message openzfs#48:

Bump

# This is the commit message openzfs#49:

Bump

# This is the commit message openzfs#50:

Bump

# This is the commit message openzfs#51:

Bump

# This is the commit message openzfs#52:

ugh

# This is the commit message openzfs#53:

fix zed upgrade

# This is the commit message openzfs#54:

Bump

# This is the commit message openzfs#55:

conf file zed.d

# This is the commit message #56:

Bump
shartse pushed a commit to shartse/zfs that referenced this issue Jun 21, 2018
markroper added a commit to markroper/zfs that referenced this issue Feb 12, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

     #0 [ffffc9002b98adc8] __schedule at ffffffff81610f2e
     openzfs#1 [ffffc9002b98ae68] schedule at ffffffff81611558
     openzfs#2 [ffffc9002b98ae70] schedule_preempt_disabled at ffffffff8161184a
     openzfs#3 [ffffc9002b98ae78] __mutex_lock at ffffffff816131e8
     openzfs#4 [ffffc9002b98af18] arc_buf_destroy at ffffffffa0bf37d7 [zfs]
     openzfs#5 [ffffc9002b98af48] dbuf_destroy at ffffffffa0bfa6fe [zfs]
     openzfs#6 [ffffc9002b98af88] dbuf_evict_one at ffffffffa0bfaa96 [zfs]
     openzfs#7 [ffffc9002b98afa0] dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
     openzfs#8 [ffffc9002b98b050] dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
     openzfs#9 [ffffc9002b98b100] osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    openzfs#10 [ffffc9002b98b118] lu_object_free at ffffffffa06d6a74 [obdclass]
    openzfs#11 [ffffc9002b98b178] lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    openzfs#12 [ffffc9002b98b220] lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    openzfs#13 [ffffc9002b98b278] shrink_slab at ffffffff811ca9d8
    openzfs#14 [ffffc9002b98b338] shrink_node at ffffffff811cfd94
    openzfs#15 [ffffc9002b98b3b8] do_try_to_free_pages at ffffffff811cfe63
    openzfs#16 [ffffc9002b98b408] try_to_free_pages at ffffffff811d01c4
    openzfs#17 [ffffc9002b98b488] __alloc_pages_slowpath at ffffffff811be7f2
    openzfs#18 [ffffc9002b98b580] __alloc_pages_nodemask at ffffffff811bf3ed
    openzfs#19 [ffffc9002b98b5e0] new_slab at ffffffff81226304
    openzfs#20 [ffffc9002b98b638] ___slab_alloc at ffffffff812272ab
    openzfs#21 [ffffc9002b98b6f8] __slab_alloc at ffffffff8122740c
    openzfs#22 [ffffc9002b98b708] kmem_cache_alloc at ffffffff81227578
    openzfs#23 [ffffc9002b98b740] spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    openzfs#24 [ffffc9002b98b780] arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    openzfs#25 [ffffc9002b98b7b0] arc_read at ffffffffa0bf0924 [zfs]
    openzfs#26 [ffffc9002b98b858] dbuf_read at ffffffffa0bf9083 [zfs]
    openzfs#27 [ffffc9002b98b900] dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Signed-off-by: Mark Roper <markroper@gmail.com>
allanjude pushed a commit to KlaraSystems/zfs that referenced this issue Apr 28, 2020
Evict refcount=1 entries when DDT large
PrivatePuffin pushed a commit to PrivatePuffin/zfs that referenced this issue Jun 18, 2020
Remove -Z from the list of invalid zdb parameters
problame added a commit to problame/zfs that referenced this issue Oct 10, 2020
This is a fixup of commit 0fdd610

See added test case for a reproducer.

Stack trace:

    panic: VERIFY3(nvlist_next_nvpair(redactnvl, pair) == NULL) failed (0xfffff80003ce5d18x == 0x)

    cpuid = 7
    time = 1602212370
    KDB: stack backtrace:
    #0 0xffffffff80c1d297 at kdb_backtrace+0x67
    openzfs#1 0xffffffff80bd05cd at vpanic+0x19d
    openzfs#2 0xffffffff828446fa at spl_panic+0x3a
    openzfs#3 0xffffffff828af85d at dmu_redact_snap+0x39d
    openzfs#4 0xffffffff829c0370 at zfs_ioc_redact+0xa0
    openzfs#5 0xffffffff829bba44 at zfsdev_ioctl_common+0x4a4
    openzfs#6 0xffffffff8284c3ed at zfsdev_ioctl+0x14d
    openzfs#7 0xffffffff80a85ead at devfs_ioctl+0xad
    openzfs#8 0xffffffff8122a46c at VOP_IOCTL_APV+0x7c
    openzfs#9 0xffffffff80cb0a3a at vn_ioctl+0x16a
    openzfs#10 0xffffffff80a8649f at devfs_ioctl_f+0x1f
    openzfs#11 0xffffffff80c3b55e at kern_ioctl+0x2be
    openzfs#12 0xffffffff80c3b22d at sys_ioctl+0x15d
    openzfs#13 0xffffffff810a88e4 at amd64_syscall+0x364
    openzfs#14 0xffffffff81082330 at fast_syscall_common+0x101

Signed-off-by: Christian Schwarz <me@cschwarz.com>
sdimitro added a commit to sdimitro/zfs that referenced this issue Dec 7, 2021
rob-wing pushed a commit to KlaraSystems/zfs that referenced this issue Feb 17, 2023
Under certain loads, the following panic is hit:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    openzfs#4 0xffffffff8066fdee at vinactivef+0xde
    openzfs#5 0xffffffff80670b8a at vgonel+0x1ea
    openzfs#6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

A race condition can occur when allocating a new vnode and adding that
vnode to the vfs hash. If the newly created vnode loses the race when
being inserted into the vfs hash, it will not be recycled as its
usecount is greater than zero, hitting the above assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700

Signed-off-by:  Rob Wing <rob.wing@klarasystems.com>
Sponsored-by:   rsync.net
Sponsored-by:   Klara, Inc.
rob-wing pushed a commit to KlaraSystems/zfs that referenced this issue Feb 17, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    openzfs#4 0xffffffff808adc6f at trap_pfault+0x4f
    openzfs#5 0xffffffff80886da8 at calltrap+0x8
    openzfs#6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    openzfs#4 0xffffffff8066fdee at vinactivef+0xde
    openzfs#5 0xffffffff80670b8a at vgonel+0x1ea
    openzfs#6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700

Signed-off-by:  Rob Wing <rob.wing@klarasystems.com>
Submitted-by:   Klara, Inc.
Sponsored-by:   rsync.net
behlendorf pushed a commit that referenced this issue Feb 22, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes #14501
behlendorf pushed a commit to behlendorf/zfs that referenced this issue May 28, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
behlendorf pushed a commit that referenced this issue May 30, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes #14501
EchterAgo pushed a commit to EchterAgo/zfs that referenced this issue Aug 4, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    openzfs#1 0xffffffff8058e86f at vpanic+0x17f
    openzfs#2 0xffffffff8058e6e3 at panic+0x43
    openzfs#3 0xffffffff808adc15 at trap_fatal+0x385
    openzfs#4 0xffffffff808adc6f at trap_pfault+0x4f
    openzfs#5 0xffffffff80886da8 at calltrap+0x8
    openzfs#6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    openzfs#1 0xffffffff8059620f at vpanic+0x17f
    openzfs#2 0xffffffff81a27f4a at spl_panic+0x3a
    openzfs#3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    openzfs#4 0xffffffff8066fdee at vinactivef+0xde
    openzfs#5 0xffffffff80670b8a at vgonel+0x1ea
    openzfs#6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature Feature request or new feature
Projects
None yet
Development

No branches or pull requests

2 participants