
panic on FreeBSD 14-CURRENT w/slog_replay_fs_001 #12163

Closed
rincebrain opened this issue May 30, 2021 · 11 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rincebrain
Contributor

System information

Type                 | Version/Name
Distribution Name    | FreeBSD
Distribution Version | 14-CURRENT
FreeBSD Kernel       | main-n246975-8d5c7813061
Architecture         | x86_64
ZFS Version          | d484a72

Describe the problem you're observing

I was trying to reproduce the 3 consistent FBSD 14-CURRENT test failures:

    FAIL cli_root/zdb/zdb_checksum (expected PASS)
    FAIL cli_root/zdb/zdb_objset_id (expected PASS)
    FAIL slog/slog_replay_fs_001 (expected PASS)

The first two were trivial to work around, and I'll have a PR shortly. The third one looked more complicated: when I tried it with the "stock" version it shipped with (3522f57), it logged the following to syslog but didn't hang the test or the greater system:

May 30 00:34:13 fbsd14test kernel: got error -2 on name b on op 3
May 30 00:34:13 fbsd14test kernel: KDB: stack backtrace:
May 30 00:34:13 fbsd14test kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e03cae40
May 30 00:34:13 fbsd14test kernel: zfs_lookup_internal() at zfs_lookup_internal+0xec/frame 0xfffffe00e03caea0
May 30 00:34:13 fbsd14test kernel: zfs_rename() at zfs_rename+0xf9/frame 0xfffffe00e03caf90
May 30 00:34:13 fbsd14test kernel: zfs_replay_rename() at zfs_replay_rename+0x8b/frame 0xfffffe00e03cafe0
May 30 00:34:13 fbsd14test kernel: zil_replay_log_record() at zil_replay_log_record+0x21a/frame 0xfffffe00e03cb130
May 30 00:34:13 fbsd14test kernel: zil_parse() at zil_parse+0x5e0/frame 0xfffffe00e03cb340
May 30 00:34:13 fbsd14test kernel: zil_replay() at zil_replay+0xd5/frame 0xfffffe00e03cb3a0
May 30 00:34:13 fbsd14test kernel: zfsvfs_setup() at zfsvfs_setup+0x24d/frame 0xfffffe00e03cb5d0
May 30 00:34:13 fbsd14test kernel: zfs_mount() at zfs_mount+0x66f/frame 0xfffffe00e03cb770
May 30 00:34:13 fbsd14test kernel: vfs_domount() at vfs_domount+0x8a0/frame 0xfffffe00e03cb9e0
May 30 00:34:13 fbsd14test kernel: vfs_donmount() at vfs_donmount+0x872/frame 0xfffffe00e03cba80
May 30 00:34:13 fbsd14test kernel: sys_nmount() at sys_nmount+0x69/frame 0xfffffe00e03cbac0
May 30 00:34:13 fbsd14test kernel: amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe00e03cbbf0
May 30 00:34:13 fbsd14test kernel: fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e03cbbf0

So I installed the latest OpenZFS git, modified loader.conf to have zfs_load="NO" openzfs_load="YES", rebooted, and ran the test again...very shortly after launching it, I noticed my sessions had hung, and the local console had this:
[screenshot: VirtualBox_FreeBSD 14 testbed_30_05_2021_00_45_21]

(For later searching purposes: "panic: namei: repeated call to namei without NDREINIT" is the top of it.)

I managed to save a core dump, and can provide that on request if anyone's interested.
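
For reference, the loader.conf switch described above amounts to something like this (a sketch of /boot/loader.conf; the base-system module stays disabled and the module built from git loads as openzfs.ko):

    # /boot/loader.conf (sketch)
    zfs_load="NO"        # don't load the base system's zfs.ko
    openzfs_load="YES"   # load the module built from OpenZFS git instead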

Describe how to reproduce the problem

Run slog_replay_fs_001 on this kernel version and OpenZFS version, apparently.
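
Roughly, assuming the test suite from the OpenZFS tree is installed, something like this should exercise just that test (a sketch; the zfs-tests.sh location and the exact form of the -t path may differ by install method):

    # run only the slog replay test from the ZFS test suite (sketch)
    /usr/local/share/zfs/zfs-tests.sh -v \
        -t functional/slog/slog_replay_fs_001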

Include any warning/errors/backtraces from the system logs

See above.

@rincebrain added the Status: Triage Needed and Type: Defect labels on May 30, 2021
@rincebrain
Contributor Author

rincebrain commented May 30, 2021

Assuming the test failure and this are correlated, it looks like build 651 was the first one on the bot(s) to have those three fail - and it has the aforementioned trace in the syslog. Unfortunately, I haven't seen an obvious way to get the FBSD revision that was running at the time, but since snapshots seem to be weekly, that rather limits the options.

Since the "stock" revision of OZFS it came with (2.1-rc1) is from 3/29 and does this, I suspect that it's a change on the FBSD side that resulted in this.

edit: I didn't see anything obviously relevant in the commits to FBSD in the last month, so I tried rolling back OpenZFS to 93f81eb (going any further back would require cherry-picking 93f81eb just to build), to be reasonably sure it's not a change in OpenZFS - same failure modes.
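
(For anyone following along, the per-revision OpenZFS rebuild is roughly the usual autotools flow - a sketch, with 93f81eb as the example revision:)

    # sketch: rebuild the OpenZFS module/userland at a given revision
    git checkout 93f81eb
    sh autogen.sh && ./configure && make -j"$(sysctl -n hw.ncpu)"
    make install && reboot   # kldunload/kldload isn't an option with a ZFS root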

So I guess it's FreeBSD bisect time, unless someone has better insight than me.
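
(And the kernel side of that bisect is roughly the standard loop - a sketch, with <rev> as a placeholder:)

    # sketch: rebuild and boot a given freebsd-src revision
    cd /usr/src && git checkout <rev>
    make -j"$(sysctl -n hw.ncpu)" kernel KERNCONF=GENERIC
    shutdown -r now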

@rincebrain
Contributor Author

rincebrain commented May 30, 2021

Huh.

So I hopped a bit into the past for FBSD (n245865-13b3862ee874, so around 4/6) and tried this with the bundled zfs.ko (891568c) (which would have been built with DEBUG): I got the aforementioned message and backtrace, but the test passed. Tried with git master (d484a72) and no debug, and a panic popped out. Tried the same git rev with debug, and it likewise passed slog_replay_fs_001 and printed the aforementioned backtrace.

I wonder if it'd be better to build ZFS without --enable-debug on the FBSD testbots, to catch things like this...

@behlendorf
Contributor

Based on the panic from the latest OpenZFS master I'd suspect that could have been introduced by #11997. It was only applied to master on 5/13, so it wouldn't have been in the earlier versions you were testing. That would be the panic: namei: panic.

The other slog_replay_fs_001 stack looks like a different problem to me, and is a bit suspicious because clearly some debug code was left in place to dump a backtrace on failure here. The other odd thing is that it logged a -2 error code, and I wouldn't have expected to see any negative error codes in the FreeBSD call paths. Figuring out where that -ENOENT is coming from would probably be insightful.

It'd be great to open PRs at least for those first two easy test case fixes.

cc: @freqlabs @amotin

@behlendorf removed the Status: Triage Needed label on May 30, 2021
@rincebrain
Contributor Author

rincebrain commented May 30, 2021

Easy enough to test.

I'll go open the other two fixes, though I'm not sure if they're just masking some deeper problem...I'll write that down in the PR.

edit: #12165

@rincebrain
Contributor Author

Oh right, I tested with 93f81eb which, according to my earlier post, still panics, and that seems to be before 210231e landed.

I'll go test with 210231e~1 once my latest build{kernel,world} finishes, to be certain I didn't just take the aforementioned debug backtrace as a proxy and write it down incorrectly, but I don't think I did.

@rincebrain
Contributor Author

rincebrain commented May 30, 2021

I just tested with d86debf (which is, unless I'm really bad at git, the commit immediately prior to 210231e), and...still panics!

[screenshot: VirtualBox_FreeBSD 14 testbed_30_05_2021_15_16_06]

(I explicitly checked the version of the kernel module and userland before running the test, so I'm reasonably confident this is not a false report.)
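
(Concretely, something like this is enough to confirm what's loaded and what version it reports - a sketch:)

    # confirm which module file is loaded and what versions are reported
    kldstat | grep -E 'zfs|openzfs'
    zfs version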

So I'm going to believe my earlier comment that back to 93f81eb still panics, and go back to trying older FBSD kernels.

edit to add: n245179-95331c228a39 (3/1 or so) + 5ad86e9 still burns down. (Had to use 5ad86e9 or earlier because apparently git master dies with implicit declarations of various vfsops functions after 93f81eb on older FBSD git.)

@rincebrain
Contributor Author

...huh.

0e9bcd5 + freebsd/freebsd-src@95331c228a39 has this panic, and the commit in 0e9bcd5 is needed to build against the recent system. So I can't really wind further back on that, but it seems like it's been with us for some time, and winding OpenZFS back won't help find it.

(If you're wondering why I tried despite my earlier statement - building kernels takes much longer than just winding OpenZFS back, so I figured I'd at least try going back as far as possible.)

As for why nobody's run across it - it seems to require you be running a kernel with as much debugging as "GENERIC" (it doesn't reproduce with GENERIC-NODEBUG, which may be expected for FreeBSD regulars, but certainly surprised me), and an OpenZFS module without --enable-debug (which is, I believe, why none of the "stock" zfs modules bundled with FBSD base, or built by the buildbots, ever did this).
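
Spelled out, the combination that seems to reproduce it looks roughly like this (a sketch; KERNCONF names as shipped in the FreeBSD tree, configure flags as in the OpenZFS tree):

    # panics: debug kernel config + OpenZFS module built without --enable-debug
    make kernel KERNCONF=GENERIC            # buildkernel + installkernel
    ./configure && make && make install     # no --enable-debug
    #
    # doesn't panic: either drop the kernel debug options or add --enable-debug
    make kernel KERNCONF=GENERIC-NODEBUG
    ./configure --enable-debug && make && make install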

Is the right thing to do to detect that you're building against a FBSD kernel with debug bits and force them on (or error out) when --enable-debug wasn't passed? I'm not sure; I don't yet have any insight into whether this message is spurious, and it feels somewhat hackish unless FreeBSD outright says you're not allowed to mix them. (If this takes a nontrivial interval, I'll probably open a PR for it anyway, at least for now, so people can't accidentally get burned.)

Meanwhile, I'm going to go back to trying older and older kernels.

@rincebrain
Contributor Author

Okay, I went all the way back to freebsd/freebsd-src@0f34c80 and it still panicked; I tried freebsd/freebsd-src@b58a463 and it failed to boot from my ZFS root with either the builtin zfs.ko or 5ad86e9, so I decided to stop trying that rather than reinstalling my testbed with UFS root to continue.

I examined the code long enough to confirm one theory I had (that it was calling namei() twice due to the "restart" gotos) was false, then I looked at other NDINIT_* callers, and couldn't find anything obviously wrong with the usage.

So for now, I'm going to go try bisecting the incorrect behavior of the test case I started this looking into (compiled with --enable-debug --enable-debuginfo, obviously), and maybe I'll come back to this later - it can't be especially important, if nobody else has reported it prior, and it's been present since at least January.

@ghost

ghost commented Jun 1, 2021

The namei errors, I believe, are from mismatched INVARIANTS build options.

@rincebrain
Contributor Author

rincebrain commented Jun 1, 2021

> The namei errors, I believe, are from mismatched INVARIANTS build options.

How odd - if so, I think something is wrong, because there is configure code to notice this and set things appropriately, and my config.log without --enable-debug --enable-debuginfo reports WITH_INVARIANTS='true'.
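
(For reference, the sort of check in question - a sketch:)

    # what the OpenZFS configure run decided
    grep WITH_INVARIANTS config.log
    # what the running kernel was actually built with (GENERIC includes
    # INCLUDE_CONFIG_FILE, so its config is queryable)
    sysctl -n kern.conftxt | grep -Ew 'INVARIANTS|WITNESS'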

So I'll experiment with this and see what I find. Thanks!

edit: So I just tried --enable-invariants, in case it behaved differently than "detect" or I was wrong, and it still panics the same way. So I guess --enable-debug is required, but I'm still curious why it's only breaking on 14-CURRENT.

Maybe I'll try my hand at a patch to notice DEBUG...

@ghost

ghost commented Jun 1, 2021

Just a hunch, we might need to make sure WITH_INVARIANTS implies WITH_DEBUG through either the configure logic or the Makefile.
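
A minimal sketch of that idea in shell terms (purely illustrative - not the change that actually landed, and the exact flags are placeholders):

    # hypothetical: if the running kernel was built with INVARIANTS, force the
    # module's debug bits on too instead of producing a mismatched build
    if sysctl -n kern.conftxt | grep -qw INVARIANTS; then
        CPPFLAGS="$CPPFLAGS -DINVARIANTS -DDEBUG=1"
    fi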

rincebrain added a commit to rincebrain/zfs that referenced this issue Jun 3, 2021
There's already logic to force INVARIANTS on for building if it's
present in the running kernel; however, not having DEBUG enabled
when DEBUG and INVARIANTS are can cause strange panics.

Closes: openzfs#12163

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jun 8, 2021
There's already logic to force INVARIANTS on for building if it's
present in the running kernel; however, not having DEBUG enabled
when DEBUG and INVARIANTS are can cause strange panics.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#12185
Closes openzfs#12163
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jun 9, 2021
There's already logic to force INVARIANTS on for building if it's
present in the running kernel; however, not having DEBUG enabled
when DEBUG and INVARIANTS are can cause strange panics.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#12185
Closes openzfs#12163