Idle system, no obvious cause: PANIC at zfs_znode.c:620:zfs_znode_alloc(), VERIFY3(insert_inode_locked(ip) == 0) failed (-16 == 0) #12768

Closed
Kayvlim opened this issue Nov 15, 2021 · 5 comments
Labels: Status: Stale (no recent activity), Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

Kayvlim (Contributor) commented Nov 15, 2021

This issue seems to be similar to #11648 and #9741, but I chose to open a new issue because in my case there is no recv going on.

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version (rolling, last full update yesterday 14/11/2021)
Kernel Version 5.15.2-arch1-1
Architecture x86_64
OpenZFS Version zfs-2.1.1-1

uname -a: Linux $hostname 5.15.2-arch1-1 #1 SMP PREEMPT Fri, 12 Nov 2021 19:22:10 +0000 x86_64 GNU/Linux

ZFS-related packages installed via the archzfs repository:

archzfs  zfs-linux  2.1.1_5.15.2.arch1.1-1  [installed]
archzfs  zfs-utils  2.1.1-1                 [installed]

Describe the problem you're observing

This is my laptop workstation.

Chronology:

  • Worked on the computer in the morning
  • Locked screen for lunch (at 13:03)
  • Had a long lunch, during which (at 13:20) the panic below was logged in dmesg.
  • Came back (at 15:00) to a black screen and an unresponsive system

Symptoms

  • Black screen, but the mouse pointer still moved (KDE/Plasma). I can't remember whether I entered the lock-screen password or whether no password prompt was shown at all.
  • Attempted to log in as the regular user on tty2; a failed login shows an error, but a successful login hangs before the shell prompt appears.

Diagnosis

  • Logged in as root on tty3 and opened tmux, expecting some commands to hang while keeping a working tty.
  • htop shows a load average of 300 and a kernel thread txg_quiesce in state D.
  • Several other processes are in state D, including tty2's zsh and the 13:30 auto-snapshot taken by sanoid (this zfs snapshot process never finished, but it was spawned after the hang, so it is a consequence rather than a cause).
  • Running sudo -u myuser strace zsh (and the same for bash) shows that both shells hang when trying to access the history files in the home directory. The system root /, the root user's home /root, and the regular user's home /home/user (under the /home dataset) are all separate datasets. Only /home/user seems to be affected.
  • ls -l /home works fine.
  • ls [-l] /home/user hangs forever.
  • ls -l /home/user/.ssh/ works fine ... !!!
  • ls -l /home/.zfs/snapshot/autosnap_..../user works fine.
  • Logs were collected, stored in /root, and copied to an external device. After rebooting, every file created under /root after the crash was gone (including files and directories created at 15:10). I assumed the rebooted system reflects the state of the last txg before the crash, but the systemd journal (in /var/log, another dataset in this pool) appears complete up to 15:30. I have no idea what to take from this.
  • zfs send pool/hostname/DATA/home@autosnap_....... > /media/external/file_home worked fine, and I was able to keep a copy of the most recent snapshot of this dataset (see the send/receive sketch after this list).
  • zfs send pool/hostname/DATA/root > /media/external/file_root hung. It was expected to fail with the error target is busy; if a filesystem, it must not be mounted (and it's not even the dataset with apparent issues), but it hung nonetheless.
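For completeness, a minimal sketch of how the saved stream could later be restored; the destination backup_pool/home_restored is hypothetical (not part of this report) and the snapshot name is abbreviated as above:

# Save the most recent snapshot of the affected dataset to a file on an external device
zfs send pool/hostname/DATA/home@autosnap_... > /media/external/file_home
# Later, restore the stream into a hypothetical destination dataset on another pool
zfs receive backup_pool/home_restored < /media/external/file_home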

Possibly relevant

  • The entire system is in this zpool, including the root filesystem
  • ZFS Native Encryption is enabled, and different sets of datasets are using different keys.
  • At the time of the hang, no send, recv or even snapshot operations were ongoing.
  • zfs-linux was updated yesterday to match the kernel 5.15.2:
    [2021-11-14T04:37:10+0000] [ALPM] upgraded zfs-linux (2.1.1_5.14.14.arch1.1-1 -> 2.1.1_5.15.2.arch1.1-1)
  • zfs rollback was executed for the first time on this pool yesterday afternoon, on the home dataset (the same one that hung today), and the system was not rebooted between that rollback and the crash (see the pool-history check sketched after this list):
    2021-11-14.16:18:44 zfs rollback -r pool/hostname/DATA/home@autosnap_2021-11-14_16:00:00_frequently
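The timestamped entry above appears to be in zpool history format; assuming the pool is simply named pool, as in the feature listing below, recent rollbacks can be confirmed with:

# List rollback operations recorded in the pool's history
zpool history pool | grep rollback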

Describe how to reproduce the problem

This is the first time this has happened; I don't know how to reproduce it.

Include any warning/errors/backtraces from the system logs

dmesg

Nov 15 13:20:05 hostname kernel: VERIFY3(insert_inode_locked(ip) == 0) failed (-16 == 0)
Nov 15 13:20:05 hostname kernel: PANIC at zfs_znode.c:620:zfs_znode_alloc()
Nov 15 13:20:05 hostname kernel: Showing stack for process 698277
Nov 15 13:20:05 hostname kernel: CPU: 0 PID: 698277 Comm: ThreadPoolForeg Tainted: P        W  OE     5.15.2-arch1-1 #1 e3bfbeb633edc604ba956e06f24d5659e31c294f
Nov 15 13:20:05 hostname kernel: Hardware name: Micro-Star International Co., Ltd. XXXX XXX/XX-XXXX, BIOS XXXXXXXX.XXX XX/XX/2014
Nov 15 13:20:05 hostname kernel: Call Trace:
Nov 15 13:20:05 hostname kernel:  dump_stack_lvl+0x46/0x62
Nov 15 13:20:05 hostname kernel:  spl_panic+0xea/0x102 [spl e8217401e849d7dbfbbf4fee11d4c276c980afbe]
Nov 15 13:20:05 hostname kernel:  ? iput+0x6a/0x1f0
Nov 15 13:20:05 hostname kernel:  zfs_znode_alloc+0x6b0/0x7f0 [zfs bd0dc253810420ba0e229ec08eb7ad34ea0ee81a]
Nov 15 13:20:05 hostname kernel:  zfs_mknode+0x89e/0xfa0 [zfs bd0dc253810420ba0e229ec08eb7ad34ea0ee81a]
Nov 15 13:20:05 hostname kernel:  zfs_create+0x691/0x9a0 [zfs bd0dc253810420ba0e229ec08eb7ad34ea0ee81a]
Nov 15 13:20:05 hostname kernel:  zpl_create+0xb6/0x1d0 [zfs bd0dc253810420ba0e229ec08eb7ad34ea0ee81a]
Nov 15 13:20:05 hostname kernel:  path_openat+0xde9/0x1030
Nov 15 13:20:05 hostname kernel:  do_filp_open+0xa5/0x150
Nov 15 13:20:05 hostname kernel:  do_sys_openat2+0xc4/0x190
Nov 15 13:20:05 hostname kernel:  __x64_sys_openat+0x53/0xa0
Nov 15 13:20:05 hostname kernel:  do_syscall_64+0x5c/0x90
Nov 15 13:20:05 hostname kernel:  ? do_syscall_64+0x69/0x90
Nov 15 13:20:05 hostname kernel:  ? do_syscall_64+0x69/0x90
Nov 15 13:20:05 hostname kernel:  ? do_syscall_64+0x69/0x90
Nov 15 13:20:05 hostname kernel:  ? syscall_exit_to_user_mode+0x23/0x50
Nov 15 13:20:05 hostname kernel:  ? do_syscall_64+0x69/0x90
Nov 15 13:20:05 hostname kernel:  ? do_syscall_64+0x69/0x90
Nov 15 13:20:05 hostname kernel:  ? do_syscall_64+0x69/0x90
Nov 15 13:20:05 hostname kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Nov 15 13:20:05 hostname kernel: RIP: 0033:0x7ff09b194634
Nov 15 13:20:05 hostname kernel: Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 16 59 f9 ff 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 89 44 24 0c e8 68 59 f9 ff 8b 44
Nov 15 13:20:05 hostname kernel: RSP: 002b:00007ff0950468f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Nov 15 13:20:05 hostname kernel: RAX: ffffffffffffffda RBX: 00007ff07c72a2db RCX: 00007ff09b194634
Nov 15 13:20:05 hostname kernel: RDX: 00000000000000c2 RSI: 00007ff07c72a270 RDI: 00000000ffffff9c
Nov 15 13:20:05 hostname kernel: RBP: 00007ff07c72a270 R08: 0000000000000000 R09: 0000000000000001
Nov 15 13:20:05 hostname kernel: R10: 0000000000000180 R11: 0000000000000293 R12: 00000000000000c2
Nov 15 13:20:05 hostname kernel: R13: 00007ff07c72a2db R14: 00007ff09b234640 R15: 0000000004119462

Pool features

zpool get all pool | grep feature | sort -k3

pool  feature@encryption             active                         local
pool  feature@hole_birth             active                         local
pool  feature@empty_bpobj            active                         local
pool  feature@enabled_txg            active                         local
pool  feature@spacemap_v2            active                         local
pool  feature@lz4_compress           active                         local
pool  feature@embedded_data          active                         local
pool  feature@project_quota          active                         local
pool  feature@extensible_dataset     active                         local
pool  feature@spacemap_histogram     active                         local
pool  feature@draid                  disabled                       local
pool  feature@livelist               disabled                       local
pool  feature@log_spacemap           disabled                       local
pool  feature@zstd_compress          disabled                       local
pool  feature@device_rebuild         disabled                       local
pool  feature@bookmark_written       disabled                       local
pool  feature@redacted_datasets      disabled                       local
pool  feature@userobj_accounting     disabled                       local
pool  feature@redaction_bookmarks    disabled                       local
pool  feature@edonr                  enabled                        local
pool  feature@skein                  enabled                        local
pool  feature@sha512                 enabled                        local
pool  feature@bookmarks              enabled                        local
pool  feature@bookmark_v2            enabled                        local
pool  feature@large_dnode            enabled                        local
pool  feature@large_blocks           enabled                        local
pool  feature@async_destroy          enabled                        local
pool  feature@device_removal         enabled                        local
pool  feature@resilver_defer         enabled                        local
pool  feature@obsolete_counts        enabled                        local
pool  feature@zpool_checkpoint       enabled                        local
pool  feature@filesystem_limits      enabled                        local
pool  feature@allocation_classes     enabled                        local
pool  feature@multi_vdev_crash_dump  enabled                        local

More information

More information is available on request. Glancing over it, I can't see anything special going on there, so I did not include it in this report, since it's hard to sanitize.

Available information includes ps -axuf, pacman -Qi, pacman -Sl, the contents of /proc/spl/kstat/zfs (including dbgmsg), and the output of the following sysrq commands: show-backtrace-all-active-cpus (l), show-memory-usage (m), show-registers (p), show-all-timers (q), show-task-states (t), show-blocked-tasks (w), dump-ftrace-buffer (z).

Kayvlim added the Type: Defect label on Nov 15, 2021
h1z1 commented Dec 16, 2021

I believe the assert you hit was changed afterwards, so this is possibly still related. Not really sure why it was added in the first place, to be honest.

PaulZ-98 (Contributor) commented Jan 3, 2022

Is process 698277 (the one that panicked) trying to open one of the files that would have been affected by the rollback? During rollback, the file inode(s) are released (via iput) asynchronously, and when that is slow, we've seen variations on this problem.
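For illustration only, a paraphrase of the suspected interleaving, not a confirmed reproducer: insert_inode_locked() returning -16 (-EBUSY) means an inode with the same inode number is still hashed in the VFS inode cache, which fits a still-mounted rollback racing with open files. Names below are taken from this report; the open file is a hypothetical stand-in:

# A long-lived process keeps an inode of the dataset pinned in the inode cache
tail -f /home/user/some-open-file &
# The dataset is rolled back while still mounted; old inodes are dropped asynchronously
zfs rollback -r pool/hostname/DATA/home@autosnap_2021-11-14_16:00:00_frequently
# A create that reuses an object number before the old inode is evicted could then
# hit insert_inode_locked() == -EBUSY in zfs_znode_alloc() and panic
touch /home/user/new-file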

Kayvlim (Contributor, Author) commented Jan 4, 2022

@PaulZ-98: pid 698277 is ThreadPoolForeg. It doesn't show up in any of my logs, so by the time I collected them it was already gone, as was its parent 68016:

task:ThreadPoolForeg state:D stack:    0 pid:698277 ppid: 68016 flags:0x00000000

Some research turns up ThreadPoolForegroundWorker, a Chromium task. Chromium was indeed running at the time, and it had almost certainly been running since before the rollback, so your hypothesis holds up: there's a very good chance Chromium was holding files open in /home since before the rollback.
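For future reference, a quick way to spot processes that still have files open on the dataset's mount point before rolling back (fuser is from psmisc; /home is the mount point from this report):

# Show processes with open files on the filesystem mounted at /home
fuser -vm /home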

stale bot commented Jan 5, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Jan 5, 2023
stale bot closed this as completed on Apr 7, 2023
duramuss commented

Quoting the original report:

  • zfs rollback was executed for the first time on this pool yesterday afternoon, on the home dataset (the same one that hung today), and the system was not rebooted between that rollback and the crash:
    2021-11-14.16:18:44 zfs rollback -r pool/hostname/DATA/home@autosnap_2021-11-14_16:00:00_frequently

This is the cause of the crash.
I hit this yesterday: the txg_quiesce process and all processes that tried to write after it were in state D (except journald, I don't know why).
The dmesg output is similar to the OP's:

nov. 25 21:05:33 duranux2 kernel: VERIFY3(insert_inode_locked(ip) == 0) failed (-16 == 0)
nov. 25 21:05:33 duranux2 kernel: PANIC at zfs_znode.c:627:zfs_znode_alloc()
nov. 25 21:05:33 duranux2 kernel: Showing stack for process 2208457
nov. 25 21:05:33 duranux2 kernel: CPU: 1 PID: 2208457 Comm: bash Tainted: P        W  O       6.4.16 #1
nov. 25 21:05:33 duranux2 kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
nov. 25 21:05:33 duranux2 kernel: Call Trace:
nov. 25 21:05:33 duranux2 kernel:  <TASK>
nov. 25 21:05:33 duranux2 kernel:  dump_stack+0x3a/0x50
nov. 25 21:05:33 duranux2 kernel:  spl_panic+0xbf/0xe0 [spl]
nov. 25 21:05:33 duranux2 kernel:  ? kmem_cache_alloc+0x38/0x1a0
nov. 25 21:05:33 duranux2 kernel:  ? zfs_rangelock_init+0x50/0x60 [zfs]
nov. 25 21:05:33 duranux2 kernel:  ? zfs_znode_cache_constructor+0x101/0x140 [zfs]
nov. 25 21:05:33 duranux2 kernel:  ? spl_kmem_cache_alloc+0xa3/0x610 [spl]
nov. 25 21:05:33 duranux2 kernel:  ? iput.part.0+0x4e/0x1c0
nov. 25 21:05:33 duranux2 kernel:  zfs_znode_alloc+0x74b/0x7d0 [zfs]
nov. 25 21:05:33 duranux2 kernel:  zfs_mknode+0x663/0xec0 [zfs]
nov. 25 21:05:33 duranux2 kernel:  zfs_create+0x6b5/0x940 [zfs]
nov. 25 21:05:33 duranux2 kernel:  zpl_create+0xcb/0x1e0 [zfs]
nov. 25 21:05:33 duranux2 kernel:  path_openat+0xee3/0x11c0
nov. 25 21:05:33 duranux2 kernel:  do_filp_open+0xa0/0x130
nov. 25 21:05:33 duranux2 kernel:  ? __check_object_size+0x14b/0x260
nov. 25 21:05:33 duranux2 kernel:  __x64_sys_openat+0x18f/0x270
nov. 25 21:05:33 duranux2 kernel:  do_syscall_64+0x60/0x90
nov. 25 21:05:33 duranux2 kernel:  ? exc_page_fault+0x395/0x950
nov. 25 21:05:33 duranux2 kernel:  ? do_syscall_64+0x6c/0x90
nov. 25 21:05:33 duranux2 kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0xb5
nov. 25 21:05:33 duranux2 kernel: RIP: 0033:0x7f41a966f25e
nov. 25 21:05:33 duranux2 kernel: Code: 83 e2 40 75 4f 89 f0 f7 d0 a9 00 00 41 00 74 44 80 3d b5 92 13 00 00 74 68 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 8e 00 00 00 48 8b 54 24 28 64 48 2b 14 25
nov. 25 21:05:33 duranux2 kernel: RSP: 002b:00007ffe6a747770 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
nov. 25 21:05:33 duranux2 kernel: RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007f41a966f25e
nov. 25 21:05:33 duranux2 kernel: RDX: 0000000000000241 RSI: 000055a51c8aa790 RDI: 00000000ffffff9c
nov. 25 21:05:33 duranux2 kernel: RBP: 000055a51c8aa790 R08: 0000000000000000 R09: 0000000000000020
nov. 25 21:05:33 duranux2 kernel: R10: 00000000000001b6 R11: 0000000000000202 R12: 0000000000000000
nov. 25 21:05:33 duranux2 kernel: R13: 0000000000000001 R14: 000055a51c8aa790 R15: 000055a51aac9320
nov. 25 21:05:33 duranux2 kernel:  </TASK>
nov. 25 21:10:11 duranux2 fcron[975273]: Job 'xxx' completed

Around 20:40 the same day, I had performed rollbacks on two filesystems without unmounting them first, to return to a clean state before restarting compile jobs (the pool had filled up around 20:35 and some files were truncated). In the past, zfs rollback would occasionally hang when run on a filesystem that had not been unmounted first.
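A conservative sequence that may avoid this race, assuming the dataset can briefly be taken offline (dataset and snapshot names are the ones from the original report):

# Unmounting first makes the VFS drop the dataset's cached inodes before the rollback
zfs unmount pool/hostname/DATA/home
zfs rollback -r pool/hostname/DATA/home@autosnap_2021-11-14_16:00:00_frequently
zfs mount pool/hostname/DATA/home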

uname -a
Linux duranux2 6.4.16 #1 SMP PREEMPT_DYNAMIC Fri Nov 10 14:26:34 CET 2023 x86_64 GNU/Linux
zfs --version
zfs-2.1.13-1
zfs-kmod-2.1.13-1
