VERIFY(HDR_EMPTY(hdr)) failed PANIC at arc.c:6787:arc_release() #12020
It happened again~
Well, there goes my "caused by server under heavy load" theory... (It just happened right after a reboot, where the only thing running was an "apt install [ZFS 2.0.3]")
...oh boy, this time on 2.0.X! (Specifically, 2.0.3-1~bpo10+1) I'm pretty tempted to blame NFS for this, since using NFS aggressively is the only thing that's really changed recently...
...well, uh, now my machine panics within 5 minutes of booting, even if I don't actively run anything on it, so I'm pretty confident this is an interaction with NFS.
This didn't turn up for a long time, ever since I stopped having my Linux/sparc64 system chroot over NFS. Then tonight, with no sparc64 NFS consumer in sight...
edit: Well, it happened while the machine reported no NFS clients connected, so it's not solely an NFS complication. Maybe it's just whoever tries to access the buffer first after whatever problem creates it? It's odd though, because I can go months without seeing it, and then see it a couple times in a short interval. I'd blame my RAM, but ECC hasn't kicked up even a single bit error on this system. So I guess heavily workload-dependent? I was about to try reproducing this on a testbed in a VM when the most recent one happened, so I'll just go back to doing that.
Oh boy oh boy, I was very nice, and for my birthday I got a kdump of this happening in a VM and not my actual server. I still really want to know why I can reproduce this consistently by doing very little IO from my little Netra T1 inside a chroot on NFS, but a whole chroot debootstrap and Linux kernel build inside it from an x86_64 host doesn't make it blink... Presumably I could sniff all the traffic from the sparc64 system doing it and an x86_64 system using qemu-user-static to do it and see what differs. If anyone's any better at
Fascinatingly, afa7b34 repros this (using the sparc64 NFS method) on buster (4.19.0-16-amd64), but not bullseye (5.10.0-7-amd64) or Fedora 34 (5.12.9-300.fc34.x86_64). (I'll eventually try the "VM with virtio" method, but since that involves nesting virtualization, that's potentially a deep can of worms.) Coincidence? Something subtly wrong in the NFSv4 codepath that got fixed? Compiler doing something differently? Who knows! But if it's reachable by two different routes, it seems feasible that something could be going wrong inside OpenZFS, not just because something is misusing a function. Now that I have it reliably reproducing in a VM, I'll explore that...
Well, after some poking around, I just took this suggestion and commented the failing VERIFY out. (gulp) So far, the testbed I've done this in hasn't caught fire in any noticeable way, and I know it's hit the changed behavior because I added an if(!HDR_EMPTY(hdr)) { zfs_dbgmsg(...); } in its place. So after I do a bunch of writing over NFS, I'll move on to the other way I know to trigger this - kvm with virtio. We'll see how well x86 nested virtualization works...
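For anyone following along, here is a rough userspace sketch of the "log instead of panic" swap described above. It is not the actual arc.c change: mock_hdr_t, MOCK_HDR_EMPTY, mock_dbgmsg and mock_release_check are hypothetical stand-ins invented for illustration, and the emptiness test is only an assumed shape for the real HDR_EMPTY check.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an ARC buffer header; NOT the real arc_buf_hdr_t. */
typedef struct mock_hdr {
	uint64_t dva_word[2];
} mock_hdr_t;

/* Assumed emptiness test for illustration: a header is "empty" when its
 * identity words are all zero. */
#define	MOCK_HDR_EMPTY(hdr) \
	((hdr)->dva_word[0] == 0 && (hdr)->dva_word[1] == 0)

/* Counts logged messages so a caller can confirm the new path was hit. */
unsigned int mock_dbgmsg_count = 0;

/* Hypothetical stand-in for a debug-message logger: prints to stderr. */
void
mock_dbgmsg(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	mock_dbgmsg_count++;
}

/*
 * Instead of aborting when the header is not empty (what a VERIFY-style
 * check would do), record the event and keep going.
 * Returns true if the header was empty, false if the condition was logged.
 */
bool
mock_release_check(const mock_hdr_t *hdr)
{
	assert(hdr != NULL);

	if (!MOCK_HDR_EMPTY(hdr)) {
		mock_dbgmsg("release: header not empty (%llx:%llx)",
		    (unsigned long long)hdr->dva_word[0],
		    (unsigned long long)hdr->dva_word[1]);
		return (false);
	}
	return (true);
}
```

The counter is only there so a test run can tell that the non-empty case was actually reached, which is the same reason for adding the log message in the first place.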
Removing the What's critical is that the following are true for the anonymous buffer (all of which are ASSERTed).
@rincebrain as long as nothing odd shows up in your testing I'd suggest opening a PR which drops the ASSERT and we'll get another set of eyes on it.
Unfortunately, there was an overzealous assertion that was (in pretty specific circumstances) false, causing failure. Let's not, and say we did. Closes: openzfs#9897 Closes: openzfs#12020 Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Unfortunately, there was an overzealous assertion that was (in pretty specific circumstances) false, causing failure. This assertion was added in error, so we're removing it. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#9897 Closes openzfs#12020 Closes openzfs#12246
(This might be the same as #9897, but since the stack seemed entirely different, I figured I'd file it separately.)
System information
Describe the problem you're observing
While using my headless server with a couple of "long-running" (a day or so) kvm VMs, I got an exciting PANIC in my console, not obviously related to anything I was doing at the moment.
Describe how to reproduce the problem
???
Include any warning/errors/backtraces from the system logs