Kernel Panic 2.0.6 every couple hours (and 2.1.1) #12634
Comments
What's the workload on these pools?
Datastore isn't used much, but the others are heavily used.
Has this happened more than once? If so, are the panics any different the other times?
It's happening pretty consistently: here is another one, and I'll post them as they come.
Can you try 2.1.1 at some point, just as another data point? I don't recall any fixes in 2.1.X that corrected something that looked like this, offhand, but I am far from an omniscient oracle.
I actually downgraded from 2.1.x to this due to the same type of issue.
Has this always happened since you set the system up (I presume recently, in that case) or is this a recent development in a longer-running system?
The system has been up for 2 years. I have recently been trying to use ZFS on it. I've been using ZFS for 2 months, and in the past month, this has been an issue.
As soon as I get the tank pool back to a non-degraded state, I'll upgrade again and see how that goes.
2.1.1-1:
It has crashed 2 more times since this, but it's not capturing the dump in syslog. I have a theory: in Unraid, we have 2 virtualization technologies in use, Docker and KVM (for virtual machines). I have both disabled currently, and I'm not getting crashes. I enabled KVM and I get a crash within 10 minutes or so (reboot, crashes again, rinse/repeat). I wonder if that is what is going on; the vDisks are on the /fast/vms dataset. These are 4 NVMe drives in raidz.
That's quite curious. I haven't heard a wave of people screaming about KVM being on fire. Is ZFS inside the KVM instance(s), or "just" providing the block storage? What driver(s) is KVM using for the zvols - virtio of some flavor, or ?
It's just providing block storage. KVM is using virtio. I'm not super knowledgeable in this realm. I heard that NVMe-based vdevs are finicky and thought maybe the two aren't playing nicely... I have KVM up now, but with no VMs running, and it's fine.
Oh boy. virtio-blk or virtio-scsi? I don't expect NVMe to be particularly exciting here.
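One way to check, assuming the VMs are defined through libvirt (the domain name below is hypothetical), is to look at the disk definitions in the domain XML:

```
# Dump the libvirt domain XML for a VM and look at its disk entries
# ("vm1" is a placeholder domain name):
virsh dumpxml vm1 | grep -A5 '<disk'
# virtio-blk disks show <target dev='vda' bus='virtio'/>;
# virtio-scsi disks show bus='scsi' together with a
# <controller type='scsi' model='virtio-scsi'> element.
```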
KVM and Docker aren't running at all.....
That seems very, very mad - like, trying to run random things that are not executable code mad, which makes me think this might not be an issue specific to OpenZFS on this system... It's certainly not impossible, but those are pretty unusual failure modes. There's been #11937, where it was trying to run illegal AVX512 opcodes because it assumed everybody with AVX512 at all had all of them, or something, or #12543, where it seems to be a very reproducible and specific crash under Docker, but yours just seems to be crashing anywhere and everywhere now, which...seems like a very bad issue. Can you dig out the very, very start of this series of nightmarish errors in the logs? Because the rest seem to be similarly of the form "something in-memory is pointing to somewhere insane so whenever I try running it fire ensues", but I'm curious if there's anything more informative when it first goes mad. I'd also be kind of curious to see if kASAN (which would require recompiling your whole kernel) would spot anything exciting.
Here is the whole thing (starting at the madness).
On a side note, I would be willing to do what is needed to get this working. My Linux skills are intermediate.
If it were me, I'd probably try booting a distro that other people can readily reproduce problems with (or not) like Debian/Ubuntu/Fedora/CentOS/..., and see if your problems persist. If so, great. If not, that sounds like an issue to raise with Unraid. (Note that I'm not expecting you to have trouble reproducing them unless Unraid does truly exciting things in their kernel, just explaining the flowchart I'd be working through...)
Getting Ubuntu 20.04.3 LTS on a USB and live booting it...
OK, installed Ubuntu Server 21.04 with ZFS 2.0.2. Scrubbing tank, mounted Samba shares. What else should I do to try and cause the issue?
I'd probably try vanilla 2.0.6/2.1.1 so we're comparing apples to apples, but just using the system the way you normally would is best - whether that means with Docker and KVM or whatever, so be it. You could try triggering a scrub if you're out of other ways to load it down.
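For reference, kicking off and watching a scrub is just (the pool name is the tank pool from this thread):

```
# Start a scrub on the pool and then check its progress/results:
zpool scrub tank
zpool status -v tank
```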
I would like to upgrade to 2.1.1... don't know how.
I got the storm of stack traces without Docker or KVM running, and with 2 different kernels.
root@ubuntu:/datastore# zfs version
Just build and install it; in particular, be aware of the Ubuntu workaround required. How were you changing the version on Unraid?
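For anyone following along, a rough sketch of a from-source build on Ubuntu; the release tag is an assumption, the build dependencies listed in the OpenZFS "Building ZFS" docs need to be installed first, and the Ubuntu-specific workaround mentioned above is not reproduced here:

```
# Fetch the source and check out the release being tested
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.1      # assumed tag name for the 2.1.1 release

# Configure, build, and install the userland tools and kernel modules
sh autogen.sh
./configure
make -s -j"$(nproc)"
sudo make install
sudo ldconfig
sudo depmod
```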
I can't say with any confidence that I have any idea what you said, lol.
The checksum errors include metadata about which parts were wrong. I was remarking that I found it curious there seemed to be some overlap between the regions and some regions that were always fine, on different checksum errors, but that since there were so few possibilities and so few samples to begin with, it could just be a coincidence.
Is this something that I need to worry about? If I put this b
Can I just clear?
They're corrected, so you can clear them. But if you scrub again immediately and find more, you should probably be concerned. In general, if you get checksum errors and don't know where they're from, you should figure it out - they're not supposed to randomly crop up.
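As a minimal sketch of that workflow, using the datastore pool from this thread:

```
zpool status -v datastore   # see which vdevs/files the checksum errors touched
zpool clear datastore       # reset the error counters once you've noted them
zpool scrub datastore       # re-scrub; fresh errors afterwards would be a red flag
```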
I've scrubbed and cleared datastore; the checksums went to 0, now they are back:
Is the system actually dead at that point, or are those just "hey it took too long but things are running"?
It's dead. Can't SSH, can't manually log in; have to press the power button.
But it's scrolling like crazy.
I wonder if it's managed to log anything from the start or just locked up too hard...
That seems to be from booting the non-kASAN kernel. What does, say, journalctl -b -1 show?
Assuming that's the whole thing, that's still not the right kernel, and it's incredibly truncated. Can you try -b -2, -b -3, and so on, until you find one with whatever you named the kasan kernel, not 5.11.0-37-generic, at the top?
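For reference, a way to enumerate prior boots and see which kernel each one ran (the boot offset below is just illustrative):

```
# List the boots journald knows about, most recent last:
journalctl --list-boots
# Pull the kernel messages for a specific earlier boot; the first lines
# include the kernel version string that booted:
journalctl -k -b -2 | head -n 50
```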
5.11 is the kasan kernel. I tried 5.13 but had hell with it and went with a different version. It's not the whole thing; I'm having trouble getting the whole thing (PuTTY buffer, I guess).
Oh, you just...kept the Ubuntu name suffix. Okay. It seems pretty odd that the prior kernel doesn't mention anything about KASAN before NULL dereferencing deep in the non-ZFS IO stack. Could you share the .config you used?
I'm not sure what happened, but CONFIG_KASAN got commented out between uncommenting it and building... I'll fix it and come back to this.
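One way to confirm the option actually made it into the build (the config path shown is the usual Debian/Ubuntu location; adjust if the kernel was installed differently):

```
# Check the installed config for the running kernel:
grep CONFIG_KASAN= /boot/config-"$(uname -r)"
# Or check the .config in the source tree used for the build:
grep CONFIG_KASAN= .config
```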
OK, after fighting with it this weekend, I finally got 5.15.0-rc5 compiled and loaded. Quick question before I start stress testing again: do I need to reinstall ZFS now that I have the new kernel?
You could try it with the 2.1.1 version it got from the PPA-provided zfs-dkms packages, I suppose, since it seems to have built and installed. Not sure why you opted to jump for an -rc vanilla kernel, but ¯\_(ツ)_/¯.
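A quick sanity check that the DKMS-built module actually followed the new kernel (assuming the PPA zfs-dkms package is what's installed):

```
# Show which kernels the zfs DKMS module has been built/installed for:
dkms status zfs
# Once the module is loaded, confirm which version the kernel is running:
cat /sys/module/zfs/version
```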
I was trying everything... I honestly have about a 60% grasp on what I am doing and spent the whole weekend learning about compiling kernels. I'm about to stress test again, but I want to ensure that I'm running the correct version and debug version of things. Can you assist with that? What am I looking for to verify KASAN is working and the ZFS version has debug enabled?
If you want to see if the module was built with --enable-debug, it'll print something like this in dmesg on load:
For me, in order to make my DKMS modules build with that, I have in /etc/sysconfig/zfs the line:
If you want to go rebuild the existing 2.1.1 DKMS module after setting that, I think you get to do
If you want to go build+install from git, these instructions work well, and if they don't or are unclear somewhere, feel free to ask.
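A hedged reconstruction of the steps the comment above alludes to (the sysconfig variable name and the DKMS module version are assumptions on my part):

```
# Ask the zfs-dkms packaging to build with debugging enabled (assumed variable name):
echo "ZFS_DKMS_ENABLE_DEBUG='yes'" | sudo tee -a /etc/sysconfig/zfs
# Rebuild and reinstall the existing DKMS module (version assumed to be 2.1.1):
sudo dkms remove zfs/2.1.1 --all
sudo dkms install zfs/2.1.1
# After reloading the module, the load message in dmesg should indicate a debug build.
```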
OK, so I have a theory. I think the issue only presents itself when a scrub is running (where there are legitimate errors) and there is heavy IOPS. Since my 'tank' pool finally finished scrubbing, I have had no issues. I went back to the original setup and have been running it for 24 hours; no issues. I think a Docker image caused issues (panics) that, in turn, caused my pools to lose sync, and it became a perpetual issue. I will continue to monitor, and if the issue presents itself again, I'll hopefully have the skills in place to troubleshoot/debug/report, instead of trying a new distro, fumbling about with kernel configs and compiling, and not knowing how to load/install a module.
I would suggest experimenting with Fedora 34 (officially supported repos in the docs), and running https://www.memtest86.com just to rule out bad memory.
Trying Fedora for another data point wouldn't be harmful, but if it also reproduces on ordinary Ubuntu I wouldn't expect any additional information. memtest isn't an unreasonable experiment. I thought it had been run already here, but apparently that was some other issue.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Kernel panics. I'm not sure what is causing them.
Describe how to reproduce the problem
Daily usage, migrating large documents.
Include any warning/errors/backtraces from the system logs