Kernel hang on pool scrub Ubuntu 12.04, zfs 0.6.1 using PPA and git HEAD #1515
Cloned both the spl and zfs repos, and after replacing the Ubuntu 0.6.1 PPA debs got the same results (system freeze, followed by nothing in the logs after a forced reboot) by pounding on the ZVOL using CrystalDiskMark for Windows. The iSCSI target was LIO. ZFS and SPL v0.6.1-34_g0377189. Very interested in how I can trick this situation into providing debug data.
It is possible that your hardware is failing to respond, which could cause this. cc92e9d introduced a deadman thread that will cause a kernel panic upon detection of such failures. It should be in the daily PPA. Would you give that a try and report back? If the deadman thread induces a panic, there is an issue with your hardware.
Installed the PPA daily for precise; however, it looks like the packages are identical to those of the PPA stable. Went back to the repos I used for testing with the built-from-source debs:
So if it was the hardware hanging, it should have pinged when running zfs/HEAD and spl/HEAD. Is it worth me setting up a serial console for the kernel messages? I could also rebuild the kernel for immediate printk, I suppose. I shall go and play with putting all the disks onto one controller and seeing if I can break it that way, in case there is a difference of opinion between the mptsas controller and the onboard sata_nv.
Rather than go to the trouble of setting up a kernel watchdog and serial terminal, I went and installed CentOS 6 and the zfs 0.6.1 packages. I have been hammering it with two bonnie++ instances in async while loops and repeated scrubs. No lockups at all. This data point suggests either a regression in the drivers on kernel 3.2 or a bug with zfs on Ubuntu. Where should I go next (although I am tempted to close with -ECENTOSWORKSFORME :) )?
Issue #1482 was actually observed on CentOS 6/zfs 0.6.1 with OpenVZ on top of that.
The system has tons (128GB) of memory, but it's ECC, and we ran memtest86+ for a while when setting it up with no errors reported. So, I'm guessing, intermittent memory problems don't seem to be the culprit. There were two series of 3 lockups each, both under quite heavy load due to about 30 remote mysql clients sending about 2TB of data into the server set up in one of the OpenVZ containers; it seems there were no adverse effects, as the box seemed to be operating just fine after each incident. All I noticed after the first series was the average load (as reported by top) staying at 3 afterwards, even though the system was idling with the CPU(s) staying at close to 100% idle for quite a long time. After rebooting, the load went back to 0, and it stayed there after the second incident even without rebooting. If anyone's interested, I can provide traces of the second incident still lingering in dmesg output (and whatever else might be useful).
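A load average that stays at 3 on an otherwise idle box is consistent with tasks blocked in uninterruptible sleep (state `D`), which Linux counts toward the load average. Below is a minimal Python sketch, assuming a Linux `/proc` filesystem, for listing such processes. The helper names `parse_stat` and `blocked_tasks` are illustrative only and are not part of any ZFS tooling; the sketch inspects processes, not their individual threads.

```python
import os

def parse_stat(stat_line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    The comm field is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')' rather than on spaces.
    """
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state

def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked

if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the names printed during such an episode turn out to be ZFS or SPL kernel threads, that would be worth including alongside the dmesg traces.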
I reported issue #1364. I now suspect that it could be a stack overflow issue. I have been monitoring stack usage of my system since that time. Under "my load conditions" the stack averages 6000, with daily peaks of greater than 7000. Given that standard kernels are compiled with 8k stacks, you can see that high or demanding load conditions might overflow the stack and cause the machine to crash. And at crash time it may be any kernel process that drove the stack over the limit, making it hard to see cause and effect. There are a few open issues about zfs stack usage, and one of them shows how to compile a kernel with 16k stacks. I have not done this yet, but you might give it a try to see if your crashes go away.
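The headroom arithmetic behind this concern can be sketched as follows. This is a minimal Python illustration using the figures quoted above (8k stacks, an average of about 6000, peaks above 7000), assuming those figures are in bytes. `stack_headroom` is a hypothetical helper, and the comment does not say how the stack depth was measured.

```python
KERNEL_STACK_BYTES = 8 * 1024  # 8k stacks, as described in the comment above

def stack_headroom(depth_bytes, stack_bytes=KERNEL_STACK_BYTES):
    """Return (bytes remaining, fraction of the stack in use)."""
    if depth_bytes < 0 or stack_bytes <= 0:
        raise ValueError("depth must be >= 0 and stack size must be > 0")
    return stack_bytes - depth_bytes, depth_bytes / stack_bytes

if __name__ == "__main__":
    for label, depth in (("average", 6000), ("peak", 7000)):
        remaining, used = stack_headroom(depth)
        print(f"{label}: {remaining} bytes left, {used:.0%} used")
    # The same peak measured against a 16k stack leaves far more room.
    remaining, used = stack_headroom(7000, 16 * 1024)
    print(f"peak on 16k: {remaining} bytes left, {used:.0%} used")
```

Under these assumptions a 7000-byte peak leaves 1192 bytes of an 8192-byte stack, which is why a slightly deeper call chain under heavy load could plausibly tip it over.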
@lukasz99 Any stacks which you can post would be helpful.
Bug is completely reproducible on kernel 3.7.10-1.24-default openSuse 13.1 with zfs 0.6.2 (fresh download). Process: No logging at all: the VM just hangs in most cases. In the cases where the VM is still working, the IO is stuck, so not a lot can be done. Running the same program on the same server but with ext3 as the disk format: no problems at all, 100% stable. I can make the VM with the database available for analysis. Can this bug be prioritized? It can become 100% blocking for using mysql on zfs.
Deadman panic sounds like a bad idea. I believe in production, it should
@jurt1235 Can you try to reproduce this issue with the latest source from master? If so, are any errors logged to dmesg? @aarcane I tend to agree. The deadman referenced above was tweaked for Linux just to post a zevent. Once #2085 gets merged it would be trivial to add a script which grabs debug information and panics the node if desired. For HA solutions I can see why that might be desirable.
Test has been running for 6 hours now with the version I pulled from Git yesterday, no problems so far:
For me git HEAD is much better than 0.6.2 in both performance and stability (my last update was about a week ago). Perhaps you could release a 0.6.3? 0.6.2 is broken on Ubuntu 12 LTS and the latest Ubuntu 13. Matthew
Apologies, mistakenly closed.
I can't reproduce it at all - to solve this I haven't used anything but Matthew
I have not seen this in quite some time; I'm not the OP, but from my perspective this issue can be closed as stale.
OK, then I'm going to tentatively close this. If anyone sees this issue again we'll reopen it or file a new issue.
Thanks for all your work on ZFS by the way. Matthew On 6/12/2014 12:16 AM, Brian Behlendorf wrote:
I am suffering from the same symptoms as #1364 so I thought I would resurrect this bug. If I trigger a scrub (or put the system under heavy load) the system hangs with the HDD light stuck on. I was not even able to get the kernel to record a crash file. The logs don't seem to reveal anything.
**May be related to bug #1482 as well?**
Ubuntu 12.04 LTS.
ZFS is the Ubuntu PPA:
ZFS setup is two mirrors split across two disk controllers. As you can see I have no CRC errors:
I have a ZVOL fish exported via iSCSI:
dmesg: