Kernel hang on pool scrub Ubuntu 12.04, zfs 0.6.1 using PPA and git HEAD #1515

Closed
mattaw opened this issue Jun 13, 2013 · 18 comments

@mattaw

mattaw commented Jun 13, 2013

I am suffering from the same symptoms as #1364 so I thought I would resurrect this bug. If I trigger a scrub (or put the system under heavy load) the system hangs with the HDD light stuck on. I was not even able to get the kernel to record a crash file. The logs don't seem to reveal anything.

** May be related to bug #1482 as well? **

Ubuntu 12.04 LTS.

Linux mythmaster 3.2.0-45-generic #70-Ubuntu SMP Wed May 29 20:12:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

ZFS is the Ubuntu PPA:

ii  spl                                  0.6.1-1~precise                          Solaris Porting Layer utilities for Linux
ii  spl-dkms                             0.6.1-1~precise                          Solaris Porting Layer kernel modules for Linux
ii  dkms                                 2.2.0.3-1ubuntu3.1+zfs6~precise1         Dynamic Kernel Module Support Framework
ii  libzfs1                              0.6.1-1~precise                          Native ZFS filesystem library for Linux
ii  mountall                             2.36.4-zfs2                              filesystem mounting tool
ii  ubuntu-zfs                           7~precise                                Native ZFS filesystem metapackage for Ubuntu.
ii  zfs-dkms                             0.6.1-1~precise                          Native ZFS filesystem kernel modules for Linux
ii  zfsutils                             0.6.1-1~precise                          Native ZFS management utilities for Linux

ZFS setup is two mirrors split across two disk controllers. As you can see, I have no checksum errors:

root@mythmaster:/var/crash# zpool status
  pool: tank
 state: ONLINE
  scan: scrub in progress since Wed Jun 12 22:42:59 2013
    6.32G scanned out of 406G at 16.2M/s, 7h1m to go
    0 repaired, 1.56% done
config:

    NAME                                          STATE     READ WRITE CKSUM
    tank                                          ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2585130  ONLINE       0     0     0
        scsi-350014ee25aa3f34f                    ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2585499  ONLINE       0     0     0
        scsi-350014ee2aff99e83                    ONLINE       0     0     0
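
The progress line in the status output above can be checked by hand. The sketch below is not part of ZFS; it only redoes the arithmetic on the reported figures, assuming the size suffixes are powers of 1024. `to_bytes` and `scrub_progress` are hypothetical helper names.

```python
import re

# Assumption: zpool prints sizes with binary (1024-based) suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def to_bytes(text):
    """Convert a size such as '6.32G' or '16.2M' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]

def scrub_progress(line):
    """Return (percent_done, seconds_remaining) for a 'scanned out of' line."""
    m = re.search(
        r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s", line
    )
    if m is None:
        raise ValueError("not a scrub progress line")
    scanned, total, rate = (to_bytes(g) for g in m.groups())
    return 100.0 * scanned / total, (total - scanned) / rate

# The line reported in the zpool status output above.
line = "6.32G scanned out of 406G at 16.2M/s, 7h1m to go"
percent, seconds = scrub_progress(line)
hours, rem = divmod(int(seconds), 3600)
print(f"{percent:.2f}% done, {hours}h{rem // 60}m to go")
# prints "1.56% done, 7h1m to go"
```

The result agrees with the figures zpool printed, so the scrub was progressing at a steady rate when the status was captured.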

I have a ZVOL fish exported via iSCSI:

root@mythmaster:/var/crash# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank       1.03T  2.54T  1002M  /tank
tank/fish  1.03T  3.17T   405G  -

dmesg:

root@mythmaster:/var/crash# dmesg | grep SPL
[   16.681556] SPL: Loaded module v0.6.1-rc14
[   27.259078] SPL: using hostid 0x007f0101

root@mythmaster:/var/crash# dmesg | grep ZFS
[   27.239796] ZFS: Loaded module v0.6.1-rc14, ZFS pool version 5000, ZFS filesystem version 5
@mattaw
Author

mattaw commented Jun 13, 2013

Cloned both the spl and zfs repos and, after replacing the Ubuntu 0.6.1 PPA debs, got the same results (system freeze followed by nothing in the logs after forced reboot) by pounding on the ZVOL using CrystalDiskMark for Windows.

The iSCSI target was LIO.

ZFS and SPL v0.6.1-34_g0377189

Very interested in how I can trick this situation into providing debug data.

@ryao
Contributor

ryao commented Jun 13, 2013

It is possible that your hardware is failing to respond, which could cause this. cc92e9d introduced a deadman thread that will cause a kernel panic upon detection of such failures. It should be in the daily PPA. Would you give that a try and report back? If the deadman thread induces a panic, there is an issue with your hardware.

@mattaw
Author

mattaw commented Jun 14, 2013

Installed the PPA daily for precise; however, it looks like the packages are identical to those of the PPA stable.
Reboot
Initiate scrub. ping from another machine while logged in as root.
Machine left with hdd light stuck on and unresponsive over the network.
No response to clicking the power button, hard power off and on required to bring back.
No kernel crash log, stack dumps in dmesg or syslog.

Went back to the repos I used for testing with the built-from-source debs:

git log cc92e9d
commit cc92e9d0c3e67a7e66c844466f85696a087bf60a 
Author: George.Wilson <george.wilson@delphix.com>
Date:   Mon Apr 29 15:49:23 2013 -0700

    3246 ZFS I/O deadman thread

So if it was the hardware hanging, it should have panicked when running zfs/HEAD and spl/HEAD.

Is it worth me setting up a serial console for the kernel messages? I could also rebuild the kernel for immediate printk I suppose. I shall go and play with putting all the disks onto one controller and seeing if I can break it that way in case there is a difference of opinion between the mptsas controller and the onboard sata_nv.

@mattaw
Author

mattaw commented Jun 14, 2013

Rather than go to the trouble of setting up a kernel watchdog and serial terminal, I went and installed CentOS 6 and the zfs 0.6.1 packages.

I have been hammering it with two bonnie++ instances in async while loops and repeated scrubs. No lockups at all.

This data point suggests either a regression in the drivers on kernel 3.2 or there is a bug with zfs on Ubuntu. Where should I go next (although I am tempted to close with -ECENTOSWORKSFORME :) )

@lukasz99

Issue #1482 was actually observed on CentOS 6/zfs 0.6.1 with OpenVZ on top of that

uname -a
Linux vzh5.dip.mbi.ucla.edu 2.6.32-042stab076.8 #1 SMP Tue May 14 20:38:14 MSK 2013 x86_64 x86_64 x86_64 GNU/Linux

The system has tons (128GB) of memory, but it's ECC, plus we ran memtest86+ for a while when setting it up with no errors reported. So, I'm guessing, intermittent memory problems don't seem to be the culprit.

Two series of 3 lockups each happened, each time under quite heavy load from about 30 remote mysql clients sending about 2TB of data into the server set up in one of the OpenVZ containers. There seemed to be no adverse effects: the box appeared to be operating just fine after each incident. All I noticed after the first series was the load average (as reported by top) staying at 3 afterwards, even though the system was idling with the CPUs close to 100% idle for quite a long time. After rebooting, the load went back to 0, and it stayed there after the second incident even without rebooting.

If anyone's interested I can provide traces of the second incident still lingering in dmesg output (and whatever else might be useful)

@ColdCanuck
Contributor

I reported issue #1364. I now suspect that it could be a stack overflow issue.

I have been monitoring stack usage of my system since that time. Under "my load conditions" the stack averages 6000, with daily peaks of greater than 7000. Given that standard kernels are compiled with 8k stacks, you can see that high or demanding load conditions might overflow the stack and cause the machine to crash. And at crash time it may be any kernel process which drove the stack over the limit, making it hard to see cause and effect. There are a few open issues about zfs stack usage, and one of them shows how to compile a kernel with 16k stacks. I have not done this yet but you might give it a try to see if your crashes go away.
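
To put the numbers above in perspective, here is a minimal sketch of the headroom they leave. It assumes the 6000 and 7000 figures are bytes and that the kernel stack is exactly 8192 bytes; the comment does not state the unit or how the values were measured, and `headroom` is a hypothetical helper.

```python
THREAD_SIZE = 8192  # assumption: "8k stacks" means 8192 bytes per kernel stack

def headroom(depth_bytes, stack_size=THREAD_SIZE):
    """Return (bytes_free, percent_used) for an observed stack depth."""
    return stack_size - depth_bytes, 100.0 * depth_bytes / stack_size

# Figures quoted in the comment above, assumed to be bytes.
for label, depth in (("average", 6000), ("daily peak", 7000)):
    free, used = headroom(depth)
    print(f"{label}: {depth} bytes used ({used:.0f}%), {free} bytes free")
# prints:
# average: 6000 bytes used (73%), 2192 bytes free
# daily peak: 7000 bytes used (85%), 1192 bytes free
```

On kernels built with the ftrace stack tracer (`CONFIG_STACK_TRACER`), the deepest stack seen so far can be read from `/sys/kernel/debug/tracing/stack_max_size` after writing 1 to `/proc/sys/kernel/stack_tracer_enabled`; whether that is the tool used here is not stated.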

@behlendorf
Contributor

@lukasz99 Any stacks which you can post would be helpful.

@nvnobelen

Bug is completely reproducible on kernel 3.7.10-1.24-default (openSUSE 13.1) with zfs 0.6.2 (fresh download).

Process:
Start database mysql with innodb on zfs partition.
Start program using the database under normal load. Machine hangs within 1 to 30 minutes.
Sometimes machine still responds, but IO will be locked, load will just go up, kill -KILL mysqld has no effect (can not be killed). Killing processes within mysql: No effect.
Changing ZFS parameters: compression on/off, atime on/off: No effect.
Changing mysql parameters: No effect.
Changing mysql versions (upgrade): No effect.

No logging at all: the VM just hangs in most cases. In the cases where the VM is still responding, the IO is stuck, so not a lot can be done.

Running the same program on the same server but with ext3 as the filesystem: No problems at all, 100% stable.

I can make the VM with the database available for analysis.

Can this bug be prioritized? It can become 100% blocking for using mysql on zfs.
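
A process that ignores `kill -KILL` is typically in uninterruptible sleep (state `D`), waiting on I/O inside the kernel. The report above does not confirm that this is what happened to mysqld, so the following is only a sketch of how one might check on a machine that still responds. `parse_stat` and `blocked_tasks` are hypothetical helper names; the `/proc/<pid>/stat` layout is the standard Linux one.

```python
import glob

def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]

def blocked_tasks():
    """List (pid, comm) for tasks currently in uninterruptible sleep ('D')."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited or the line could not be parsed
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

For any PID this reports, reading `/proc/<pid>/stack` as root shows where in the kernel the task is waiting, and `echo w > /proc/sysrq-trigger` dumps all blocked tasks to dmesg, provided sysrq is enabled. Either output would be useful to attach to a report like this one.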

@aarcane

aarcane commented Jan 30, 2014

Deadman panic sounds like a bad idea. I believe in production, it should
just scream to klog.
On Jun 13, 2013 11:40 AM, "Richard Yao" notifications@github.com wrote:

It is possible that your hardware is failing to respond, which could cause
this. zfsonlinux/zfs@cc92e9d (https://github.com/zfsonlinux/zfs/commit/cc92e9d0c3e67a7e66c844466f85696a087bf60a) introduced a deadman thread that will cause a kernel panic upon detection
of such failures. It should be in the daily PPA. Would you give that a try
and report back? If the deadman thread induces a panic, there is an issue
with your hardware.



@behlendorf
Contributor

@jurt1235 Can you try to reproduce this issue with the latest source from master? If so, are any errors logged to dmesg?

@aarcane I tend to agree. The deadman referenced above was tweaked for Linux just to post a zevent. Once #2085 gets merged it would be trivial to add a script which grabs debug information and panics the node if desirable. For HA solutions I can see why that might be desirable.

@nvnobelen

The test has been running for 6 hours now with the version I pulled from Git yesterday, with no problems so far.
So it is not hardware, and it looks like it is already fixed for the next release. It is probably also not this issue, despite the same hang behavior.

@mattaw
Author

mattaw commented Jan 31, 2014

For me git HEAD is much better than 0.6.2 both in performance and stability (last update was about a week ago for me).

Perhaps you could release a 0.6.3? 0.6.2 is broken on Ubuntu 12 LTS and Ubuntu 13 latest.

Matthew

@mattaw mattaw closed this as completed Jan 31, 2014
@mattaw
Author

mattaw commented Jan 31, 2014

Apologies, mistakenly closed.

@mattaw mattaw reopened this Jan 31, 2014
@FransUrbo
Contributor

@mattaw @lukasz99 @ColdCanuck @jurt1235 Considering that #1364 (and possibly, according to @mattaw, #1482) is fixed, can any of you reproduce this in current HEAD? There have been quite a lot of changes since January (and even more since this issue was created a year ago)...

@mattaw
Author

mattaw commented Jun 11, 2014

I can't reproduce it at all - to solve this I haven't used anything but
HEAD in ages.

Matthew
On Jun 11, 2014 1:23 PM, "Turbo Fredriksson" notifications@github.com
wrote:

@mattaw @lukasz99 @ColdCanuck @jurt1235 Considering that #1364 (and possible - according
to @mattaw the #1482) is fixed, can either of
you reproduce this in current HEAD? There have been quite a lot of changes
since January (and even more since this issue was created a year ago)...



@ColdCanuck
Contributor

I have not seen this in quite some time; I'm not the OP, but from my perspective this issue can be closed as stale.

@behlendorf
Contributor

OK, then I'm going to tentatively close this. If anyone sees this issue again we'll reopen it or file a new issue.

@mattaw
Author

mattaw commented Jun 12, 2014

Thanks for all your work on ZFS by the way.

Matthew

On 6/12/2014 12:16 AM, Brian Behlendorf wrote:

OK, then I'm going to tentatively close this. If anyone sees this issue
again we'll reopen it or file a new issue.



@behlendorf behlendorf modified the milestone: 0.6.7 Nov 8, 2014