zvol loses partition tables in high IO situations -- also crashes qemu kvm #10095
could you describe your environment a little bit? kvm / vm settings/config, for example? what kvm version / management solution? no corruption/data loss besides the partition-table loss? where are the zvols located? what is meant by "when we allowed aggressive swapping"? oh - and you don't run swap on a ZVOL, do you? ( https://github.com/openzfs/zfs/wiki/FAQ#using-a-zvol-for-a-swap-device )
Hi devZer0, sorry, I somehow missed your message. Example libvirt definition:
virsh -V: the virsh version does not matter. The kernel does not matter either (tried 4.x LTS, 5.x LTS, latest 5.x). There is no corruption; just the partition table gets killed:
zpool with cache and log (but that does not matter either; it also happens on a single-disk-backed zpool, with or without cache/log)
And no, we don't run swap on ZFS volumes; it's not that bad. qemu-img convert loves to be the no. 1 reason to run into trouble. Here is a crash from today:
followed by more fun:
and another qemu-img process, with another job:
And here even qemu exploded with segfaults:
the qemu-img command is:
The volumes are created via:
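The exact creation command is not shown above; for illustration, a compressed zvol of this kind is typically created along these lines (the size and volblocksize are made-up placeholders, not the reporter's values):

```sh
# Illustrative sketch only -- pool/volume names follow this thread,
# size and volblocksize are placeholders.
zfs create -V 100G -o compression=on -o volblocksize=8k kvm-storage/test1
```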
Compression is on. Inside the guests, the party looks like this: It would be quite super nice if someone could give a hint why this happens. It's effectively reproducible; the probability is proportional to the IO happening on the server. Running 2x qemu-img at once is basically a 100% guarantee that the ZFS volumes will explode and the KVM/QEMU processes will die with a segfault or however.
A simple `zfs destroy kvm-storage/test3` will take 79 seconds while `qemu-img convert -p -O raw test1-flat.vmdk /dev/zvol/kvm-storage/test1` is running, with nothing else running on this server. iotop output:
top will look like:
A load of 40 on a 32-core CPU with a single ZFS volume doing IO?
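To put numbers on the starvation, the two commands from above can be run side by side; per the report, the destroy took 79 seconds instead of returning almost instantly. A minimal sketch, using the pool/volume names from this thread:

```sh
# Terminal 1: sustained streaming write into a zvol (the trigger).
qemu-img convert -p -O raw test1-flat.vmdk /dev/zvol/kvm-storage/test1

# Terminal 2: a normally near-instant operation; reported to take
# 79 seconds while the convert above is running.
time zfs destroy kvm-storage/test3
```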
So all in all it seems to me that the ZFS scheduler will leave zvols behind to die without any IO time for minutes, leading to these situations. How do I force ZFS to give at least a minimum of IO time to everything?
does this happen with zvols only? how does it behave with an ordinary file, e.g. if you convert the vmdk to a zfs-backed file? also have a look at this one: and maybe also this one:
oh, and what about this one?
This is: HDD -> Zvol ( same process blocking multiple times )
With the same conversion going to the zpool mountpoint /kvm-storage: `qemu-img convert -p -O raw test1-flat.vmdk /kvm-storage/test_1`, I was not able to reproduce the issue. So maybe it is indeed something that only affects zvols. @devZer0 I read through all of them, and I will now create an /etc/modprobe.d/zfs.conf with: zvol_threads=8 and restart/continue my testing now.
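For reference, persisting that module parameter via modprobe looks like this; zvol_threads=8 is simply the value under test in this thread, not a recommended setting:

```sh
# /etc/modprobe.d/zfs.conf
# Limit the number of zvol worker threads (value under test here, not a tuned default).
# Takes effect the next time the zfs module is loaded (e.g. after a reboot).
options zfs zvol_threads=8
```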
mhh, weird. i would try triggering that problem with a mostly incompressible and consistent "write only" load to see if it makes a difference. maybe qemu-img is doing something special... maybe "shred /dev/zvol/kvm-storage/test1" or something like that... (or disable compression and use dd if=/dev/zero of=/dev/zvol/kvm-storage/test1 bs=1024k...) anyhow, it indeed looks like some scheduling/starvation issue...
Ok, I am not able to kill a single zvol IO (fs check of a qemu VM) with 3x shred on zvols. But it will report straight away frozen ... failed command: READ FPDMA QUEUED ... status: { DRDY } and all the rest with a single qemu-img. -- Interesting to mention that 1x qemu-img seems to kill things, while 1x qemu-img AND 3x shred at the same time seems, even after minutes, not to kill things... maybe this qemu-img gets so little read IO because of the shreds that it actually does not get the chance to block things. So this qemu-img does a pretty good job (whatever it actually does). I will now start trying my luck with https://github.com/openzfs/zfs/wiki/ZIO-Scheduler
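The ZIO scheduler page referenced above boils down to the zfs_vdev_* module parameters; a hedged example of inspecting them and capping async-write concurrency at runtime (the parameter names exist in 0.8.x, the values below are only a starting point, not advice from this thread):

```sh
# Show the current per-class ZIO scheduler limits.
grep . /sys/module/zfs/parameters/zfs_vdev_*_active

# Example: reduce async write concurrency so reads and sync I/O starve less.
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 1 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
```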
i can easily reproduce your issue on my recent proxmox system with latest zfs 0.8.3. while issuing the following command on the proxmox host (filling the zvol of an offline vm): dd if=/dev/urandom of=/dev/zvol/hddpool/vms/vm-111-disk-0 bs=1024k, shortly after, inside a kvm vm, the following command shows no read progress anymore: while true;do dd if=/dev/sda bs=1024k |pv >/dev/null;done. in the end, inside the vm, i'm getting "[sda] abort", i.e. i/o got completely stuck. what makes things even worse: the vm is not using a zvol on hddpool but on ssdpool, so apparently it's about i/o starvation in the kernel/zfs layer and not in the disk layer.... i would call this a serious issue (which is not about qemu-img at all). not sure if kvm or zfs is to blame here... please, can you have a look @behlendorf ?
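Condensed into copy-pasteable form, the reproduction from this comment is (pool and device names are the ones from this particular system and will differ elsewhere):

```sh
# On the Proxmox host: saturate a zvol on the HDD pool with writes
# (the zvol belongs to an offline VM, vm-111).
dd if=/dev/urandom of=/dev/zvol/hddpool/vms/vm-111-disk-0 bs=1024k

# Inside a running KVM guest whose disk lives on a different pool (ssdpool):
# shortly after the host-side dd starts, this loop shows no read progress.
while true; do dd if=/dev/sda bs=1024k | pv > /dev/null; done
```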
yes, it does not matter what OS or version. I tested here Fedora 31/30, CentOS 7 and Arch Linux with versions between 0.7.x and latest 0.8.3 in different hardware setups (raid controller, hba, native, jbod). So far it is more and more safe to say: qemu-img on datasets works without any issues. I tried these settings in different combinations and numbers:
as well as:
Sometimes the error comes earlier, sometimes it (seems to) come later. There is no 200% failsafe way to reproduce this, as sometimes it might work, but most times not (and when it fails, sometimes earlier, sometimes later). So far the workaround: don't use qemu-img with zvols. ionice won't help either, by the way.
i have found with further testing that this seems completely unrelated to kvm. i also get completely stalling reads from a zvol on ssd when doing a large streaming write to a zvol on hdd, i.e.: dd if=/dev/urandom bs=1024k |pv > /dev/zvol/hddpool/zvol1 makes dd if=/dev/zvol/ssdpool/zvol1 bs=1024k |pv >/dev/null hang completely. furthermore, after about 4-5gb of writes, performance drops significantly, by at least 3/4 (from >120mb/s to <30mb/s), even without reading in parallel.
yes, that's indeed another issue. But honestly, I am already quite happy I found the cause of these segfaults and qemu-kvm crashes. But you are right, ZFS's zvol code, at least here on Linux, seems to have quite some room for improvement. With qemu-img something happens to the zvol that makes things go downhill, while with dd it does not lead to crashes (although the performance might still feel like my 3.5" floppy).
i think this is just a matter of difference in write performance/volume. try disabling compression on the zvol and try writing to it with a larger blocksize (dd if=/dev/zero bs=1024k....); i bet you will see the same issue as with qemu-img. btw, i can NOT reproduce the issue with blocksize=1k, as this won't fully saturate i/o, but with bs=8k i can easily trigger it. btw, on my system ashift for hddpool is at 9 and for ssdpool at 12.
yeah, well, unfortunately you are right: if the IO just takes long enough (dd if=raw of=zvol bs=4k conv=fdatasync,sparse with 500 GB of raw data), things will go downhill. During the dd, two random KVMs died....
So we can consider zvols close to unusable for really reliable stuff. Back to LVM, I guess... super sad.
here we go. unfortunately, i don't see a significant improvement. compared to a file or lvm, writes to the zvol are slow, while reads severely starve.
Hi guys, proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve). I'm getting crashes only after high load. It usually hangs the VMs entirely, as well as the zpool's traffic. The host has 128 GB RAM and a Threadripper, so I don't think there is any memory or CPU limitation involved. I've found this crash in syslog:
FWIW, just in case it inspires anyone's thinking on this: On a Slackware box with XFS (not ZFS!), under the heaviest of I/O (e.g. multiple rsyncs running), I occasionally see the same And when it happens, it eats the partition table for the VM's boot "disk" so regularly that I have a dd-based script standing by to replace the first "sector" of the raw disk image file. I don't, however, see any corruption of the host XFS filesystem.
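A first-sector restore of that kind is essentially a one-liner; this is a sketch assuming a previously saved copy of sector 0 and a raw image path, both of which are hypothetical names:

```sh
# Hypothetical paths: sector0.bak holds a saved copy of the image's first
# 512 bytes; conv=notrunc rewrites only that sector and leaves the rest alone.
dd if=/root/sector0.bak of=/var/lib/vms/guest-boot.raw bs=512 count=1 conv=notrunc
```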
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
@layer7gmbh, what type of kvm host is this? it seems it's NOT proxmox based, but most (if not all, besides this one) of the reports i see on lost partition tables are in the proxmox forum/community. is vzdump involved in the process of losing the partition table? anyhow, i think it's not a zfs issue but a kvm/qemu issue: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
@layer7gmbh, can you provide some feedback so we can decide if this needs to be left open?
regarding the hang of the virtual machine, there are some interesting findings at https://bugzilla.proxmox.com/show_bug.cgi?id=1453 , but not sure if they are relevant here
@layer7gmbh, does this problem still exist?
@layer7gmbh, you seem to use a sata disk in the virtual machine configuration. can you confirm that losing the partition table only happened in virtual machines with a virtual sata disk? apparently, there is an issue with that in qemu: https://bugzilla.proxmox.com/show_bug.cgi?id=2874#c51
Probably this kind of issue was not (only) related to SATA. Also, according to your qemu post, they talk about unclean shutdown situations where partition tables might get lost. That was not the case here. Unfortunately, I am not able to tell you much more. With our current solution, kernel 5.15.83-1-pve, it seems we don't run into this kind of issue.
the sector 0 corruption bug has been found and is being fixed (related to sata/ide & backup interruption), see https://bugzilla.proxmox.com/show_bug.cgi?id=2874#c51 . did it happen again for you? if not, i think we can close this issue!? would you be so kind, @layer7gmbh?
zfs-0.8.3-1
zfs-kmod-0.8.2-1
Kernel Module: 0.8.3-1
SPL: 0.8.3-1
Kernel: 5.3.11-300.fc31.x86_64
We sometimes run into this issue here:
We have seen that on other KVM host machines running on ZFS, too.
It seems the more load we see on ZFS, the higher the chance of this happening.
After that, the ZFS zvol has lost its partition table. It can be recreated easily using something like testdisk.
In addition, you have to reinstall the OS's bootloader to get the system booting again.
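Roughly sketched, and assuming a Linux guest using GRUB plus host-side access to the zvol (device and pool names here are illustrative), that recovery looks like:

```sh
# On the host: let testdisk scan the zvol and rewrite the recovered
# partition table (interactive tool).
testdisk /dev/zvol/kvm-storage/vm-disk

# Then, from a rescue shell inside the guest: reinstall the bootloader
# (GRUB on a BIOS/MBR guest shown; adjust to the guest's actual setup).
grub-install /dev/sda
update-grub
```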
This happens randomly across all host machines, across all kinds of virtual machines, HDDs and CPUs. It does not hit the same VMs; it's all random.
I observed before that, when we allowed aggressive swapping, the ZFS systems started to answer slowly (all zfs commands took long enough to hit the deadman timeout) while the zvols responded normally. At that time, this issue happened basically every 2-3 days.
After turning off swapping, the issue was gone. Until yesterday, when it just came back out of nowhere.
On the same systems using LVM or qcow2 file storage, this issue does not happen.
Does anyone have any idea?
Thank you!