High zvol utilization - read performance #3888
Related to #3871 (Highly uneven IO size during write on Zvol?). CC: @ryao. @zielony360, are you observing this behavior right from the start after booting the computer/server? Any special configuration underneath the zvols or zpools? cryptsetup?
@kernelOfTruth uptime:
So the problem isn't connected to prefetching metadata into ARC after a reboot. Pool configuration: built on SAS disks in 2 JBODs, using multipath round-robin to the physical disks. lz4 enabled. Changed ZFS parameters:
I was observing a very high pm% in arcstat, so I have disabled prefetching now, but it didn't help. All zvols have a 128k volblocksize, because the operations generally use large blocks (as in this case). Compression ratio from 1.5x to 2.0x. All zvols are sparse. I also set zfs_vdev_async_read_max_active to 10 - no change. cryptsetup - no. There is no problem with writes, which go up to 1 GB/s. Best regards,
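For reference, a minimal sketch of how tunables like these are usually inspected and changed on ZFS on Linux (the file name under /etc/modprobe.d is just an example):

```sh
# Current values of the tunables mentioned above
cat /sys/module/zfs/parameters/zfs_prefetch_disable
cat /sys/module/zfs/parameters/zfs_vdev_async_read_max_active

# Change them at runtime (takes effect immediately, lost on reboot)
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
echo 10 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active

# Persist across reboots via module options (example file name)
echo 'options zfs zfs_prefetch_disable=1 zfs_vdev_async_read_max_active=10' > /etc/modprobe.d/zfs-tuning.conf
```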
In reference to what state? What did change?
Was that supposed to mean 0.6.5.2, or are you really running 0.6.4.2? 0.6.5.2 has major improvements related to zvols: https://github.com/zfsonlinux/zfs/releases
which could help with your problem.
When I was testing reads with dd with a big block size before the production period, it was hundreds of GBs.
As for the configuration, there is no change.
Yes, I am using 0.6.4.2.
I know, but I saw many kernel panics here in Issues, even on 0.6.5.2, so I am waiting for bugfixes. :] That's a production backup system, so I cannot afford crashes.
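As an aside, a quick sketch of how to double-check which ZFS version is actually loaded (standard ZoL locations):

```sh
# Version of the loaded kernel modules
cat /sys/module/zfs/version
cat /sys/module/spl/version

# Or from module metadata / the module load message
modinfo zfs | grep -iw version
dmesg | grep 'ZFS: Loaded module'
```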
@zielony360 alright :) Could you please post the output of
and
That might give some more diagnostic information. If the others don't have any quick ideas on how to improve this, the IRC channel might be your best bet - perhaps someone there has encountered similar behavior. Thanks
@kernelOfTruth Thank you for being involved. ;-)
@zielony360 sure :) the next step I would test would be to set primarycache=metadata if there's lots of copying (+/- rsync) involved - disabling data caching isn't recommended for most cases according to several sources I read (and only in special cases could it improve performance or reduce the amount of data that needs to be read), but your box is clearly better outfitted (132GB ECC RAM) than mine (32 GB) and your I/O is also much higher - so other approaches would be needed ... If lots of data is transferred in a "oneshot" way, then there would be two potential "helpers": https://github.com/Feh/nocache and
after that I'm out of ideas :P
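For completeness, a sketch of what that primarycache experiment would look like; pool/zvol names are placeholders, and the change is easy to revert:

```sh
# Check the current setting ("all" caches both data and metadata in ARC)
zfs get primarycache,secondarycache tank/veeam-lun0

# Experiment: keep only metadata in ARC for this zvol
zfs set primarycache=metadata tank/veeam-lun0

# Revert if it hurts performance
zfs set primarycache=all tank/veeam-lun0
```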
Are you sure caching only metadata would be good, looking at the ARC statistics? We can see there are a lot of data hits, more than metadata. We would lose data caching (we don't have L2ARC). Unfortunately, we can't use those solutions, because the zvols are used by Windows machines. Normally Veeam is working there, but now we are migrating some backups from one LUN to another at the NTFS level (basic copy/paste). Veeam also has bad read performance while restoring VMs, so it could be connected.
No - that's just what I would do to gain experience of how ZFS behaves (obviously I don't have that much experience with its operation and internals yet; also, this is a home desktop / power-user workstation, so the after-effects would be negligible - servers of course have much higher requirements for reliability and consistency in operation), hence the pointer to IRC. My chain of thought was that the issues could be related to zvols (-> 0.6.5.2? pre-0.6.5.y?), metadata (caching? rsync? ARC fighting with the pagecache?) and potentially others (now that you're mentioning it again: NTFS?). If you're using NTFS with the ntfs-3g driver (FUSE), make sure to look into the big_writes, max_readahead and other options. Sorry if I'm potentially pointing you in the wrong direction.
You are probably seeing read-modify-write overhead on zvols: NTFS uses a 4KB default cluster size while ZFS zvols default to volblocksize=8K. Either use zvols created with volblocksize=4K or format NTFS to use a cluster size of 8KB, and performance should improve. The volblocksize=8K default is an artifact from SPARC, where the memory page is 8KB.
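A sketch of those two options; note that volblocksize can only be set at zvol creation time, so an existing LUN would have to be recreated and the data migrated, and a small volblocksize on wide raidz2 vdevs costs extra parity/padding space (names and sizes are placeholders):

```sh
# Option 1: create the zvol with a 4K block size to match NTFS's default cluster
zfs create -s -V 2T -o volblocksize=4K tank/veeam-lun-4k
zfs get volblocksize tank/veeam-lun-4k

# Option 2: keep the zvol default and instead format NTFS on the Windows side
# with an 8K allocation unit (e.g. "format X: /FS:NTFS /A:8192"), so that the
# cluster size and the volblocksize line up.
```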
No. Look at the iostat in my first post. One LUN is only reading and the second one is only writing - a simple copy/paste. It uses a 512 kB IO block (avgrq-sz / 2). So block size is not the issue here.
Yes, NTFS on Windows 2012 through an FC target (I mentioned it already), not on Linux. There is no rsync anywhere. Simple Windows copy/paste in this case. @kernelOfTruth, @ryao
@zielony360 I suggest running
You could try backporting the appropriate commits to 0.6.4.2 to deal with this: 37f9dac
The first does not apply cleanly to 0.6.4, but the conflicts are trivial to handle. I took the liberty of doing that here: https://github.com/ryao/zfs/tree/zfs-0.6.4.2-zvol
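Roughly, such a backport looks like the following; only commit 37f9dac is named above, so any further commit IDs would have to be taken from the linked branch:

```sh
# Start from the 0.6.4.2 release tag
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout -b zvol-backport zfs-0.6.4.2

# Cherry-pick the zvol rework on top; expect small, easy-to-resolve conflicts
git cherry-pick 37f9dac

# Or build the already-backported branch instead
git remote add ryao https://github.com/ryao/zfs.git
git fetch ryao zfs-0.6.4.2-zvol
git checkout -b zvol-backport-ryao ryao/zfs-0.6.4.2-zvol
```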
@zielony360 Did you try the backports?
@ryao @kernelOfTruth Thank you for your effort. A few days ago I upgraded to 0.6.5.7. The latency on zvols is similar to before, but overall performance is better. Read latency is still about 50-100 ms during operations on each zvol, though. I believe it can be related to the high metadata miss rate:
So a little above a 50% metadata hit rate. This is not a problem of lacking memory:
I think metadata is being evicted too aggressively (zfs_arc_meta_strategy is 1). Is there a way to tune this so that I end up with a better metadata hit rate? primarycache=metadata is the last option for me, because as you can see above, I get a 68% data hit rate from ARC. I do not have L2ARC or SLOG. Currently tuned ZFS parameters:
Compression: lz4, ratio 1.5x on the pool. We are using only zvols, exported through an FC target.
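One possible knob, offered as a sketch rather than a known fix: zfs_arc_meta_min sets a floor below which ARC metadata is not evicted, and the arc_meta_* counters in arcstats show where you currently stand (the 16 GiB value is only an example, not a recommendation):

```sh
# Current metadata accounting and limits
grep -E 'arc_meta_(used|limit|min|max)' /proc/spl/kstat/zfs/arcstats

# Guarantee a minimum amount of ARC for metadata (example: 16 GiB)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_meta_min

# Persist it across reboots
echo 'options zfs zfs_arc_meta_min=17179869184' >> /etc/modprobe.d/zfs-tuning.conf
```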
@zielony360 Looking at your initial iostat output, I'm curious as to how you managed to get an avgrq-sz of 1024 (512KiB). Normally it will be capped at the block device's maximum readahead size, which defaults to 128KiB. Did you do a
@dweeezil I haven't changed readahead. But is read_ahead_kb the parameter that decides the maximum IO size? I didn't think that a size that is too big, and gets split later, would be bad for performance.
targetcli attributes of all zvols:
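For reference, this is roughly how the readahead in question is inspected and changed per zvol; zd0 is just an example device, the sysfs value is in KiB and blockdev reports 512-byte sectors:

```sh
# Per-zvol readahead, in KiB
cat /sys/block/zd0/queue/read_ahead_kb

# The same value via blockdev, in 512-byte sectors (1024 sectors = 512 KiB)
blockdev --getra /dev/zd0

# Raise it, e.g. to 512 KiB, to allow larger sequential read requests
echo 512 > /sys/block/zd0/queue/read_ahead_kb
```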
@zielony360 What's your storage stack like? In particular, what's actually using the zvols? You mentioned NTFS. Is it ntfs-3g? Or something else?
@zielony360 Oh, I think I see. You're exporting the zvols as iSCSI targets. Is the NTFS access from regular Windows clients?
@dweeezil No, we are exporting the zvols through Fibre Channel to Windows, which puts its NTFS on them. They are used for Veeam backups. So generally large (about 512 kB) IOs, 70% async writes, 30% reads.
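Since targetcli was mentioned earlier, a minimal sketch of how a zvol typically ends up as an LIO block backstore; the backstore name is hypothetical and the FC (qla2xxx) target wiring itself is omitted:

```sh
# Expose a zvol as a block backstore
targetcli /backstores/block create name=veeam_lun0 dev=/dev/zvol/tank/veeam-lun0

# Inspect backstores and their attributes
targetcli ls /backstores/block
targetcli /backstores/block/veeam_lun0 get attribute
```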
@zielony360 Got it. Do you see the high read latency for read-only workloads? Or does it require a mix of reads and writes?
@dweeezil Currently we are copying a file on NTFS from one LUN to another. Then we will use this for a read latency test, but look at it now:
Those are the only IOs on the pool at the moment. High latency on reads, low latency on writes. arcstat:
You can see the metadata miss rate is generally significant. arcstats fragment:
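For anyone following along, the hit/miss figures discussed here can be watched live with arcstat; a sketch, assuming the arcstat.py field names below are available in your version:

```sh
# 1-second samples: total reads, overall/demand/prefetch/metadata miss rates, ARC size
arcstat.py -f time,read,miss%,dm%,pm%,mm%,arcsz,c 1
```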
@zielony360 Ugh, the perils of sticking my nose into a bunch of similar issues at once: I had totally overlooked the fact that your iostats were on the zvol devices and not on the underlying vdev disks. Your earlier arcstats showed a whole lot more metadata being cached:
I presume the last bit of output (the one with the meta_size figures) is from a different point in time? Was there a reboot in between?
@dweeezil Yeah, the iostats with zd* devices are the zvols. The thing is that the zvols are at 100% utilization during reads, but the physical disks are not. So I think metadata misses may be a reason. The "meta_size 4 29807262208" is from 10 months ago, when we had 0.6.4.2. The last arcstats, from this month, are after upgrading to 0.6.5.7, and yes, there was a reboot during this, but it was over a week ago. Since then there have been plenty of daily backup cycles, so the metadata should already reside in ARC, but it doesn't (hit rate below 50%). In general I can say that 0.6.4.2 had a better metadata feeding/eviction policy than 0.6.5.7. As you can see in earlier posts, there was about a 90% metadata hit rate, compared to below 50% today. Overall performance is better on 0.6.5.7, but I believe that if we had a metadata hit rate similar to 0.6.4.2, it would be much better still.
One more thing. Please remember I disabled prefetch shortly before the upgrade. The metadata hit rate coming from prefetch was good, but for data it was not. On 0.6.4.2 disabling it improved performance a bit. Should I try to enable it now on 0.6.5.7? I also have experience with Nexenta ZFS, and it tells me it would be great to split the prefetch_disable parameter into data_prefetch_disable and metadata_prefetch_disable. I would leave metadata prefetch enabled and disable data prefetch.
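Re-enabling prefetch is a one-line, easily reverted change; as far as I know, 0.6.5 has only the global switch plus the zfetch stream tunables, with no separate data/metadata prefetch knobs (a sketch):

```sh
# 0 = prefetch enabled, 1 = prefetch disabled
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable

# Related zfetch tunables that shape prefetch behaviour
grep . /sys/module/zfs/parameters/zfetch_*
```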
@dweeezil @kernelOfTruth @ryao I was running some tests to see whether metadata misses can cause the high read latency. I added one 200 GB SSD to the pool as a cache device and set primarycache=metadata. As expected, the metadata miss rate dropped from about 60% to <10%, but that setting made things worse: higher latencies. So I set primarycache back to "all" and performance came back to previous levels. The L2ARC hit rate was and is very low: <10%. What is strange is that the 3 TB 7200 RPM disks are still not very utilized, while the zvols are 100% busy (iostat). Example output for the HDD disks:
As a reminder, I have 5 x raidz2 (10 disks each) in the pool. I think read latencies could be lower as long as the HDDs are not fully utilized. Also look above at avgrq-sz, which, with a 128k volblocksize on each zvol, could be greater (?).
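For reference, the cache-device experiment described above boils down to something like this (device paths are placeholders); per-dataset L2ARC feeding is controlled by secondarycache:

```sh
# Add / remove the SSD as an L2ARC cache device
zpool add tank cache /dev/disk/by-id/ata-EXAMPLE-SSD-200G
zpool remove tank /dev/disk/by-id/ata-EXAMPLE-SSD-200G

# What gets fed into L2ARC for a given zvol (all | metadata | none)
zfs set secondarycache=all tank/veeam-lun0

# L2ARC hit/miss/size counters
grep -E '^l2_(hits|misses|size)' /proc/spl/kstat/zfs/arcstats
```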
@dweeezil We also did the test you suggested: 100% read. Latencies were high even with that IO profile. Not more than 40 MB/s read for random and not more than 150 MB/s for sequential. I have been on 0.6.5.7 the whole time now.
I found out why reads are needed while making a backup. Veeam uses a reverse incremental backup method, so data is first written to a separate file (at the NTFS level) and then copied into another, bigger file. Unfortunately, ZFS zvols do not implement XCOPY (#4042), so the copy has to be done physically. The ideal workaround, until XCOPY is implemented, would be caching recent writes in ARC/L2ARC. Do you have an idea whether it can be tuned somehow for this purpose?
Hello,
I am using ZFS 0.6.4.2 on kernel 4.0.4. I share zvols through an FC target. The pool: 5 x raidz2 (10 disks each). Pool space usage: 29%.
The problem is read performance, which generally was much higher; now, while copying a file from one LUN to another, there is high zvol utilization causing low read throughput:
The physical disks (3 TB 7200 RPM) are not overloaded:
arcstat:
Writes are flushed by NTFS, which uses the zvols, every 6 seconds.
Can you suggest some parameter tuning, or does the problem lie deeper?
Best regards,
Tomasz Charoński
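For reference, the utilization and ARC figures quoted in this report come from standard tools; a sketch of the kind of commands involved (pool name is an example):

```sh
# Per-device utilization and request sizes (zd* are the zvols, sd*/dm-* the disks)
iostat -xm 1

# Per-vdev throughput from ZFS's point of view
zpool iostat -v tank 1

# ARC hit/miss breakdown
arcstat.py 1
```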