zfs 0.6.5 and write performance problems #4512
Comments
@crollorc There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats, such as the dmu_tx counters.
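(For reference, one way to sample those counters on a ZoL box; the paths below are the standard /proc/spl/kstat locations rather than anything quoted from this comment, and "tank" is a placeholder pool name.)

```sh
# Sample the DMU transaction counters twice, a little apart, to see which
# ones are climbing.
cat /proc/spl/kstat/zfs/dmu_tx; sleep 10; cat /proc/spl/kstat/zfs/dmu_tx

# Per-pool transaction group history, one line per recent txg.
cat /proc/spl/kstat/zfs/tank/txgs
```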
@dweeezil Thank you for the speedy response!
Note that sync=disabled was set up on all pools a long time ago to get reasonable NFS performance.
We use an Areca ARC-1880 controller and a bunch of WD30EZRX disks. Here are some details on system memory, ZFS tunings and iostat for a sample disk:
Below are some details for the kstats mentioned. dmu_tx_assigned is increasing fast and dmu_tx_group is increasing slowly:
Thanks again for any pointers.
@crollorc These aren't a good sign:
Please try … One other note: you had suspected changes to the zvol code, but it's not clear that your system is using zvols at all, especially if it's just an NFS server.
@dweeezil Thanks again for your help!
Yes, sorry for this red herring!
Some txgs below. As far as I can tell, for this sample the avg sync time is 0.785s and max dirty data is 389MB. Is that right? Does it look like we have excessive dirty data? If so, should we change some options to reduce this? Also, the thing I find most confusing post upgrade is the halving of average write size and doubling of write IO. Could this indicate some sort of issue with the coalescing or merging of txg async writes?
Thanks again.
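(For readers following along, here is a rough way to derive those two numbers from the txg kstat. The column layout and the "C"/committed state code are assumptions based on the 0.6.x /proc/spl/kstat/zfs/&lt;pool&gt;/txgs format, where times are in nanoseconds and ndirty is in bytes; "tank" is a placeholder pool name.)

```sh
# Average sync time (s) and peak dirty data (MiB) over the committed txgs
# currently held in the kstat history. Assumed header:
#   txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
awk '$3 == "C" { sync += $12; if ($4 > dirty) dirty = $4; n++ }
     END { if (n) printf "avg sync %.3f s, max dirty %.0f MiB over %d txgs\n",
                          sync / n / 1e9, dirty / 1048576, n }' \
    /proc/spl/kstat/zfs/tank/txgs
```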
@crollorc I may have gotten ahead of myself here. So much has changed since 0.6.2 and I latched on to the new write throttle. There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats. Are these changes:
left over from 0.6.2, or did you add them after the upgrade? If they are from 0.6.2, I'd suggest removing them and watching the arcstats.
I've added some arcstats and arc_summary below (hopefully with the right params).
They are left over and were tuned to deal with some problems with a large ARC in 0.6.2. We were also dropping caches (echo 3 > /proc/sys/vm/drop_caches) every hour on 0.6.2 but stopped post upgrade. We are very dependent on the high ARC hit rates we get (ARC: 93%, L2ARC: 68%). How would I reset these options to their defaults on the live system? What size will the ARC default to, and will this result in the ARC being cleared or will it resize dynamically? Thanks again for your help.
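(A small inspection sketch that may help with the questions above, using the standard ZoL sysfs/kstat paths; whether a live change to zfs_arc_max takes full effect depends on the ZoL version, so this only shows how to look at the current state and is not an answer from the maintainers.)

```sh
# ARC tunables as currently loaded; 0 normally means "use the built-in default".
grep . /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zfs_arc_min

# Live ARC size, ceiling and hit/miss counters.
awk '/^(size|c_max|hits|misses|l2_hits|l2_misses)[[:space:]]/ { print $1, $3 }' \
    /proc/spl/kstat/zfs/arcstats
```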
@crollorc Please …
Thanks again for all your pointers.
We haven't noticed any problems with the ARC or reads in general since the upgrade, so I'm unclear on why I should be looking at arcstats. Because the ARC and L2ARC are so efficient, only a small proportion of disk IOs are reads -
However, writes have doubled post upgrade, which gives us a lot less disk IOPS headroom. When IOPS hit 220 or so, IO queue depth and service time jump, followed by SAN and client load linked to IOWAIT.
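(For anyone reproducing this, the per-disk figures being discussed come from sysstat's extended iostat output; "sdb" below is a placeholder device, and on the older sysstat shipped with these distributions avgrq-sz is reported in 512-byte sectors.)

```sh
# Extended statistics for one pool member every 5 seconds: watch w/s,
# wrqm/s (merged writes), avgrq-sz, avgqu-sz, await and %util.
iostat -x sdb 5
```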
@crollorc Both your … If there were a bigger underlying issue here, I'd expect a lot more problem reports given how widespread NFS server use is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled.
One thing I didn't mention is that we have not done a 'zpool upgrade' so that we can downgrade if necessary. We do have compression=on, and I disabled it to see what effect it would have: iostat avgrq-sz has doubled (to where it was pre-upgrade) but w/s has not dropped. Also, there are no merged writes (wrqm/s), which seems odd to me.
Any ideas on data I could provide, tests I could do or steps I could take to help identify the root cause or mitigate the problem? Thanks again.
We tested sync=standard on 0.6.2 with an SSD-based ZIL, but the performance was poor and so we moved to sync=disabled. We are planning to try sync=standard again to see if it eliminates the small writes. Does this sound reasonable?
We are now running with sync=standard, and request sizes to the slog SSDs and the vdevs are both now around 15 sectors, so it looks like we are hitting the same issue with both sync=disabled and sync=standard. Any ideas!?
@crollorc Unrelated to this issue, your earlier issue with …
@dweeezil Thanks for the advice. I hiked up zfs_dirty_data_sync by a factor of 5
This has led to larger, slower txg syncs but unfortunately has had no effect on write sizes.
The only way we've found of increasing the average write size is to set compression=off.
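(A sketch of that tuning for reference. The exact value used above isn't given, so the number below simply assumes the stock 64 MiB default multiplied by five; the parameter can be changed at runtime and reverted by writing the old value back.)

```sh
# Current threshold (bytes of dirty data that trigger an early txg sync).
cat /sys/module/zfs/parameters/zfs_dirty_data_sync

# Raise it to 5x the usual 64 MiB default (assumed value, not the poster's exact one).
echo 335544320 > /sys/module/zfs/parameters/zfs_dirty_data_sync
```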
@crollorc I could not duplicate this problem with a quick test. I had no problem getting about a 128KiB average write size as reported by iostat when copying large files to an NFS-exported ZFS filesystem. When rsyncing lots of small files, the average write size was all over the place, exactly as I'd expect. I'll try a few more tests with varying pool geometries, but at the moment I'm not able to reproduce the problem.
@dweeezil Thanks for trying. Looking at the NFSv3 traffic, I found that most (70%) of writes were 4K despite having wsize=1048576 in all client mount options
I had a look at the clients and I think the root cause of the 4K writes may be that many of these clients are running Linux and Windows KVM VMs, which I would guess probably have a 4K block size. However, I don't understand why iostat avgrq-sz halved and iostat w/s doubled post upgrade.
@crollorc I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely. Could you try a simple dd test?
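(The exact command isn't preserved in this copy of the thread; below is a guess at the kind of streaming-write test being requested. Paths, sizes and the 8K block size are assumptions chosen to match the earlier discussion.)

```sh
# ~2 GiB of sequential 8K writes, first directly to the ZFS filesystem and
# then over its NFS mount, while watching iostat -x on the pool disks.
dd if=/dev/zero of=/tank/fs/ddtest bs=8k count=262144 conv=fsync
dd if=/dev/zero of=/mnt/nfs/ddtest bs=8k count=262144 conv=fsync
```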
One thing I am unclear on: should I see merged writes in iostat for all aggregation, or does it happen inside ZFS, and where exactly should I look for evidence of aggregation? We ran without compression on 0.6.2 and I noted that a large proportion of writes were merged in iostat. Last year we turned on compression and merged writes were significantly reduced but still present. So far on 0.6.5 we see no merged writes at all (prior to these tests), with compression either on or off. You can see this on this graph for a sample pool member (unfortunately I don't have graphs illustrating merged writes with compression off on 0.6.2).
Compression=off at the moment, but I cannot do this on an entirely quiescent fs right now. For the direct dd, I found that the request size jumped to 100-300 for the duration of the test, with intermittent drops (I guess between flushes). There were also merged writes during the test. For the NFS dd, the request size jumped to 100-150 for the duration of the test. There were also a few merged writes, but far fewer than for the direct dd.
I was mainly concerned about the aggregation performed within ZFS, and I was concentrating on the new write throttle as a possible cause of your regression because of your write-heavy load. The new write throttle work also refactored some of ZFS's aggregation code. As for the block layer's scheduler merging: since you're using whole disks, ZoL should be disabling the elevator, but it looks like you may have got cfq enabled. I don't remember offhand whether this logic has changed over time in ZoL or whether there have been any bugs in it. I'd suggest you check with something like …
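(The specific check suggested here didn't survive in this copy of the thread; an equivalent check using the standard sysfs path, which is my assumption rather than a quote, would be:)

```sh
# Show the active elevator for each pool member; the scheduler in use is the
# one printed in [brackets], e.g. "noop deadline [cfq]".
grep . /sys/block/sd*/queue/scheduler
```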
OK, I guess this was demonstrated by the jump in iostat avgrq-sz (versus the dd bs) in the dd results.
We are already using the noop scheduler. I wonder whether others using compression=off and a noop elevator also see no write merges, as this was definitely not the behavior we saw on 0.6.2. In any case, thanks for all your help.
@crollorc Can you please clarify whether this problem occurs with 0.6.4? One interesting post-0.6.4 change was the task priority fix. Prior to 0.6.5, the zio write threads ran at a much lower priority (higher raw priority value under Linux) than they were supposed to. If you had a lot of nfsd threads pushing writes to them, it's possible that the ZFS aggregation code had fewer IOs to work with. If you don't see this issue with 0.6.4 but do see it with 0.6.5, it might be due to the task priority adjustment.
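(One way to see the priority difference being described, assuming a standard procps ps and the usual z_wr_iss/z_wr_int taskq thread names; this is an illustration, not a command from the original comment.)

```sh
# Compare the scheduling class/priority of the ZFS write taskq threads with
# the nfsd threads feeding them.
ps -eLo pid,class,pri,ni,comm | egrep 'z_wr|nfsd'
```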
Hi, I'm seeing the same behaviour after updating from 0.6.4 to 0.6.5. Version 0.6.5 is much heavier on the writes being flushed to disks. My observations:
My pool:
Txgs:
@erikjanhofstede I never got a chance to test 0.6.4, unfortunately. Did you track iostat avgrq-sz? I found this halved for me with compression=on between 0.6.3 and 0.6.5.
@crollorc I only tested between 0.6.4 and 0.6.5. I don't have any stats from 0.6.3. I tried with and without compression on 0.6.5, but didn't notice any difference, by the way. I'm trying to reproduce the issue on a testing server, but that pool is almost empty and there's no workload either. @dweeezil If this issue is caused by the new priorities, is it possible that disabling spl_taskq_thread_priority could give us any insight into this?
@crollorc I don't have any graphs when I was updating from 0.6.3 to 0.6.4 many months ago, but there was no performance regression for me as far as I can remember. Only improvements. |
@erikjanhofstede in the latest master code the request size histograms are available via zpool iostat -r. If you have a test system available and we suspect the thread priority changes, then I'd suggest running your test case using the master source and collecting the request size histogram. You can then revert the thread priority changes, 1229323, and run the same test to see if there's a difference. I'd be very interested in your results.
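(A sketch of that procedure on a source build; "tank" and the build steps are assumptions and will vary by distro, and any conflicts from the revert have to be resolved by hand.)

```sh
# Baseline: request size histogram on stock master.
zpool iostat -r tank 60 2 > iostat-r.master.txt

# Revert the thread priority change, rebuild, and repeat the measurement.
git revert 1229323
./autogen.sh && ./configure && make -s -j"$(nproc)"
zpool iostat -r tank 60 2 > iostat-r.reverted.txt
```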
@behlendorf I've reverted 1229323 (had to fix some conflicts though). My test consists of running the commands below in parallel; for the rest, the server is idle, with an almost empty zpool, unlike the server that I posted graphs about before.
The results running on master, polled over a period of 60 seconds (zpool iostat -r 60 2):
The results running on master, without 1229323, polled over a period of 60 seconds (zpool iostat -r 60 2):
@erikjanhofstede thanks for running the test. Based on these results it doesn't look like 1229323 has any real effect.
@behlendorf I'm unable to reproduce the issue I'm having on my testing server, but it's very clearly visible on production servers. Can you think of a safe way to try some things out on a production server? In the meantime I'll continue to try to reproduce the issue. |
Over here it's the same as @markr123: CentOS 6, no zvols; any upgrade from 0.6.4.2 to 0.6.5 results in a dramatic write performance degradation. You can see my findings in this thread, but if you need any more info/testing I'll be happy to assist you with this.
Hi. This problem/ticket has now been open for over a year, and the performance issue is stopping most, if not all, people from upgrading past 0.6.4.2. Is this problem being looked at or worked on? Is there anything I can send you? It seems to have stalled. A new version of ZOL may come out, but if this problem is not fixed there is little point, as no one affected can upgrade to it. Thanks for all your hard work on ZOL. It's the best filesystem out there for Linux.
I believe @ryao is currently digging into zvols at work. There may be a light at the end of this tunnel yet. Plus, people can't stay on 0.6.4 for compatibility reasons.
So many features are getting added that imports on even slightly outdated systems don't work RW.
Consumers are upgrading, realizing there's a problem, trying btrfs, failing, trying other things, failing, and coming back to ZFS angry (I've seen this a few times). Users expect everything to improve by orders of magnitude with each release and to work by way of an easy button which lets them tune all the details without knowing them or having to actually tune them... This pressure goes up to their IT staff, and the buck sort of stops there. It's a major perception problem because they don't realize this isn't a commercial effort (on Linux), and they expect commercial-grade function for free...
My $0.02 is that we can't have performance regressions in releases; that qualifies as a first-order bug at the same level as data integrity. Performance doesn't have to improve, but it can't get worse. The fallout lasts years, and with the shaky history behind ZFS in the eyes of corporate decision makers, we need many years of solid releases to build confidence.
The fact that we are out of tree makes that many times harder: Linus can do something silly and ZoL just has to adapt. Beyond that case, in our own code, we should be testing our operations for time and resource consumption the way we test our features and data for return values, so we have consistent metrics across commits, since a change in the SPA could cause the DMU to wait, or memory usage to spike elsewhere for buffers, and only testing the functions changed by the commit wouldn't expose those metrics. It's not a small problem to address, but it's probably doable with enough blood, sweat, and profiling mechanisms.
@sempervictus agreed, ZFS needs a kind of "Phoronix Test Suite" for filesystems that tracks performance (and latency?) regressions and improvements regularly. The buildbots and tests will check integrity and functionality, but I guess asking for regular performance or latency tests for each commit would be too much and not entirely suitable? Dedicated runs on hardware ranging from small systems (Raspberry Pi, Pi 2) to server farms (Lustre, etc.) would be much better suited to catch corner, scaling and other cases.
@sempervictus agreed. We need additional automated testing which is run regularly in order to prevent performance regressions.
I think that would be an entirely reasonable place to start. If we can put together a meaningful battery of ZFS performance tests for the PTS which take less than 6 hours to run, they could be added to the buildbot and run regularly. Having a predefined suite of tests anyone can easily run to benchmark their hardware would be convenient. @kernelOfTruth if you're already familiar with the PTS, would you be interested in putting something together?
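(As a strawman for one entry in such a battery, here is a small fio invocation exercising the streaming-write pattern this issue keeps coming back to; the target directory is a placeholder and nothing here is an agreed test plan.)

```sh
# Four threads of sequential buffered 8K writes onto a ZFS dataset, run for a
# fixed two minutes so results can be compared run-over-run across commits.
fio --name=seq-write-8k --directory=/tank/fio --rw=write --bs=8k --size=4g \
    --numjobs=4 --ioengine=psync --runtime=120 --time_based --group_reporting
```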
Hi, is there any progress on this? Sorry to chase, but I would really like to upgrade to the newer versions of ZFS and am stuck on 0.6.4.2 until the performance issues are fixed.
You might be interested in giving 0.7.0-rc4 a try; it has various performance enhancements. There are also various ZVOL/ZIL improvements currently in the works c/o @ryao. Not sure if any of them are relevant to your use case. Feel free to look through the list of current pull requests and give some of those a try as well.
Thanks for coming back so quickly on a Sunday. My main problem has been lots of rsync writes coming in to ZFS at once. 0.6.4.2 handled these really well, but from 0.6.5 onwards the CPU usage went high and the rsyncs and all filesystem writes blocked. Samba writes also stalled.
Has anybody who's experiencing this issue already tried the 0.7.x release?
I do, with HEAD at 3769948 and a RHEL6 kernel.
@snajpa And are you still having these write performance issues?
I've been running 0.7.1 for a couple of days now and the write performance issue I was having with 0.6.5.x seems to be gone!
Unfortunately I can't upgrade that often, next round of updates won't be for another few weeks. I'll get back to this issue within a ~month max with concrete data. |
Hi @erikjanhofstede, could you please share the ZFS parameters you applied after upgrading to 0.7.1? Thanks in advance.
@nawang87 I have some custom parameters in place for this specific configuration and workload. I didn't apply any custom parameter settings except renaming zil_slog_limit to zil_slog_bulk for running 0.7.1.
I performed some tests and, like @ab-oe, noticed that without commit 37f9dac performance was pretty good on 0.6.5.x (kernel 4.4.45). ZOL 0.6.5.x with commit 37f9dac:
ZOL 0.6.5.x with commit 37f9dac reverted:
ZOL 0.7 HEAD and 0.7.1 give similar results to the below:
I noticed that ZOL HEAD by default also has only one zvol kernel thread running:
htop during I/O shows that the zvol process does not occupy resources:
spl module with …
but anyway, htop during I/O shows that the zvol processes do not occupy resources:
However, ZOL 0.6.5.x with commit 37f9dac reverted shows that the zvol processes nicely occupy resources (CPU%):
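(For anyone trying to reproduce this comparison, a hedged way to check how the zvol taskq is configured and whether its worker threads exist; parameter names are as in 0.6.5/0.7 and the output will obviously differ per system.)

```sh
# Requested zvol worker thread count, and whether SPL creates taskq threads
# dynamically (1) or all up front (0).
cat /sys/module/zfs/parameters/zvol_threads
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic

# The zvol kernel threads that actually exist right now.
ps -eo pid,comm | grep zvol
```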
@behlendorf: maybe it's time to revert the bio work. It's obviously causing a major slowdown and there's no solution in sight. I've not seen anything from @ryao on this in a while (or actually anything; I hope he's OK), and I'm not sure anyone else is working on, or knows how to solve, the problem. If LLNL had to back the workload with a block device, would you allow engineering to implement a pathologically slow system which literally dropped 100x in performance on a commit and never fixed it? This has blocked many production implementations for us; client engineering teams shoot us down and spend 30X on ... name a vendor ... Then they're stuck in proprietary hell, locked in to the people who now own their data, with other performance and consistency problems.
Listing the following commits (for reference) that touch on the bio work or related code; may it help track down the issue and avoid having to revert all of it ☕️:
- d454121 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c [via git search]
- 37f9dac zvol processing should use struct bio [mentioned here]
- 2727b9d Use uio for zvol_{read,write}
- a765a34 Clean up zvol request processing to pass uio and fix porting regressions
- #4316 Reduce overhead in zvol_write/zvol_read via existing hold
I noticed one more performance reduction, introduced in ZOL 0.7.2. ZOL 0.7.1-1:
ZOL 0.7.2-1:
After reverting commits one by one, it seems that commit a5b91f3 introduced the slowdown. ZOL 0.7.2-1 without a5b91f3:
By the way: with the direct flag performance is generally good, but not every application can use direct I/O, so it would be nice to refactor the bio work. ZOL 0.7.2-1:
I may provide additional tests on a server experiencing the same issue (on 0.6.5.x).
@sempervictus @arturpzol @odoucet I've read through this issue again, but I'm not sure I completely understand. Can you help me understand your target workload? From the recently posted test results it looks like your focus is entirely on synchronous single-threaded IO, which is unfortunately exactly where I'd expect performance to be the worst, particularly for a raidz pool without a dedicated log device. Let me try to summarize the state of things in 0.7.2 and master and do my best to answer the questions posted. Then I can properly investigate your performance concerns and we can determine what can be done.
The flaw in the bio work referenced above has already been reverted. This was done in commits 692e55b and 8fa5250, which reinstated the task queues to resolve the asynchronous IO performance issue.
That's a bigger hit than I would have expected; we'll have to look into that. Using the current master code I tested this worst-case scenario as posted above. The volume was created with sync=always and I ran a single-threaded … That all said, you're right: in my testing it does look even worse than what I would have expected. I see a large write multiplier which I can't immediately explain; let me investigate what's going on and get back to you. On the upside, if you stick with the default values I do see reasonable performance numbers.
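(The exact command used for that worst-case run isn't quoted above; the following is my guess at an equivalent test, with the zvol name, size and block size as placeholders.)

```sh
# Worst case: synchronous, single-threaded 8K writes to a zvol with
# sync=always, so every write waits on the ZIL.
zfs create -V 10G -o volblocksize=8k tank/testvol
zfs set sync=always tank/testvol
dd if=/dev/zero of=/dev/zvol/tank/testvol bs=8k count=131072 oflag=direct
```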
@behlendorf In my main environment I use SCST, which shares zvols to an iSCSI initiator. Initially I noticed with ZOL 0.7 master a very long disk format on the Windows side. ZOL 0.6.5.x with the bio work reverted:
ZOL 0.7 HEAD:
so I tried to eliminate the SCST, network and iSCSI initiator layers and switched to local I/O using … I understand that with … Of course, I can use the default value of the sync property for the zvol, but it does not protect me from data corruption in case of power failure. @behlendorf are you able to perform a test with …
@odoucet you can simply use 0.6.4.2 or revert the bio work on the first released 0.6.5. It is a diff between the released 0.6.5 and 37f9dac
or use … For the next released version of 0.6.5 (e.g. v0.6.5.1) it will be harder to revert because of more source code dependencies.
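(A hedged sketch of that revert on a source checkout; the repository URL and tag name reflect the project layout at the time and are assumptions, and the build steps are abbreviated.)

```sh
# Check out the 0.6.5 release, revert the zvol bio rework on top of it
# (resolving any conflicts), then rebuild the modules as usual for a git install.
git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git checkout zfs-0.6.5
git revert 37f9dac
./autogen.sh && ./configure && make -s -j"$(nproc)"
```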
Make sure to run zfstests after those changes; ZFS is about data integrity after all, and if the modified filesystem code puts your data through the grinder (data corruption, no integrity, issues) it's pretty much useless.
@kernelOfTruth I agree 100% that ZFS data integrity (and everything else) needs to be tested with the bio work reverted.
@behlendorf I tested 37f9dac with PR #5824 on top; it has the same poor performance as described in most posts here. Restoring the 32 zvol threads has no impact on performance. The performance on 0.7.0-rc4 was better because of incorrect ranges being locked for indirect writes, which by the way led to broken data on zvols, but after the fix the performance is as bad as it was on 37f9dac. I found another performance regression, described in #6728, which affects the discard operation on zvols.
Hi,
First post here so apologies in advance for my mistakes!
We recently upgraded ZFS on our Ubuntu 12.04.5 SAN from 0.6.2-1precise to 0.6.5.4-1precise. This was done mistakenly as part of a crash recovery. This system has been in place and stable for several years.
Following the upgrade we have been experiencing higher disk utilisation resulting in intermittent IOWAIT CPU issues for the server and its NFS clients.
We noticed several performance changes post upgrade. Note that the upgrade occurred @ Week 13 on the graphs below.
I would guess that these issues are related to this entry in the 0.6.5 changelog -
I've also seen write coalescing mentioned on GitHub and, looking at zpool iostat, I don't see much of it happening -
What would you guys suggest? Downgrade to 0.6.4?
Thanks for your help