Reduced performance for 0.6.5 in virtualized environment #4880
I've also demonstrated this on a separate system while trying to backport performance commits from the 0.6.5 stream (taking the merge created by ryao in issue #3888, which merged back 4 commits). The two deployments are pretty much the same, outlined below (different previous version of our software / Xen etc. to the earlier post - no SSDs in these machines):

Deployment A (ZFS 0.6.4.2)
Xen disk driver domain: 4GB, 4 vcpus
zpool status: pool: diskconvm - these are 2 x 2TB disks
(arcstats, cat /proc/spl/kstat/zfs/dmu_tx and zfs module parameter dumps omitted)

Deployment B (ZFS 0.6.4.2, with backport)
Xen disk driver domain: 4GB, 4 vcpus
zpool status: errors: No known data errors - these are 2 x 2TB disks
(arcstats and cat /proc/spl/kstat/zfs/dmu_tx dumps omitted)
@kernelOfTruth I had looked through that issue a little while back (although probably over a month ago now) and it seemed orthogonal: it referenced write performance and my issue seems very much to relate to read performance. However, giving it a quick look I see it has moved on since I last looked at it, and there are references to the bio changes, which my last post showed were also producing the slowdown when backported to 0.6.4.2. I'll give it a proper read and try applying any patches to the 0.6.5.7 branch to see what happens to this issue.
@kernelOfTruth @dweeezil I've taken a detailed look through the #4512 issue thread - quite a catch-up! Anyway, I'm wondering if my issue is the same, as the symptoms I'm seeing would seem to be read- rather than write-centred, but it is the same sort of area, with the reads in 0.6.5.7 unable to be coalesced as the queue size never gets above 1 in the read phase of the benchmark. There was a mention in the thread by one participant tracing the performance issue to #37f9dac - this would seem to be aligned with my second comparison, both systems on 0.6.4.2 but with that commit backported onto one of them - does this make sense? For the first comparison (0.6.4.2 v 0.6.5.7) the kernels were:
For the second comparison (0.6.4.2 v 0.6.4.2 with backport) the kernels were:
Any pointers on specific commits to revert / patches to try would be appreciated - I'll try reverting the commit from the other thread, but I'm not expecting it to help here.
One more thing: all block devices in all scenarios above (all 4 systems tested) report the same value for:
cat /sys/block/sd[a-b]/queue/max_sectors_kb
I'm still trying to look into this issue, but without much luck. The second scenario above in particular, where a few 0.6.5-stream commits are backported onto the 0.6.4.2 version, is fascinating, as it seems to reproduce our performance drop-off as seen with the 0.6.5 stream with only a small set of changes under the microscope. If anybody wants any more details of the system / stats taken during tests, let me know - happy to do some further leg work on this.
Hi @koplover, this issue seems to be observed by multiple users, and is related to using zvols in virtualization environments. It would be very useful if the benchmarks could be boiled down to some minimal setup, reproducible by different setups and people. I don't know if there is already an existing ZFS test setup for this. For example, Ceph has a tool to benchmark block devices in a kvm-like environment called kvmrbdfio, which uses fio.
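Something along these lines could be a starting point for a minimal, shareable reproduction - a sketch only, assuming a scratch zvol named tank/jetstress-test and a 60/40 read/write mix; the real JetStress pattern will differ:

```bash
# Create a scratch 64K-block zvol to match the NTFS/zvol block size discussed above,
# then run a mixed random read/write fio job directly against it.
zfs create -V 50G -o volblocksize=64K tank/jetstress-test
fio --name=exchange-mix --filename=/dev/zvol/tank/jetstress-test \
    --rw=randrw --rwmixread=60 --bs=64k --ioengine=libaio \
    --iodepth=16 --direct=1 --runtime=600 --time_based --group_reporting
```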
@dweeezil I intend to put my details of 0.7.0 rc1 in this issue rather than create a brand new one for essentially the same issue - let me know if it would help to create a new one for 0.7.0 RC1 as you requested in #4512 and I'll get that done. Anyway, in essence the situation seems exactly the same under 0.7.0 rc1 as for 0.6.5.x. While running, iostat still shows a < 1 average queue size in 0.7.0 rc1, implying the zvol cannot be fed quickly enough. zpool iostat -r gives:
Unfortunately, blkparse is still not working in this rc, so I cannot provide a trace of the block actions on the zvol. However, this is an Exchange benchmark, and the overlying VM is the same, so I've been trying to reproduce the general pattern with fio jobs, but it's proving tricky. The general pattern of the disk requests in 0.6.4.2, which I can trace with blktrace, is as follows:
The benchmark tool indicates an increase in the read / write latency as viewed from the VM whose database is running on the monitored zvol. I had heard that there was a patch to re-enable blktrace tracing for ZoL and was expecting it to be included in this candidate release - but it still shows only the Queue actions. Was I imagining this, as I cannot find a reference to that patch now? I'll add the 0.6.4.2 blktrace in a gist later as that may be helpful, but the pattern is as above.
A gist with a small sample of the 0.6.4.2 blktrace output, taken while the benchmark was running, is uploaded to: https://gist.github.com/koplover/6af0f58220d9df0b63819eed2c9500ec
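For reference, a trace like the one in the gist can be captured roughly as follows - a sketch, assuming the benchmarked zvol appears as /dev/zd0 (check /dev/zvol/<pool>/<vol> for the mapping):

```bash
# Capture 60s of block-layer events on the zvol, then render them to text for sharing.
blktrace -d /dev/zd0 -w 60 -o zvol-trace
blkparse -i zvol-trace -o zvol-trace.txt
```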
@koplover What's your kernel version? I've discovered a potential issue which appears to be exacerbated by the re-worked (bio-based) zvol code on write-heavy workloads, particularly on systems with a lot of CPUs. There are a handful of variables involved, but what seems to be happening is that without the old "zvol" taskq acting as a "buffering layer", a write-heavy workload can overload the write issue taskq to the point that the CPUs are spending most of their time contending for its lock. Somewhat related, it's also a problem that the write issue taskq is dynamic, because that increases the contention further. This issue largely goes away, however, with b58986e and a sufficiently new kernel with 16K stacks, but no released version of ZoL yet has that patch. Your issue (@koplover) may have nothing to do with all this; however, it would be interesting to run perf and see whether lock contention is a factor.
As to the blktrace situation, you might be referring to 8198d18, which gives accounting (iostat, etc.), but I don't think it will impact blktrace at all; only a small number of other standard block devices in the kernel use this bio-based approach.
@koplover Skimming your information posted above, the change in write latency between 0.6.4.2 and 0.7.0 is rather striking, but it begs the question as to whether the pool really is able to deliver sub-2ms latency or whether something may have actually been broken in 0.6.4.2. Have you tried the new zpool iostat latency options?
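Something along these lines would do for a quick look - a sketch, assuming perf is available in the domain running ZFS:

```bash
# Live view; heavy time in spinlock / taskq paths is the tell-tale sign of contention.
perf top -g
# Or capture 30s system-wide during the benchmark and inspect afterwards.
perf record -a -g -- sleep 30
perf report --sort symbol
```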
A few additional iostat dumps for this 0.7.0 RC1 build. zpool iostat -r gives:
so some very high latencies here! There are other 'background' operations going on, which I'll try to turn off, but the majority of writes in the period are from the test. zpool iostat -l 2 gives:
It's not really clear to me what these statistics mean, as they do not seem to be aggregated over the period given (2 seconds). If I run at a later point with a smaller period (0.1s) I see the output below, and as the traffic is fairly constant that seems a little strange:
Looking at the queues with the new zpool iostat -q option, zpool iostat -q 2 gives:
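For reference, these are roughly the invocations behind the dumps above (the pool name is a placeholder - substitute your own):

```bash
zpool iostat -r tank 2    # request size histograms
zpool iostat -l tank 2    # per-interval latency columns
zpool iostat -q tank 2    # queue depths per I/O class
```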
@koplover Could you please enclose those dumps in triple backticks (```) to preserve the original formatting?
@dweeezil Ah, I've just read your post - I was gathering some additional information, posted rather messily above. The kernel is 3.19.0-39. The domain running the zfs zvols has 4 vcpus, so not a large number. What remains interesting from an external monitoring perspective of iostat on the zvol being stressed is the relatively high queue size for 0.6.4.2 (sometimes > 20 if memory serves), and the < 1 always seen with 0.6.5.x and now 0.7. It feels like the newer zfs versions are faster, but something else is stopping them being fed quickly enough by the applications above.
@dweeezil done - apologies, I was dumping as the run was going and was going to look at formatting later.
@dweeezil I've had a look at perf while the performance logging is going on, and pretty much 90% of the time is spent in the Xen scheduler. Locking does not seem to be an issue in this scenario; the most I saw was 0.15% in the highest spinlock routine. It would be really great to backport the new zpool iostat options to 0.6.4.2 to compare what was going on there; I'll take a look at the feasibility. As I say, I've been trying to nail it by running fio tests, but have so far been unsuccessful with an array of sync / async and random / sequential read and write fio jobs. I have a combined read / write job which shows an approximate match to the benchmark traffic when viewed through blktrace on a 0.6.4.2 system, and this is showing some promise. It does not show the same disparity, although write seems faster on 0.6.4.2 (read is faster on 0.7.0 in this combined test, however).
@koplover Regarding your observation of short queue lengths in iostat's output: that is to be expected with the move to bio-based zvols, since they no longer have a request queue for iostat to measure.
NB @koplover, do you see one I/O queued per-zvol or per-pool? @dweeezil I think b58986e is a good idea, but not related to this issue.
@ironMann Interesting find; however, not strictly aligned with my issue, I believe. The test systems I am using have just a single mirrored vdev, so there is not much scope for the unbalanced IOPS to the vdevs that you describe. To be sure, and out of interest in how our system is affected by moving to caching metadata only rather than everything for the target zvol, I set it up on both 0.6.4.2 and 0.7.0. [We do have other systems with 3 x mirror vdevs, so I'd assume your finding is very interesting for those and will help their performance in the future; unfortunately I do not have any to hand to test.] On the 0.6.4.2 system it downgraded performance to what I see when benchmarking a 0.7.0 system. For 0.7.0 it cut the benchmark-reported performance in half. This was also true for writes (although the benchmark is mixed read/write as discussed earlier, in the initial phase there is a write-only stage where the db is created). When monitoring this pure write stage, 0.6.5.x and 0.7.0 have always been much better (> 40% better than 0.6.4.2) - with primarycache set to metadata it was approx 20% slower - I had not expected that to be the case.
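For reference, the primarycache experiment above is just the standard property toggle - a sketch, with a placeholder dataset name for the zvol backing the Exchange database:

```bash
zfs set primarycache=metadata tank/exchange-db   # cache only metadata in the ARC
zfs get primarycache tank/exchange-db            # confirm the setting
zfs set primarycache=all tank/exchange-db        # revert (as used for the later runs)
```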
@richardelling I'm not using zpool iostat but iostat on the zvol volume and on the underlying sd[a-b] in the vdev mirror. I was talking about queue depth on the zvol being monitored, which was always < 1 for 0.6.5.x and 0.7.0, but @dweeezil indicates that this is to be expected with the move to bio-based block devices - which I was not aware of. When monitoring the underlying disks, I need to regather the information and confirm, but I thought I saw a significantly lower queue depth and utilization with 0.6.5.x / 0.7.0 compared to 0.6.4.2. However, I'll rerun and post some real results.
Analysis during the performance benchmark: with 0.7.0 RC1 I see the following (the zpool is a single mirror vdev containing sda and sdb):
Note, I've switched back to primarycache=all on both systems, so the write phase is back to being significantly faster on 0.7.0 RC1, and in this case the queue depths were large (8-10). For 0.6.4.2 I see improved utilization and queue depths hovering around 2:
Here, for the initial write phase (which does not contribute to the benchmark values), the queue depths were low, < 1.5.
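For reference, the queue-depth and utilization figures above come from plain iostat sampling - roughly this sort of invocation (the zvol's zdN name is an assumption; check /dev/zvol/ for the mapping):

```bash
# avgqu-sz = average queue depth, %util = device utilization, sampled every 5s.
iostat -x sda sdb zd0 5
```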
I've been out of the loop for a little while - does anybody want any further investigation on our system to look at this performance issue for zvols in a virtualized environment? As outlined above, I last tried 0.7.0 RC to see if it addressed any of the issues, with the same results on our benchmark.
I've built our test system with 0.7.0 RC3 with PR 5824 applied and the results are encouraging. There is certainly a reported uplift in performance for our benchmark results (JetStress) - not quite at the reported values for 0.6.4.2, but way above what was reported for the various vanilla 0.6.5.x versions and the vanilla 0.7.0 RC1 tested most recently. For comparison, the various benchmark results were as follows:
0.7.0 RC1
0.7.0 RC3 with PR5824 applied
The latest test is on the same physical server as the prior 0.7.0 RC1 test, with a single mirrored zpool (sda/sdb), the same configuration as all cases above. I'll pass on the iostat analysis for this latest test, but it is no longer stuck on a < 1 average queue size.
Taking a look at the iostat on sda/sdb in the same way as above, the results during the performance benchmarking stage are looking a lot better - aligned with the 0.6.4.2 results above, with the average queue size no longer being constrained.
The various stages of the benchmark (database creation - write-only phase, benchmarking - mixed read/write phase, and checksum checking - read-only phase) provide some interesting numbers when compared with the original post (which compared the 0.6.4.2 and 0.6.5.7 versions). Note, although the geometry of the tested server is the same (mirrored 2x disk), the disk sizes for those original tests were 4TB and this latest test is on 2 x 2TB.
Database creation phase (synchronous writes, no reads)
Benchmark measurement phase (mixed read / write)
Checksum checking phase (read only phase, checking db integrity at the end of the benchmark)
@koplover I am actively working on this as part of my new job that I started last week. Expect to hear back from me within the week. |
@koplover Just as an FYI, that patch wasn't what I meant when I said that I am actively working on this. There are multiple changes in the pipeline, including pipelining the ZIL itself.
@ryao Sounds good, looking forward to seeing the fruits of this labour - are you using zvols in this new job? |
I've found time to do a little more testing, to compare 0.6.4.2 and 0.7.0 RC3 performance in our environment and act as a more robust data point. Looking back on my notes, I realised that the 0.6.4.2 results from the start of this thread were a little unfair. We have a couple of server configurations in play here: one basic (2 x 2TB vdev with log and cache), and the other premium (2 x 4TB with log and cache device). Although the disk size should not significantly skew the results, the SSD and log partition used in the two cases are different. The basic server has a 128GB SSD and the premium a 256GB SSD; the premium SSD's write speed is significantly faster than the 128GB spec, and the partition assigned to the log vdev on the premium server is twice the size of that on the basic server. The original tests were all against a premium server, while my latest 0.7.0 RC3 + PR results were for a basic server. I've now got hold of another basic server and installed 0.6.4.2 on it, so the servers for the following results are exactly the same spec, and the operations performed are equivalent and run in step following the initial restart of the servers.
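For reference, the pool layout on these servers looks roughly like the sketch below (placeholder device names; the real servers use dedicated SSD partitions for the slog and l2arc):

```bash
# Mirrored data vdev plus separate log (slog) and cache (l2arc) devices.
zpool create tank mirror /dev/sda /dev/sdb \
    log /dev/disk/by-partlabel/slog \
    cache /dev/disk/by-partlabel/l2arc
zpool status tank
```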
Initial case taken back to the best known state possible.
There remain obvious differences, in particular the metaslab consumption, as the 0.7.0 system has been up for double the time, but both are relatively low, and the initial results show very close correlation between the two servers. For this starting point, the basic results from the benchmark were as follows:
This went through the full phases of the test benchmark:
Detailed results for this run were as follows:
Database Creation:
0.7.0 RC3 + PR example datapoint
The duration for this 48GB database write was:
Performance Logging (time fixed for 30 minutes):
Example 0.7.0 RC3 + PR data point:
Obviously the actual read and write values vary around this point by 10-20%, as the data points were only taken over 5s; I should probably take a longer sample period for future tests.
Database checksum verification (read through the database checksums):
0.7.0 RC3 + PR example datapoint
The actual duration of the operation again shows that 0.7.0 RC3 + PR is the clear winner for sequential read, as below: So the performance for this sequential read is roughly double for 0.7.0 RC3 over 0.6.4.2 in this test. The arc stats are a little strange and do not, on first glance, seem to back this read performance up. On the 0.6.4.2 server, an arcstat sample was:
And for 0.7.0 RC3 + PR
0.7.0 RC3 seems to show a significant number of metadata misses compared with 0.6.4.2 - looking only at these stats I'd expect read performance for 0.7.0 to be significantly worse, but that is clearly not the case given the measurements earlier - what is happening here?
Can you double-check this? |
I reran the same test again immediately after the one above, with pretty much exactly the same results. More interesting is when I then re-ran the tests omitting the first optional database creation (so the tests re-attached to the existing database on disk). Again this was commenced only a couple of minutes after the last set of tests completed. The difference in benchmark results is significant in this case:
I did not capture any detailed stats for the 'attach-only' run above, so I retried the same test run the next morning, with deteriorating results on both servers, presumably as background processes evicted test data from the ARC. This time the benchmark results gave:
The details of this run are as follows. No additional database creation - re-attached to the existing database from prior tests.
Performance Logging phase:
0.7.0 RC3 example data point:
Database checksum verification phase:
0.7.0 RC3 + PR example datapoint:
Again 0.7.0 gives better read performance, despite the arcstats seemingly implying that it is having a great deal of trouble with metadata misses, which is normally an indicator of worse performance. A sample of the 0.7.0 RC3 arcstats during this period is:
@behlendorf |
Running the same test again immediately after this one completes raises the figures for both:
0.6.4.2 server: 481 transactional IOs
0.7.0 RC3 + PR server: 334 transactional IOs
Last data point test I've tried, immediately after the above, run as for the first test (including the optional database creation sequential write phase):
0.6.4.2 server: 485 transactional IOs /s
Again 0.6.4.2 returns to its initial value, whereas 0.7.0 is stuck 33% down on its original benchmark value.
Database creation phase:
0.7.0 RC3 + PR server
0.6.4.2 duration: 23 minutes 42 seconds
Performance logging phase:
0.7.0 RC3 + PR server:
Database checksum verification phase
0.7.0 RC3 + PR server
This sequential read phase is again significantly faster:
So again twice as fast in 0.7.0 RC3 + PR, even though the arcstats would imply issues to me:
In summary, these are my key observations from the above tests:
What significant changes have there been to the ARC / L2ARC caching that may explain this? It's not an entirely satisfying answer anyway, as the sequential read phase of db checksums is significantly faster in 0.7.0 RC3 + PR.
@behlendorf, @dweeezil - does the above make sense? Anything else it would be useful to find out from these two systems? @sempervictus FYI I've looked at the write improvement suggestions; I've tried:
But to reiterate - results for 0.7.0 RC3 are much better with this PR than without it on our system.
@lnicola yes thanks - that was a typo in the first set of performance logging results for 0.7.0 RC3 - pasted in the same sample twice! |
@koplover thanks for the testing results and for confirming PR 5824 is a step in the right direction. If your testing system is still set up, it would be great if you could re-run the same tests using rc4, which was recently tagged. This release contains PR 5824 (enabled by default), additional zvol performance tweaks (5731140, bc17f10) and some changes to address issues observed with the ARC (9b50146, 30fffb9, b855550, 2e91c2f, 1a31dcf, 44813ae, 4149bf4, 03b60ee). You should see comparable performance to rc3 + PR 5824.
There's a significant amount of change between these two releases which could account for this. Compressed ARC, ABD, etc. But it's good to see we're competitive again with the older release. With @ryao focusing on further optimizing zvols I'm optimistic.
This may be the result of the recent compressed ARC feature. Not only is the ARC now compressed in memory, but the L2ARC is compressed on-disk and the size of the headers needed to track the L2ARC data have been reduced. The net effect is you're able to cache considerably more data which is great for reads.
This might be related to an ARC accounting leak recently fixed (423e7b6). It will be interesting to see if you observe this same behavior in rc4, where this has been addressed.
There are two values reported in the arcstats for the L2ARC: l2_size, the logical size of the cached data, and l2_asize, the space actually allocated on the cache device; with compressed ARC the two can differ considerably.
I can see why this is misleading. The metadata miss rate is clearly up, but many more I/Os are being pushed through the system, and it's making much better use of the l2arc device. Rarely do we need to go all the way back to disk, according to the l2miss%, which is down significantly.
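For reference, both L2ARC size counters (and the raw hit/miss numbers) can be pulled straight from the kstats rather than relying on an arcstat alias - a sketch:

```bash
# l2_size = logical bytes cached, l2_asize = bytes actually allocated on the cache device.
grep -E '^(l2_size|l2_asize|l2_hits|l2_misses) ' /proc/spl/kstat/zfs/arcstats
```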
As far as the performance regression on reused blocks goes (old vs new sets/vols), I don't think it's the arc accounting leak, as master from yesterday with our common stack (trim, missing snaps sends, etc) doesn't show improvement in the latency of access (r/w) to those old vols. Snaps make it worse, which makes me think there's a serialized/linear blkptr resolution going on.
I imagine this is why dtrace was made, but I'd bet my left shoe that if we could graph linear task execution against concurrent tasks, we would see a ton of relative wait somewhere (probably in block selection from free maps on writes, and resolution of the current logical state from snaps/freed blocks on reads). Smells and sounds like a bottleneck, maybe in the dmu/spa - we just can't tell what it looks like when we can't see it.
@behlendorf Thanks for the explanation - my alias was using l2size, so that explains the arc size; I'll use both values in future to provide a better, fuller picture. In terms of trying RC4, this probably means a server redeploy, so it will take longer than just reusing the existing server, but I will try in-place from the dailies first. Are you intending to backport any of these performance enhancements into the 0.6.5.x stream? I was hoping to build that next, as this stream should be more stable than a new 0.7.0, and we need to get this out onto our production servers as soon as possible. We are struggling with 0.6.4.2 instabilities in the field and want to take the stability enhancements without the performance penalties and with the least risk possible. 0.6.5.x (0.6.5.10?) gives the best compromise of risk v timing from our perspective - I assume 0.7.0 is still some way off, and when released will have teething issues despite the additional testing from the community.
Closed as stale; the current stable branch is now 0.8.
We have been using zfs for a couple of years in our xen virtualized environment. Initially we used 0.6.3 before moving to 0.6.4.2 running within our disk driver domain.
We have been experiencing various lockup issues in these releases, exacerbated by memory pressure and impacting our guest workloads and end-user experience, and have been trying to move to the 0.6.5 stream for the last 6 months to improve stability. Unfortunately, we experience performance issues when moving to 0.6.5.
With two identical systems (hardware and software) except zfs version (0.6.4.2 and 0.6.5.7), I run a benchmark (Microsoft JetStress) on a Windows Xen guest to establish disk subsystem performance to understand performance for an Exchange DB. Running this benchmark against 0.6.4.2 (Achieved Transactional I/O per Second 464.083) shows a far better performance figure than the configuration using 0.6.5.7 (Achieved Transactional I/O per Second 226.203)
In these benchmarks an Exchange database is created and manipulated in a separate volume of the Windows guest VM, formatted as 64K-block NTFS and backed underneath by a 64K-block zvol. From monitoring this underlying zvol using iostat, it seems something has changed in the implementation of the zvols which means that concurrency at the disk layer is impaired (the queue size in the 0.6.5.7 deployment during the actual measurement stage stays just below 1 when viewed with iostat).
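For reference, a zvol like this would be created roughly as follows - a sketch with placeholder pool/volume names and size; volblocksize=64K is the relevant detail for matching the 64K NTFS format:

```bash
zfs create -V 200G -o volblocksize=64K tank/exchange-db
zfs get volblocksize,volsize tank/exchange-db
```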
Looking at the various stages of the benchmark and the impact on the zvol via iostat I see the following:
Database creation phase (synchronous writes, no reads)
0.6.5.7 - example sample
0.6.4.2 - example sample
So write performance in this benchmark initialisation stage is significantly better for the 0.6.5.7 deployment compared to the 0.6.4.2 deployment.
Benchmark measurement phase (mixed read / write)
0.6.5.7 - example sample
0.6.4.2 - example sample
Read / write throughput is significantly lower here for 0.6.5.7; I suspect the read difference is driving the impact on writes. The average queue size is not huge for 0.6.4.2, but it is > 1.
Checksum checking phase (read only phase, checking db integrity at the end of the benchmark)
0.6.5.7 - example sample
0.6.4.2 - example sample
The stats look worse for 0.6.4.2, with the exception of read KB/s, which is better. This is a (mostly) sequential read, 128 blocks at a time, through the db to perform the verification. I assume the larger average queue size allows for greater concurrency lower down the stack - and for some reason in 0.6.5.7 there is some form of serialization keeping the queue size below 1 here?
Some further information on the deployments:
0.6.5.7
VM running zfs:
8GB, 4 vcpus
errors: No known data errors
SLOG is 5GB
L2-cache is 40GB
arcstats (note: not taken while the benchmark was running, captured later)
cat /proc/spl/kstat/zfs/arcstats
The zfs parameters are as follows:
for i in *; do echo $i $(cat $i); done
cat /proc/spl/kstat/zfs/dmu_tx
0.6.4.2
VM running zfs:
8GB, 4 vcpus
SLOG is 5GB
L2-cache is 40GB
The zfs parameters are as follows:
for i in *; do echo $i $(cat $i); done
Unfortunately, the machine was redeployed so I lost this information; I'll rerun the benchmark and post again when the details below are available.
cat /proc/spl/kstat/zfs/dmu_tx
arcstats
cat /proc/spl/kstat/zfs/arcstats