
RAID-Z: Increase read IOPS of metadata blocks #5095

Closed
wants to merge 2 commits into master from raidz_hybrid_kinda_proto1

Conversation

ironMann
Contributor

If an IO fits in a single block, then the corresponding raidz block consists of (nparity+1)
blocks. Additionally, all parity blocks are equal to the data block, effectively
providing mirroring. This is expected and desired behavior, since it preserves the
reliability criteria of the selected raid-z level, and should enable better IOPS for
small metadata blocks.

This can be exploited in the read path. Such blocks are usually metadata, and
selecting the least loaded child vdev among the (nparity + 1) candidates should yield more
IOPS. The higher the RAID-Z level, the more "mirror" devices there are to choose from.
The total number of child devices does not matter.

This optimization seems similar to the "RAIDZ Hybrid Allocator", but does not require
an on-disk data change.

Additionally:

  • Heuristics from vdev_mirror 9f50093
  • Mark the last offset position of the vdev in zio_vdev_child_io(). This should better
    track the offset position for all reads/writes, hopefully improving the heuristics for
    both mirror and raidz.
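
A minimal, self-contained sketch of the child selection described above (not the actual patch code); it assumes the caller supplies a per-child load value, e.g. derived from each child vdev's queue depth in the style of the vdev_mirror heuristics:

#include <stdint.h>

/*
 * Sketch only: for a single-sector block, columns 0..nparity-1 hold parity
 * and column nparity holds the data, all with identical contents.  Pick
 * whichever of those (nparity + 1) children currently has the lowest load.
 * child_load[] is a hypothetical per-child load metric.
 */
static int
raidz_pick_mirrored_child(const uint64_t *child_load, int nparity)
{
	int c, best = 0;

	for (c = 1; c <= nparity; c++) {
		if (child_load[c] < child_load[best])
			best = c;
	}
	return (best);
}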

@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch 2 times, most recently from 024c889 to d16eace on September 13, 2016 09:44

/* Non-rotating media load calculation configuration. */
extern int zfs_vdev_mirror_non_rotating_inc;
extern int zfs_vdev_mirror_non_rotating_seek_inc;
Contributor

The more appropriate style is to put extern variable declarations in a (new, if necessary) header that is included by both the mirror and raidz code. That way the compiler will complain if anybody does anything funny like changing the variable type in only one file.
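
For illustration, a minimal sketch of what such a shared header could look like (the header name and guard are hypothetical; the two externs are the existing vdev_mirror tunables):

/* sys/vdev_mirror.h -- hypothetical shared header, sketch only */
#ifndef	_SYS_VDEV_MIRROR_H
#define	_SYS_VDEV_MIRROR_H

#ifdef	__cplusplus
extern "C" {
#endif

/* Non-rotating media load calculation configuration. */
extern int zfs_vdev_mirror_non_rotating_inc;
extern int zfs_vdev_mirror_non_rotating_seek_inc;

#ifdef	__cplusplus
}
#endif

#endif	/* _SYS_VDEV_MIRROR_H */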

Contributor Author

Yeah, this is definitely not the right place for these; I wanted to keep the changes contained in the first iteration.
These tunables are somewhat out of place in the raidz code, but since this is a sort of mirroring situation, I think they could be used here as well. The documentation may need updating to reflect this.

Contributor

While I hate the idea of even more tunables, it may make the most sense to add a new set of zfs_vdev_raidz_* options. I can imagine a situation where a system has two pools, one mirrored and one raidz. For example, an SSD mirrored root pool and an HDD-based raidz primary store may need to be tuned differently.
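
A rough sketch of what such a separate set of raidz tunables could look like (the names and defaults are illustrative only, patterned after the existing zfs_vdev_mirror_* module parameters):

/* Sketch only: hypothetical raidz-specific load tunables. */
int zfs_vdev_raidz_non_rotating_inc = 0;
int zfs_vdev_raidz_non_rotating_seek_inc = 1;
int zfs_vdev_raidz_rotating_inc = 0;
int zfs_vdev_raidz_rotating_seek_inc = 5;

#if defined(_KERNEL)
/* Requires <linux/module.h> in a kernel build. */
module_param(zfs_vdev_raidz_non_rotating_inc, int, 0644);
MODULE_PARM_DESC(zfs_vdev_raidz_non_rotating_inc,
	"Non-rotating media load increment for non-seeking raidz I/Os");
#endif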

Contributor

@behlendorf Or, with the hybrid pool from @don-brady, there may be both SSD VDEVs and HDD VDEVs in the same pool.

Contributor Author

I'll wait for performance data to see what tuning is relevant. We might be fine with just looking at the vdev queue size; maybe penalizing rotational vdevs by some factor would be enough.

@behlendorf
Contributor

behlendorf commented Sep 13, 2016

This is awesome. I like how you've generalized the existing mirror logic into zio_vdev_child_io(). This whole change ended up being much smaller than I would have guessed and fell out really cleanly.

Do you have any early performance results which show that this is actually a performance win? One potentially good test case would be to do a find in a large directory which is cold. Assuming the fatzap leaves are small enough, they may not exceed a single block. Another would be to use xattrtest, found in ./tests/zfs-tests/cmd/xattrtest/; with xattr=on you'll generate a ton of small objects if you set one small xattr per file.
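
For the second suggestion, a rough user-space sketch of that kind of workload (the dataset path and file count are arbitrary assumptions; it just creates many files with one small xattr each, as an alternative to invoking xattrtest directly):

#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int
main(void)
{
	const char value[] = "small-xattr-payload";
	char path[64];

	for (int i = 0; i < 100000; i++) {
		(void) snprintf(path, sizeof (path), "/tank/xtest/f%06d", i);
		int fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return (1);
		}
		/* With xattr=on, each xattr is stored as its own small object. */
		if (fsetxattr(fd, "user.test", value, sizeof (value), 0) != 0) {
			perror("fsetxattr");
			(void) close(fd);
			return (1);
		}
		(void) close(fd);
	}
	return (0);
}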

The Amazon SIMD test failure was unrelated, resubmitting test run.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Sep 13, 2016
@behlendorf behlendorf added this to the 0.7.0 milestone Sep 13, 2016
@adilger
Contributor

adilger commented Sep 13, 2016

@ironMann I don't think this will replace the work that @don-brady is doing on hybrid pools, because that will allow both SSD and HDD VDEVs in the same pool, but it is definitely interesting regardless. The balancing of IO load is something that can definitely help in multiple ways: not only for the single IO request for the mirrored block, but also by reducing the load on other devices that might be a bit slower, so they can do something else instead of having more work piled on them.

@ironMann
Contributor Author

@behlendorf This is still very much a WIP. I wanted to check whether we can exploit this mirroring without changing the allocator. I was only able to confirm that the bonnie++ test case with a large directory tree goes through this code. I still have to check whether the rest of the raidz reliability code works properly in case of a checksum error, since the parity is accessed first, and whether there are any other consequences for the zio pipeline. I'm somewhat surprised how well this has tested so far.
I'll run some tests when I find suitable HDD test hardware. I'll focus on correctness first.

@ironMann
Contributor Author

@adilger This is only intended to exploit one specific case where metadata might be mirrored in a raid-z pool. In theory this looks good, but I still have to get real performance measurements in. Hopefully a good deal of metadata-heavy workloads would benefit from this IO balancing.

Pushing the balancing to extremes, with much faster vectorized raid-z operations it would be possible to load balance every read operation by reading from parity vdevs instead of the data on backlogged vdevs. Reconstructing the missing data will add some latency (the algorithm can be made incremental to reduce it), but not as much as waiting for an overloaded vdev. Just an idea.
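
A hedged sketch of the raidz1 case of that idea (illustrative only, not patch code): if the child holding one data column is backlogged, the parity column plus the remaining data columns can be read instead and XORed together to reconstruct the missing column.

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: raidz1 parity is a plain XOR of the data columns, so a
 * missing data column equals the parity XORed with the other data columns.
 */
static void
raidz1_reconstruct_col(const uint8_t *parity, const uint8_t **other_cols,
    int nother, uint8_t *out, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t b = parity[i];

		for (int c = 0; c < nother; c++)
			b ^= other_cols[c][i];
		out[i] = b;
	}
}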

@ironMann ironMann changed the title [RFC] RAID-Z: Increase read IOPS of metadata blocks [WIP,RFC] RAID-Z: Increase read IOPS of metadata blocks Sep 14, 2016
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch from 340c921 to 1f88023 on September 15, 2016 21:35
@thegreatgazoo

I agree it's a good idea to proactively use redundancy for load balancing, but I want to see some data. The reason raidz[2-3] always stores parity on the first 2-3 drives is that the parity is still spread over all drives, since blocks can begin at any child drive; see the big comment block in vdev_raidz_map_alloc(). That comment actually argued that there was not sufficient proof that alternating parity drives was a good idea for raidz1.

On the other hand, we might exploit it to optimize the write path for a write zio no larger than 1 << ashift (see the sketch after this list):

  1. In vdev_raidz_map_alloc(), don't even bother to allocate buffer/abd for parity.
  2. Skip parity generation and write the data buffer/abd instead.
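
A hedged sketch of the degenerate-case check behind those two steps (the function names are illustrative, not the actual ZFS routines): when the payload fits in one sector, every parity column would be byte-identical to the data, so the data buffer can simply be reused for the parity children.

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: one sector of data means P (and Q, R) are copies of D0. */
static bool
raidz_write_is_mirrored(uint64_t psize, uint64_t ashift)
{
	return (psize <= (1ULL << ashift));
}

/*
 * Caller sketch: point every column at the data buffer instead of
 * allocating parity buffers and computing parity (col_buf[] is
 * hypothetical; columns 0..nparity-1 are parity, nparity is data).
 */
static void
raidz_setup_mirrored_write(void *data_buf, void *col_buf[], int nparity)
{
	for (int c = 0; c <= nparity; c++)
		col_buf[c] = data_buf;
}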

@ironMann
Contributor Author

@thegreatgazoo You're right about the allocations. I've already started optimizing this in a separate patch. But the worst case of unneeded parity buffers is the normal read path (without failed devices).

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Sep 30, 2016
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch 5 times, most recently from 8f10727 to 3171ee2 on November 8, 2016 18:08
@ahrens
Member

ahrens commented Jan 13, 2017

I agree with what @thegreatgazoo said.

Any improvement in IOPS is based on there being uneven load on the devices in the RAIDZ group, and on vdev_raidz_child_load() being able to notice and counteract that unevenness.

Given a pool with unevenly sized blocks, the blocks will tend to be spread over the devices in an uneven manner, and thus under moderate load all devices will be roughly evenly busy. At least, that's my theory. Gathering data that shows otherwise would be enlightening :-)

For the write path, were you able to measure the performance improvement (e.g. CPU usage) of not having to compute the parity? That seems like it could be a good win for single-sector blocks, but those blocks typically constitute a small percentage of the data written.

@ironMann
Contributor Author

@ahrens
I've only managed to run a read test on a 10-HDD raid-z2 pool. The test consisted of grepping through a large directory tree of predominantly small files, as suggested by @behlendorf in his comment above. I've seen a 10% to 15% wall-time improvement, depending on the setup. Also, the %util of all drives stayed at 100% for the whole test (otherwise it was between 95% and 100%), indicating better load balancing.

I have not seen a noticeable difference for the write path, since the writes take much longer than the allocation and memcpy. This might be more relevant for non-rotational devices. Luckily, it is a simple enough change for the write path.

Lastly, I started refactoring vdev_raidz_io_done() and vdev_raidz_combrec() to ensure that a read failure on 'mirrored' devices is handled properly. Unfortunately, I had to push this down my priority queue for now; hopefully to be continued soon...

@behlendorf behlendorf modified the milestones: 0.8.0, 0.7.0 Jun 6, 2017
@ironMann ironMann mentioned this pull request Aug 8, 2017
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch 4 times, most recently from 44167c6 to b75f3ff on September 15, 2017 15:27
If an IO fits in a single block, then the corresponding raidz block consists of (nparity+1)
blocks. Additionally, all parity blocks are equal to the data block, effectively
providing mirroring. This is expected and desired behavior, since it preserves the
reliability criteria of the selected raidz level, and should enable better IOPS for
small metadata blocks, etc.

This can be exploited in the read path. Such blocks are usually metadata, and
selecting the least loaded child vdev among the (nparity + 1) candidates should yield more
IOPS. The higher the RAID-Z level, the more "mirror" devices there are to choose from.
The total number of child devices does not matter.

This optimization seems similar to the "RAIDZ Hybrid Allocator", but does not require
an on-disk data change.

Additionally:
- Heuristics from vdev_mirror 9f50093

- Use the vq_seek_offset position, set in vdev_queue_io_to_issue(). Remove the duplicate
vq_lastoffset field of struct vdev_queue.

- mirror: Fix (CID 147541) Macro ABS() compares unsigned to 0

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch from b75f3ff to e637627 on January 9, 2018 23:45
@codecov

codecov bot commented Jan 10, 2018

Codecov Report

Merging #5095 into master will decrease coverage by 3.28%.
The diff coverage is 92.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5095      +/-   ##
==========================================
- Coverage   75.26%   71.97%   -3.29%     
==========================================
  Files         296      278      -18     
  Lines       95516    94024    -1492     
==========================================
- Hits        71888    67673    -4215     
- Misses      23628    26351    +2723
Flag Coverage Δ
#kernel 70.1% <94.05%> (-4.49%) ⬇️
#user 65.77% <84.25%> (-1.69%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cba6fc6...e637627. Read the comment docs.

@behlendorf behlendorf changed the title [WIP,RFC] RAID-Z: Increase read IOPS of metadata blocks RAID-Z: Increase read IOPS of metadata blocks Jan 10, 2018
@behlendorf
Contributor

@ironMann if you're still interested in working on this proposed performance optimization, please go ahead and open a new PR whenever you're ready. In the meantime, I'm going to close this PR.

@behlendorf behlendorf closed this Jun 5, 2018
@thegreatgazoo

Just a note for those interested: the P, Q, and R blocks equal D0 when there is only one data block because the coefficients for D0 in the P/Q/R computation are all 1.
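
For reference, a schematic sketch of why that holds, using the usual Reed-Solomon style definitions over GF(2^8) with generator g (indexing shown loosely):

P = \sum_i D_i, \qquad Q = \sum_i g^{i} D_i, \qquad R = \sum_i \bigl(g^{2}\bigr)^{i} D_i

With only D_0 present, every coefficient is g^0 = 1, so P = Q = R = D_0.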

@jumbi77
Contributor

jumbi77 commented Sep 9, 2018

@thegreatgazoo Are you maybe able and willing to pick up and finish the PR? That would be really great. Sadly, I guess @ironMann isn't working on ZFS anymore.

But many thanks to both of you anyway; I really appreciate your contributions.

@jumbi77 jumbi77 mentioned this pull request Mar 21, 2021