
RAID-Z: Increase read IOPS of metadata blocks #5095

Closed
wants to merge 2 commits into master from raidz_hybrid_kinda_proto1

Conversation

ironMann
Contributor

If an IO fits in a single block, then the corresponding raidz block consists of (nparity+1)
blocks. Additionally, all parity blocks are equal to the data block, effectively
providing mirroring. This is expected and desired behavior, since it preserves the
reliability criteria of the selected raid-z level, and should enable better IOPS for
small metadata blocks.

This can be exploited in the read path. Such blocks are usually metadata, and
selecting the least loaded child vdev among the (nparity + 1) candidates should yield more
IOPS. The higher the RAID-Z level, the more "mirror" devices there are to choose from.
The total number of child devices does not matter.

This optimization seems similar to the "RAIDZ Hybrid Allocator", but does not require
an on-disk data change.

Additionally:

  • Heuristics from vdev_mirror 9f50093
  • Mark the last offset position of the vdev in zio_vdev_child_io(). This should better
    track the offset position for all reads/writes, hopefully improving the heuristics for
    both mirror and raidz.
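
A minimal, self-contained sketch of the child selection described above (not the actual patch code); it assumes the caller supplies a per-child load value, e.g. derived from each child vdev's queue depth in the style of the vdev_mirror heuristics:

#include <stdint.h>

/*
 * Sketch only: for a single-sector block, columns 0..nparity-1 hold parity
 * and column nparity holds the data, all with identical contents.  Pick
 * whichever of those (nparity + 1) children currently has the lowest load.
 * child_load[] is a hypothetical per-child load metric.
 */
static int
raidz_pick_mirrored_child(const uint64_t *child_load, int nparity)
{
	int c, best = 0;

	for (c = 1; c <= nparity; c++) {
		if (child_load[c] < child_load[best])
			best = c;
	}
	return (best);
}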

@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch 2 times, most recently from 024c889 to d16eace on September 13, 2016 09:44

/* Non-rotating media load calculation configuration. */
extern int zfs_vdev_mirror_non_rotating_inc;
extern int zfs_vdev_mirror_non_rotating_seek_inc;
Contributor

The more appropriate style is to put extern variable declarations in a (new, if necessary) header that is included by both the mirror and raidz code. That way the compiler will complain if anybody does anything funny like changing the variable type in only one file.
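
For illustration, a minimal sketch of what such a shared header could look like (the header name and guard are hypothetical; the two externs are the existing vdev_mirror tunables):

/* sys/vdev_mirror.h -- hypothetical shared header, sketch only */
#ifndef	_SYS_VDEV_MIRROR_H
#define	_SYS_VDEV_MIRROR_H

#ifdef	__cplusplus
extern "C" {
#endif

/* Non-rotating media load calculation configuration. */
extern int zfs_vdev_mirror_non_rotating_inc;
extern int zfs_vdev_mirror_non_rotating_seek_inc;

#ifdef	__cplusplus
}
#endif

#endif	/* _SYS_VDEV_MIRROR_H */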

Contributor Author

Yeah, this is definitely not the right place for these; I wanted to keep the changes contained in the first iteration.
These tunables are somewhat out of place in the raidz code, but since this is a sort of mirroring situation, I think they could be used here as well. The documentation may need updating to reflect this.

Contributor

While I hate the idea of even more tunables, it may make the most sense to add a new set of zfs_vdev_raidz_* options. I can imagine a situation where a system has two pools, one mirrored and one raidz. For example, an SSD mirrored root pool and an HDD-based raidz primary store may need to be tuned differently.
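
A rough sketch of what such a separate set of raidz tunables could look like (the names and defaults are illustrative only, patterned after the existing zfs_vdev_mirror_* module parameters):

/* Sketch only: hypothetical raidz-specific load tunables. */
int zfs_vdev_raidz_non_rotating_inc = 0;
int zfs_vdev_raidz_non_rotating_seek_inc = 1;
int zfs_vdev_raidz_rotating_inc = 0;
int zfs_vdev_raidz_rotating_seek_inc = 5;

#if defined(_KERNEL)
/* Requires <linux/module.h> in a kernel build. */
module_param(zfs_vdev_raidz_non_rotating_inc, int, 0644);
MODULE_PARM_DESC(zfs_vdev_raidz_non_rotating_inc,
	"Non-rotating media load increment for non-seeking raidz I/Os");
#endif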

Contributor

@behlendorf Or, with the hybrid pool from @don-brady, there may be both SSD VDEVs and HDD VDEVs in the same pool.

Contributor Author

I'll wait for performance data to see what tuning is relevant. We might be fine with just looking at the vdev queue size; maybe penalizing rotational vdevs by some factor would be enough.

@behlendorf
Contributor

behlendorf commented Sep 13, 2016

This is awesome. I like how you've generalized the existing mirror logic into zio_vdev_child_io(). This whole change ended up being much smaller than I would have guessed and fell out really cleanly.

Do you have any early performance results which show that this is actually a performance win? One potentially good test case would be to do a find in a large directory which is cold. Assuming the fatzap leaves are small enough, they may not exceed a single block. Another would be to use xattrtest, found in ./tests/zfs-tests/cmd/xattrtest/; with xattr=on you'll generate a ton of small objects if you set one small xattr per file.
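
For the second suggestion, a rough user-space sketch of that kind of workload (the dataset path and file count are arbitrary assumptions; it just creates many files with one small xattr each, as an alternative to invoking xattrtest directly):

#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int
main(void)
{
	const char value[] = "small-xattr-payload";
	char path[64];

	for (int i = 0; i < 100000; i++) {
		(void) snprintf(path, sizeof (path), "/tank/xtest/f%06d", i);
		int fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return (1);
		}
		/* With xattr=on, each xattr is stored as its own small object. */
		if (fsetxattr(fd, "user.test", value, sizeof (value), 0) != 0) {
			perror("fsetxattr");
			(void) close(fd);
			return (1);
		}
		(void) close(fd);
	}
	return (0);
}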

The Amazon SIMD test failure was unrelated, resubmitting test run.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Sep 13, 2016
@behlendorf behlendorf added this to the 0.7.0 milestone Sep 13, 2016
@adilger
Contributor

adilger commented Sep 13, 2016

@ironMann I don't think this will replace the work that @don-brady is doing on hybrid pools, because that will allow both SSD and HDD VDEVs in the same pool, but it is definitely interesting regardless. The balancing of IO load is something that can definitely help in multiple ways: not only for the single IO request for the mirrored block, but also by reducing the load on other devices that might be a bit slower, so they can do something else instead of having more work piled on them.

@ironMann
Contributor Author

@behlendorf This is still very much a WIP. I wanted to check whether we can exploit this mirroring without changing the allocator. I was only able to confirm that the bonnie++ test case with a large directory tree goes through this code. I still have to check whether the rest of the raidz reliability code works properly in case of a checksum error, since the parity is accessed first, and whether there are any other consequences for the zio pipeline. I'm somewhat surprised how well this has tested so far.
I'll run some tests when I find suitable HDD test hardware. I'll focus on correctness first.

@ironMann
Contributor Author

@adilger This is only intended to exploit one specific case where metadata might be mirrored in a raid-z pool. In theory this looks good, but I still have to get real performance measurements in. Hopefully a good deal of metadata-heavy workloads would benefit from this IO balancing.

Pushing the balancing to extremes, with much faster vectorized raid-z operations it would be possible to load balance every read operation by reading from parity vdevs instead of the data on backlogged vdevs. Reconstructing the missing data will add some latency (the algorithm can be made incremental to reduce it), but not as much as waiting for an overloaded vdev. Just an idea.
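
A hedged sketch of the raidz1 case of that idea (illustrative only, not patch code): if the child holding one data column is backlogged, the parity column plus the remaining data columns can be read instead and XORed together to reconstruct the missing column.

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch only: raidz1 parity is a plain XOR of the data columns, so a
 * missing data column equals the parity XORed with the other data columns.
 */
static void
raidz1_reconstruct_col(const uint8_t *parity, const uint8_t **other_cols,
    int nother, uint8_t *out, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t b = parity[i];

		for (int c = 0; c < nother; c++)
			b ^= other_cols[c][i];
		out[i] = b;
	}
}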

@ironMann ironMann changed the title [RFC] RAID-Z: Increase read IOPS of metadata blocks [WIP,RFC] RAID-Z: Increase read IOPS of metadata blocks Sep 14, 2016
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch from 340c921 to 1f88023 on September 15, 2016 21:35
@thegreatgazoo

I agree it's a good idea to proactively use redundancy for load balancing, but I want to see some data. The reason raidz[2-3] always stores parity on the first 2-3 drives is that the parity is still spread over all drives, since blocks can begin at any child drive; see the big comment block in vdev_raidz_map_alloc(). That comment actually argued that there was not sufficient proof that alternating parity drives was a good idea for raidz1.

On the other hand, we might exploit it to optimize the write path for a write zio no larger than 1 << ashift (see the sketch after this list):

  1. In vdev_raidz_map_alloc(), don't even bother to allocate buffer/abd for parity.
  2. Skip parity generation and write the data buffer/abd instead.
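
A hedged sketch of the degenerate-case check behind those two steps (the function names are illustrative, not the actual ZFS routines): when the payload fits in one sector, every parity column would be byte-identical to the data, so the data buffer can simply be reused for the parity children.

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: one sector of data means P (and Q, R) are copies of D0. */
static bool
raidz_write_is_mirrored(uint64_t psize, uint64_t ashift)
{
	return (psize <= (1ULL << ashift));
}

/*
 * Caller sketch: point every column at the data buffer instead of
 * allocating parity buffers and computing parity (col_buf[] is
 * hypothetical; columns 0..nparity-1 are parity, nparity is data).
 */
static void
raidz_setup_mirrored_write(void *data_buf, void *col_buf[], int nparity)
{
	for (int c = 0; c <= nparity; c++)
		col_buf[c] = data_buf;
}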

@ironMann
Contributor Author

@thegreatgazoo You're right about the allocations. I've already started optimizing this in a separate patch. But the worst case of unneeded parity buffers is the normal read path (without failed devices).

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Sep 30, 2016
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch 5 times, most recently from 8f10727 to 3171ee2 on November 8, 2016 18:08
@ahrens
Member

ahrens commented Jan 13, 2017

I agree with what @thegreatgazoo said.

Any improvement in IOPS is based on there being uneven load on the devices in the RAIDZ group, and on vdev_raidz_child_load() being able to notice and counteract that unevenness.

Given a pool with unevenly sized blocks, the blocks will tend to be spread over the devices in an uneven manner, and thus under moderate load all devices will be roughly evenly busy. At least, that's my theory. Gathering data that shows otherwise would be enlightening :-)

For the write path, were you able to measure the performance improvement (e.g. CPU usage) of not having to compute the parity? That seems like it could be a good win for single-sector blocks, but those blocks typically constitute a small percentage of the data written.

@ironMann
Contributor Author

@ahrens
I've only managed to run a read test on a 10-HDD raid-z2 pool. The test consisted of grepping through a large directory tree of predominantly small files, as suggested by @behlendorf in his comment above. I've seen a 10% to 15% wall-time improvement, depending on the setup. Also, the %util of all drives stayed at 100% for the whole test (otherwise it was between 95% and 100%), indicating better load balancing.

I have not seen a noticeable difference for the write path, since the writes take much longer than the allocation and memcpy. This might be more relevant for non-rotational devices. Luckily, it is a simple enough change for the write path.

Lastly, I started refactoring vdev_raidz_io_done() and vdev_raidz_combrec() to ensure that a read failure on 'mirrored' devices is handled properly. Unfortunately, I had to push this down my priority queue for now; hopefully to be continued soon...

@behlendorf behlendorf modified the milestones: 0.8.0, 0.7.0 Jun 6, 2017
@ironMann ironMann mentioned this pull request Aug 8, 2017
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch 4 times, most recently from 44167c6 to b75f3ff on September 15, 2017 15:27
If an IO fits in a single block, then the corresponding raidz block consists of (nparity+1)
blocks. Additionally, all parity blocks are equal to the data block, effectively
providing mirroring. This is expected and desired behavior, since it preserves the
reliability criteria of the selected raidz level, and should enable better IOPS for
small metadata blocks, etc.

This can be exploited in the read path. Such blocks are usually metadata, and
selecting the least loaded child vdev among the (nparity + 1) candidates should yield more
IOPS. The higher the RAID-Z level, the more "mirror" devices there are to choose from.
The total number of child devices does not matter.

This optimization seems similar to the "RAIDZ Hybrid Allocator", but does not require
an on-disk data change.

Additionally:
- Heuristics from vdev_mirror 9f50093

- Use the vq_seek_offset position, set in vdev_queue_io_to_issue(). Remove the duplicate
vq_lastoffset field of struct vdev_queue.

- mirror: Fix (CID 147541) Macro ABS() compares unsigned to 0

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
@ironMann ironMann force-pushed the raidz_hybrid_kinda_proto1 branch from b75f3ff to e637627 on January 9, 2018 23:45
@codecov

codecov bot commented Jan 10, 2018

Codecov Report

Merging #5095 into master will decrease coverage by 3.28%.
The diff coverage is 92.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5095      +/-   ##
==========================================
- Coverage   75.26%   71.97%   -3.29%     
==========================================
  Files         296      278      -18     
  Lines       95516    94024    -1492     
==========================================
- Hits        71888    67673    -4215     
- Misses      23628    26351    +2723
Flag Coverage Δ
#kernel 70.1% <94.05%> (-4.49%) ⬇️
#user 65.77% <84.25%> (-1.69%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cba6fc6...e637627. Read the comment docs.

@behlendorf behlendorf changed the title [WIP,RFC] RAID-Z: Increase read IOPS of metadata blocks RAID-Z: Increase read IOPS of metadata blocks Jan 10, 2018
@behlendorf
Contributor

@ironMann if you're still interested in working on this proposed performance optimization, please go ahead and open a new PR whenever you're ready. In the meantime, I'm going to close this PR.

@behlendorf behlendorf closed this Jun 5, 2018
@thegreatgazoo

Just a note for those interested: the P, Q, and R blocks equal D0 when there is only one data block because the coefficients for D0 in the P/Q/R computation are all 1.
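
For reference, a schematic sketch of why that holds, using the usual Reed-Solomon style definitions over GF(2^8) with generator g (indexing shown loosely):

P = \sum_i D_i, \qquad Q = \sum_i g^{i} D_i, \qquad R = \sum_i \bigl(g^{2}\bigr)^{i} D_i

With only D_0 present, every coefficient is g^0 = 1, so P = Q = R = D_0.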

@jumbi77
Contributor

jumbi77 commented Sep 9, 2018

@thegreatgazoo Are you maybe able and willing to pick up and finish the PR? That would be really great. Sadly, I guess @ironMann isn't working on ZFS anymore.

But many thanks to both of you anyway; I really appreciate your contributions.

@jumbi77 jumbi77 mentioned this pull request Mar 21, 2021