RAID-Z: Increase read IOPS of metadata blocks #5095
Conversation
Force-pushed from 024c889 to d16eace
module/zfs/vdev_raidz.c (outdated diff)

    /* Non-rotating media load calculation configuration. */
    extern int zfs_vdev_mirror_non_rotating_inc;
    extern int zfs_vdev_mirror_non_rotating_seek_inc;
The more appropriate style is to put the extern variable declarations in a (new, if necessary) header included by both the mirror and raidz code. That way the compiler will complain if anybody does anything funny like changing the variable type in only one file.
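For illustration only, a minimal sketch of what such a shared header could look like; the file name and include guard below are assumptions, not necessarily what was eventually merged:

```c
/* include/sys/vdev_mirror_tunables.h -- hypothetical shared header (sketch) */
#ifndef	_SYS_VDEV_MIRROR_TUNABLES_H
#define	_SYS_VDEV_MIRROR_TUNABLES_H

#ifdef	__cplusplus
extern "C" {
#endif

/*
 * Non-rotating media load calculation tunables.  Defined once in
 * vdev_mirror.c; declared here so vdev_raidz.c picks up the same
 * types and the compiler can flag any mismatch.
 */
extern int zfs_vdev_mirror_non_rotating_inc;
extern int zfs_vdev_mirror_non_rotating_seek_inc;

#ifdef	__cplusplus
}
#endif

#endif	/* _SYS_VDEV_MIRROR_TUNABLES_H */
```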
Yeah, this is definitely not the right place for these; I wanted to keep the changes contained in the first iteration. These tunables are somewhat out of place in the raidz code, but since this is effectively a mirroring situation, I think they could be used here as well. The documentation would probably need updating to reflect this.
While I hate the idea of even more tunables, it may make the most sense to add a new set of zfs_vdev_raidz_* options. I can imagine a system with two pools, one mirrored and one raidz, that need to be tuned differently; for example, an SSD mirrored root pool and an HDD-based raidz primary store.
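As an illustration, separate raidz tunables could be exposed as Linux module parameters in the style of the existing zfs_vdev_mirror_* ones; the names and default values below are assumptions, not code from this PR:

```c
/* Hypothetical raidz-specific load-balancing tunables (sketch only). */
int zfs_vdev_raidz_non_rotating_inc = 0;
int zfs_vdev_raidz_non_rotating_seek_inc = 1;
int zfs_vdev_raidz_rotating_inc = 0;
int zfs_vdev_raidz_rotating_seek_inc = 5;

#if defined(_KERNEL) && defined(__linux__)
#include <linux/module.h>

module_param(zfs_vdev_raidz_non_rotating_inc, int, 0644);
MODULE_PARM_DESC(zfs_vdev_raidz_non_rotating_inc,
	"Non-rotating media load increment for non-seeking raidz I/Os");

module_param(zfs_vdev_raidz_rotating_seek_inc, int, 0644);
MODULE_PARM_DESC(zfs_vdev_raidz_rotating_seek_inc,
	"Rotating media load increment for seeking raidz I/Os");
#endif
```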
@behlendorf Or, with the hybrid pool work from @don-brady, there may be both SSD VDEVs and HDD VDEVs in the same pool.
I'll wait for performance data to see which tunables are relevant here. We might be fine with just looking at the vdev queue size; penalizing rotational vdevs by some factor might be enough.
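A rough, purely illustrative sketch of the queue-size-plus-rotational-penalty heuristic being floated here; vdev_queue_depth() is a hypothetical accessor and the penalty constant is made up:

```c
/*
 * Estimate how busy a raidz child is: outstanding I/O count plus a
 * fixed penalty for rotational media.  Sketch only.
 */
static int
raidz_child_load(vdev_t *cvd)
{
	/* hypothetical accessor for the number of queued/active I/Os */
	int load = vdev_queue_depth(cvd);

	if (!cvd->vdev_nonrot)
		load += 5;	/* arbitrary penalty for rotational vdevs */

	return (load);
}
```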
Force-pushed from d16eace to 340c921
This is awesome, I like how you've generalized the existing mirror logic into …. Do you have any early performance results which show that this is actually a performance win? One potentially good test case would be to do a ….

The Amazon SIMD test failure was unrelated, resubmitting the test run.
@ironMann I don't think this will replace the work that @don-brady is doing on hybrid pools, because that will allow both SSD and HDD VDEVs in the same pool, but it is definitely interesting regardless. The balancing of IO load is something that can definitely help in multiple ways: not only for the single IO request for the mirrored block, but also by reducing the load on other devices that might be a bit slower, so they can do something else instead of having more work piled on them.
@behlendorf This is still very much a WIP. I wanted to check whether we can exploit this mirroring without changing the allocator. I was only able to confirm that ….
@adilger This is only intended to exploit one specific case where metadata might be mirrored in a raid-z pool. In theory this looks good, but I still have to get real performance measurements in. Hopefully a good deal of metadata-heavy workloads would see a benefit from this IO balancing. Pushing the balancing to extremes, with much faster vectorized raid-z operations, it would be possible to load-balance every read operation by reading the parity vdevs instead of the data from backlogged vdevs. The missing-data reconstruction will add some latency (the algorithm can be made incremental to reduce it), but not as much as waiting for an overloaded vdev. Just an idea.
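To make the parity-read idea concrete, here is a tiny user-space illustration (not ZFS code) of rebuilding one raidz1 data column from the P parity and the remaining data columns, since P is simply the XOR of all data columns:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Rebuild data column `skip` from the P parity plus the other data
 * columns: P = D0 ^ D1 ^ ... ^ Dn-1, so Dk = P ^ (XOR of the rest).
 * This is the extra work a parity-based read would trade for avoiding
 * a backlogged data vdev.
 */
static void
rebuild_col(uint8_t *dst, const uint8_t *p, uint8_t *const *cols,
    size_t ncols, size_t skip, size_t len)
{
	memcpy(dst, p, len);
	for (size_t c = 0; c < ncols; c++) {
		if (c == skip)
			continue;
		for (size_t i = 0; i < len; i++)
			dst[i] ^= cols[c][i];
	}
}
```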
Force-pushed from 340c921 to 1f88023
I agree it's a good idea to proactively use redundancy for load balancing, but I want to see some data. The reason why raidz[2-3] always stores parity on the first 2-3 drives is that parity would still be spread over all drives, since blocks can begin at any child drive; see the big comment block in vdev_raidz_map_alloc(). That comment actually argued that there was not sufficient proof that it was a good idea to alternate parity drives for raidz1. On the other hand, we might exploit it to optimize the write path, for a write zio no larger than 1 << ashift: ….
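A purely illustrative sketch of the single-sector write case being discussed here: when the zio payload is no larger than 1 << ashift there is only one data column, so every parity column carries the same bytes and parity generation can degenerate to a copy. The function and the exact field usage are assumptions, not this PR's code:

```c
/*
 * Hypothetical fast path for parity generation when the map has a
 * single data column: each parity column is just a copy of the data.
 */
static boolean_t
raidz_generate_parity_single(raidz_map_t *rm)
{
	uint64_t c;

	/* Only valid when there is exactly one data column. */
	if (rm->rm_cols != rm->rm_firstdatacol + 1)
		return (B_FALSE);

	for (c = 0; c < rm->rm_firstdatacol; c++) {
		/* parity column c gets the same bytes as the data column */
		bcopy(rm->rm_col[rm->rm_firstdatacol].rc_data,
		    rm->rm_col[c].rc_data, rm->rm_col[c].rc_size);
	}
	return (B_TRUE);
}
```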
@thegreatgazoo You're right about allocations; I already started optimizing this in a separate patch. But the worst case of unneeded parity buffers is the normal read (without failed devices).
Force-pushed from 8f10727 to 3171ee2
I agree with what @thegreatgazoo said. Any improvement in IOPS is based on there being uneven load on the devices in the RAIDZ group, and …. Given a pool with unevenly sized blocks, the blocks will tend to be spread over the devices in an uneven manner, and thus under moderate load all devices will be roughly evenly busy. At least, that's my theory; gathering data that shows otherwise would be enlightening :-)

For the write path, were you able to measure the performance improvement (e.g. CPU usage) of not having to compute the parity? That seems like it could be a good win for single-sector blocks, but those blocks typically constitute a small percentage of the data written.
@ahrens I have not seen a noticeable difference for the write path, since writes take much longer than the allocation and memcpy. This might be more relevant for non-rotational devices; luckily, it is a simple enough change for the write path. Lastly, I went into refactoring of ….
Force-pushed from 44167c6 to b75f3ff
If an IO fits in a single block, then the corresponding raidz block consists of (nparity + 1) blocks. Additionally, all parity blocks are equal to the data block, effectively providing mirroring. This is expected and desired behavior, since it preserves the reliability criteria of the selected raidz level, and should enable better IOPS for small metadata blocks, etc.

This can be exploited in the read path. Such blocks are usually metadata, and selecting the least loaded child vdev from the (nparity + 1) should yield more IOPS. The higher the RAID-Z level, the more "mirror" devices there are to choose from; the total number of child devices does not matter. This optimization seems similar to the "RAIDZ Hybrid Allocator", but does not require an on-disk data change.

Additionally:
- Heuristics from vdev_mirror 9f50093
- Use the vq_seek_offset position set in vdev_queue_io_to_issue(); remove the duplicate vq_lastoffset field of struct vdev_queue.
- mirror: Fix (CID 147541) Macro ABS() compares unsigned to 0

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
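A simplified, illustrative sketch of the read-path selection described above: when the map degenerates to (nparity + 1) identical columns, issue the read against whichever child looks least loaded. raidz_child_load() is the hypothetical load estimate sketched earlier; this is not the PR's actual code:

```c
/*
 * Pick the column to read when all (nparity + 1) columns hold the
 * same data.  Falls back to the data column otherwise.
 */
static uint64_t
raidz_pick_mirrored_col(raidz_map_t *rm, vdev_t *vd)
{
	uint64_t c, best = rm->rm_firstdatacol;
	int best_load = INT_MAX;

	/* Only applies when there is exactly one data column. */
	if (rm->rm_cols != rm->rm_firstdatacol + 1)
		return (rm->rm_firstdatacol);

	for (c = 0; c < rm->rm_cols; c++) {
		vdev_t *cvd = vd->vdev_child[rm->rm_col[c].rc_devidx];
		int load = raidz_child_load(cvd);	/* hypothetical */

		if (load < best_load) {
			best_load = load;
			best = c;
		}
	}
	return (best);
}
```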
Force-pushed from b75f3ff to e637627
Codecov Report
@@ Coverage Diff @@
## master #5095 +/- ##
==========================================
- Coverage 75.26% 71.97% -3.29%
==========================================
Files 296 278 -18
Lines 95516 94024 -1492
==========================================
- Hits 71888 67673 -4215
- Misses 23628 26351 +2723
Continue to review full report at Codecov.
@ironMann if you're still interested in working on this proposed performance optimization, please go ahead and open a new PR whenever you're ready. In the meantime, I'm going to close this PR.
Just a note for those interested: the P, Q, and R blocks equal D0 when there is only one data block because the coefficients for D0 in the P/Q/R computation are all 1.
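A tiny user-space illustration of that point, following the parity formulas in the vdev_raidz.c comment block (P is plain XOR; Q is evaluated Horner-style with GF(2^8) multiply-by-2, so with a single column D0 its coefficient is 2^0 = 1):

```c
#include <stdint.h>
#include <stdio.h>

/* GF(2^8) multiply-by-2 with the RAID-Z polynomial 0x11d. */
static uint8_t
gf_mul2(uint8_t a)
{
	return ((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
}

int
main(void)
{
	uint8_t d[] = { 0xab };		/* a single data column, D0 */
	uint8_t p = 0, q = 0;
	int i, n = sizeof (d);

	/* P = D0 + ... + Dn-1;  Q = ((D0*2 + D1)*2 + ...)*2 + Dn-1 */
	for (i = 0; i < n; i++) {
		p ^= d[i];
		q = gf_mul2(q) ^ d[i];
	}

	/* With n == 1 nothing ever gets multiplied, so P == Q == D0. */
	printf("D0=%02x P=%02x Q=%02x\n", d[0], p, q);
	return (0);
}
```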
@thegreatgazoo Are you maybe able and willing to pick up and finish the PR? That would be really great. Sadly, I guess @ironMann isn't working on ZFS anymore, but many thanks to both of you anyway; I really appreciate your contributions.
If an IO fits in a single block, then the corresponding raidz block consists of (nparity + 1)
blocks. Additionally, all parity blocks are equal to the data block, effectively
providing mirroring. This is expected and desired behavior, since it preserves the
reliability criteria of the selected raid-z level, and should enable better IOPS for
small metadata blocks.
This can be exploited in the read path. Such blocks are usually metadata, and
selecting the least loaded child vdev from the (nparity + 1) should yield more
IOPS. The higher the RAID-Z level, the more "mirror" devices there are to choose from.
The total number of child devices does not matter.
This optimization seems similar to the "RAIDZ Hybrid Allocator", but does not require
an on-disk data change.
Additionally: track the offset position for all reads/writes, hopefully improving the
heuristics for both mirror and raidz.
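As an illustration of how that tracked offset might feed the heuristic (the vq_seek_offset field follows the commit message; the helper itself and the penalty constant are a made-up sketch):

```c
/*
 * Load increment for a prospective read at `offset` on child `cvd`,
 * charging a seek penalty on rotational media when the request is not
 * contiguous with the last issued I/O.
 */
static int
raidz_child_seek_inc(vdev_t *cvd, uint64_t offset)
{
	/* vq_seek_offset: last issued offset, per the commit message */
	uint64_t last = cvd->vdev_queue.vq_seek_offset;

	if (cvd->vdev_nonrot)
		return (zfs_vdev_mirror_non_rotating_seek_inc);

	/* rotational media: only charge extra when a real seek is needed */
	return (offset == last ? 0 : 5 /* arbitrary seek penalty */);
}
```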