Creating gang ABDs for Raidz optional IOs #12333
Conversation
Force-pushed 8849c6d to 2a4025d
I am not very deep in the raidz code, but it seems to me that the previous code marked optional I/Os as such, and they could be skipped if there was nothing for them to aggregate with. Your code sends all writes as mandatory. In the case of a large block, where every disk gets some traffic, it is probably not important and can be counted as the price of the optimization, but in the case of a very small block some disks may get padding writes that may not be needed at all. Am I wrong?
Yes, this is true. My thought here was to follow the idea of dRAID, where skip sectors are just added to the gang ABD. With the O_DIRECT PR #10018, I noticed dRAID was outperforming raidz. Looking at flame graphs, raidz IO's were waiting on the child VDEVs. However, you make a good point, and I went ahead and updated the case of:
Force-pushed 2a4025d to 848d07a
Force-pushed 43ff258 to db51013
Force-pushed db51013 to 097ad0a
Force-pushed 097ad0a to c4948fe
3 more nits and I'll be satisfied. ;)
Force-pushed c4948fe to dccc659
Force-pushed dccc659 to b1faf12
Force-pushed b1faf12 to d6d71d5
@bwatkinson Thanks. But just to be sure: I've just noticed that vdev_raidz_child_done() is also used by draid, which also uses gangs during write. Have you checked that this change won't affect draid?
Force-pushed d6d71d5 to f0723f0
Good point. I went ahead and updated this again. I created a
This seems pretty handy - @amotin has it as approved, but it doesn't carry the relevant tag yet. @behlendorf: are there any blockers/concerns for merging?
I would like to see a second review, by someone familiar with the RAIDZ and dRAID code paths (probably @behlendorf or me).
Force-pushed 4ac8918 to 2c09b66
Force-pushed 2c09b66 to 50a4fea
Force-pushed 50a4fea to 7dc7156
@ahrens would you be willing to review this as well? I wasn't sure how this might affect the raidz expansion code.
@bwatkinson a rebase of this is probably in order to get a fresh test run. @mmaybee since you're familiar with this part of the code, can you review this?
Force-pushed 7dc7156 to 1b0091f
@behlendorf I went ahead and did a rebase.
Looks good. I have a few nit comments.
module/zfs/vdev_raidz.c
Outdated
/*
 * Generate optional I/O for single skip sector to improve
 * aggregation contiguity.
 */
if (rc->rc_size == 0) {
	ASSERT3P(rc->rc_abd, ==, NULL);
	zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
	    rc->rc_offset, NULL, 1ULL << ashift,
	    zio->io_type, zio->io_priority,
	    ZIO_FLAG_NODATA | ZIO_FLAG_OPTIONAL, NULL,
	    NULL));
} else {
	ASSERT3P(rc->rc_abd, !=, NULL);
	zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
	    rc->rc_offset, rc->rc_abd,
	    abd_get_size(rc->rc_abd), zio->io_type,
	    zio->io_priority, 0, vdev_raidz_child_done, rc));
}
I think the ordering here is a little confusing. The primary purpose of this loop is to issue the IOs for the "normal" raid children, and then there is a "special case" if the child is a skip sector. Could we write this like:
if (rc->rc_size > 0) {
...assert and nowait the child io...
} else {
/* Generate optional write for skip sector */
...assert and nowait child io...
}
I agree. I went ahead and updated this.
module/zfs/vdev_raidz.c
Outdated
ASSERT3U(rm->rm_nrows, ==, 1);

/*
 * We will pad any parity columns with additional space to account for
nit: drop the "We will": Pad any parity columns with additional space to account for
I updated this.
module/zfs/vdev_raidz.c
Outdated
 * a single I/O that contains the data and skip sector.
 *
 * It is important to make sure that rc_size is not updated
 * even though we are adding a skip sector to the ABD. With
nit: With -> When
I updated this.
module/zfs/vdev_raidz.c
Outdated
 * It is important to make sure that rc_size is not updated
 * even though we are adding a skip sector to the ABD. With
 * calculating the parity in vdev_raidz_generate_parity_row()
 * the rc_size is used for iterating through the ABD's. We
nit: for iterating -> to iterate
I updated this.
In order to reduce contention on the vq_lock, optional skip sectors for Raidz writes can be placed into a single IO request. This is done by padding out the linear ABD for a parity column to contain the skip sector and by creating a gang ABD to contain the data and skip sector for data columns. The vdev_raidz_map_alloc() function now contains specific functions for both reads and writes to allocate the ABD's that will be issued down to the VDEV children. Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Force-pushed 1b0091f to de3ed2d
In order to reduce contention on the vq_lock, optional skip sectors for Raidz writes can be placed into a single IO request. This is done by padding out the linear ABD for a parity column to contain the skip sector and by creating a gang ABD to contain the data and skip sector for data columns. The vdev_raidz_map_alloc() function now contains specific functions for both reads and writes to allocate the ABD's that will be issued down to the VDEV children. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes openzfs#12333
In order to reduce contention on the vq_lock, optional no-data blocks
for Raidz are put into gang ABDs. This allows for a reduction in the
number of IOs issued down to the children VDEVs, reducing contention on
the vq_lock when issuing IO for skip sectors.
Signed-off-by: Brian Atkinson batkinson@lanl.gov
When generating optional IO's for skip sectors with raidz writes, they can be placed in a gang ABD along with the original data. This allows a single IO to be issued, which reduces contention on the child VDEV's vq_lock when they are placed in the VDEV queues. Aggregation still takes place in vdev_queue_aggregate() for reads and writes using the vq_{read/write}_offset_tree's.

Motivation and Context

Reduces contention on the vq_lock, a lock that can come under heavy contention as it is grabbed for every IO in the VDEV queues.
Description

Before issuing writes down to the VDEV children in vdev_raidz_io_start_write(), a check for optional skip sectors is made, and they are placed in gang ABD's with any original data before calling zio_vdev_child_io(). Originally the optional skip sector IO's were issued in a second loop. However, by issuing only a single IO down to the VDEV children we can reduce contention on each of the children's vq_locks. When measuring this with a raidz2 (10+2), the number of times the vq_lock is grabbed in vdev_queue_io() was reduced by ~7% when writing 128 16 GB files.
How Has This Been Tested?

Ran zloop.sh for 30 mins.
Ran repeated tests with FIO and XDD.
CentOS 8.2
Kernel: 4.18.0-193.el8.x86_64
Server: Gigabyte R72-Z32-00
Single socket AMD EPYC Rome 7502 32c (Hyperthreading 64c)
128 GB RAM, 8x 16 GB 2933 MT/s DDR4
Ran a sweep with PR #10018 rebased off of master from 6-29-2021 with this commit applied on top. By using O_DIRECT I could directly measure bandwidth improvements, because the IO's are issued directly into the VDEV queues. The following graph shows O_DIRECT with the initial vdev_raidz_io_start_write() (With Gang ABD) and this commit applied (Without Gang ABD). On average, around 400 MB/s was gained from reducing contention on the vq_lock. These tests were run using 12 Samsung PM1725a NVMe's in a raidz2 (10+2) VDEV, recordsize=1M, 1 MB IO requests, and 2 TB of data written to the ZPool, evenly distributed between the file counts.