Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating gang ABDs for Raidz optional IOs #12333

Merged
merged 1 commit into from
Nov 9, 2021

Conversation

bwatkinson
Copy link
Contributor

@bwatkinson bwatkinson commented Jul 7, 2021

In order to reduce contention on the vq_lock, optional no data blocks
for Raidz are put into gang ABDs. This allows for a reduction on the
number of IO issued down to the children VDEVs reducing contention on
the vq_lock when issuing IO for skip sectors.

Signed-off-by: Brian Atkinson batkinson@lanl.gov

When generating optional IO's for skip sectors with raidz writes they can be placed in a gang ABD along with
the original data. This allows for a single IO to be issued which reduces contention on the child VDEV's vq_lock
when they are placed in the VDEV queues.

Aggregation still takes place in vdev_queue_aggregate() for reads and writes using vq_{read/write}_offset_tree's.

Motivation and Context

Reduces contention on the vq_lock which is a lock than can come under heavy contention as it it is grabbed
for every IO in the VDEV queues.

Description

Before issuing writes down the VDEV children in vdev_raidz_io_start_write() check for the optional skip sectors is check
and placed in gang ABD's with any original data before calling zio_vdev_child_io(). Originally the optional skip sector IO's
were issued in a second loop. However, by only issuing down a single IO to the VDEV children we can reduce contention
on each of the children's vq_locks.

When measuring this with a raidz2 (10+2) the number of times the vq_lock is grabbed in vdev_queue_io() was
reduced by ~7% when writing 128, 16 GB files.

How Has This Been Tested?

Ran zloop.sh for 30 mins
Ran repeated tests with FIO and XDD.

CentOS 8.2
Kernel: 4.18.0-193.el8.x84_84
Server: Gigabyte R72-Z32-00
Single socket AMD EPYC Rome 7502 32c (Hyperthreading 64c)
128 GB RAM, 8x 16 GB 2933 MT/s DDR4

Ran a sweep with PR #10018 rebased off of master from 6-29-2021 with this commit applied on top. By using
O_DIRECT I could directly measure bandwidth improvements because the IO's are issued directly into the
VDEV queues. The following graph shows O_DIRECT with the initial vdev_raidz_io_start_write() (With Gang ABD)
and this commit applied (Without Gang ABD). On average there was around 400 MB/s gained from reducing
contention on the vq_lock. These tests were run using 12 Samsung PM1725a NVMe's in a raidz2 (10+2) VDEV,
recordsize=1M, 1 MB IO requests, and 2 TB of data was written to the ZPool evenly distributed between the
file counts.

zfs_dio_with_wo_gang_abd_mult_files_write

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf requested review from behlendorf and ahrens July 7, 2021 23:34
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from 8849c6d to 2a4025d Compare July 8, 2021 03:02
@amotin
Copy link
Member

amotin commented Jul 8, 2021

I am not very deep in raidz code, but it seems to me that previous code marked optional I/Os as such, and they could be skipped if there is nothing for them to aggregate with. Your code sends all writes as mandatory. In case of large block where every disk get some traffic it is probably not important and can be counted as the price of the optimization, but in case of very small block some disks may get padding writes that may be not needed at all. Am I wrong?

@bwatkinson
Copy link
Contributor Author

bwatkinson commented Jul 8, 2021

I am not very deep in raidz code, but it seems to me that previous code marked optional I/Os as such, and they could be skipped if there is nothing for them to aggregate with. Your code sends all writes as mandatory. In case of large block where every disk get some traffic it is probably not important and can be counted as the price of the optimization, but in case of very small block some disks may get padding writes that may be not needed at all. Am I wrong?

Yes, this is true. So my thought here was to follow the idea of dRAID where skip sectors are just added to the gang ABD. With the O_DIRECT PR #10018, I noticed dRAID was out performing raidz. Looking at flame graphs, raidz IO's were waiting on the children VDEV's vq_lock in vdev_queue_io(). These gang ABD's can still be aggregated in vdev_queue_aggregate() as well.

However, you make a good point and I went ahead and update the case of:
rc->rc_size == 0 which implies rc->rc_abd == NULL
With ZIO_FLAG_OPTIONAL. That now allows these gang ABDs to possibly be skipped as was done previously.

@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from 2a4025d to 848d07a Compare July 8, 2021 04:30
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jul 8, 2021
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch 2 times, most recently from 43ff258 to db51013 Compare July 8, 2021 17:40
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from db51013 to 097ad0a Compare July 8, 2021 22:25
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from 097ad0a to c4948fe Compare July 9, 2021 00:49
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 more nits and I'll be satisfied. ;)

module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from c4948fe to dccc659 Compare July 9, 2021 01:52
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from dccc659 to b1faf12 Compare July 14, 2021 00:25
module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from b1faf12 to d6d71d5 Compare July 14, 2021 02:35
@amotin
Copy link
Member

amotin commented Jul 14, 2021

@bwatkinson Thanks. But just to be sure, I've just noticed that vdev_raidz_child_done() is also used by draid, which also uses gangs during write. Have you checked that this change won't affect draid?

@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from d6d71d5 to f0723f0 Compare July 14, 2021 17:51
@bwatkinson
Copy link
Contributor Author

@bwatkinson Thanks. But just to be sure, I've just noticed that vdev_raidz_child_done() is also used by draid, which also uses gangs during write. Have you checked that this change won't affect draid?

Good point. I went ahead and updated this again. I created a vdev_draid_child_done() callback that does not check for or free the gang ABD.

@sempervictus
Copy link
Contributor

This seems pretty handy - @amotin has it as approved, but it doesn't carry the relevant tag yet. @behlendorf: are there any blockers/concerns for merging?

@mmaybee
Copy link
Contributor

mmaybee commented Aug 17, 2021

I would like to see a second review, by someone familiar with the RAIDZ and dRAID code paths (probably @behlendorf or me).

module/zfs/vdev_raidz.c Outdated Show resolved Hide resolved
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch 3 times, most recently from 4ac8918 to 2c09b66 Compare September 14, 2021 23:19
@bwatkinson
Copy link
Contributor Author

@ahrens would you be willing to review this as well? I wasn't sure how this might affect the raidz expansion code.

@behlendorf
Copy link
Contributor

@bwatkinson a rebase of this is probably in order to a fresh test run. @mmaybee since you're familiar with this part of the code can you review this.

@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from 7dc7156 to 1b0091f Compare November 5, 2021 22:32
@bwatkinson
Copy link
Contributor Author

@bwatkinson a rebase of this is probably in order to a fresh test run. @mmaybee since you're familiar with this part of the code can you review this.

@behlendorf I went ahead and did a rebase.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 5, 2021
Copy link
Contributor

@mmaybee mmaybee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I have a few nit comments.

Comment on lines 1642 to 1659
/*
* Generate optional I/O for single skip sector to improve
* aggregation contiguity.
*/
if (rc->rc_size == 0) {
ASSERT3P(rc->rc_abd, ==, NULL);
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, NULL, 1ULL << ashift,
zio->io_type, zio->io_priority,
ZIO_FLAG_NODATA | ZIO_FLAG_OPTIONAL, NULL,
NULL));
} else {
ASSERT3P(rc->rc_abd, !=, NULL);
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_abd,
abd_get_size(rc->rc_abd), zio->io_type,
zio->io_priority, 0, vdev_raidz_child_done, rc));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ordering here is a little confusing. The primary purpose of this loop is to issue the IOs for the "normal" raid children, and then there is a "special case" if the child is a skip sector. Could we write this like:

    if (rc->rc_size > 0) {
        ...assert and nowait the child io...
    } else {
        /* Generate optional write for skip sector */
        ...assert and nowait child io...
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I went ahead and updated this.

ASSERT3U(rm->rm_nrows, ==, 1);

/*
* We will pad any parity columns with additional space to account for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop the "We will": Pad any parity columns with additional space to account for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this.

* a single I/O that contains the data and skip sector.
*
* It is important to make sure that rc_size is not updated
* even though we are adding a skip sector to the ABD. With
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: With -> When

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this.

* It is important to make sure that rc_size is not updated
* even though we are adding a skip sector to the ABD. With
* calculating the parity in vdev_raidz_generate_parity_row()
* the rc_size is used for iterating through the ABD's. We
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for iterating -> to iterate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this.

In order to reduce contention on the vq_lock, optional skip sectors
for Raidz writes can be placed into a single IO request. This is done by
padding out the linear ABD for a parity column to contain the skip
sector and by creating gang ABD to contain the data and skip sector for
data columns.

The vdev_raidz_map_alloc() function now contains specific functions for
both reads and write to allocate the ABD's that will be issued down to
the VDEV chldren.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
@bwatkinson bwatkinson force-pushed the raidz_nodata_gang_abd branch from 1b0091f to de3ed2d Compare November 9, 2021 00:30
@mmaybee mmaybee merged commit 345196b into openzfs:master Nov 9, 2021
@jumbi77 jumbi77 mentioned this pull request Feb 7, 2022
13 tasks
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 22, 2022
In order to reduce contention on the vq_lock, optional skip sectors
for Raidz writes can be placed into a single IO request. This is done by
padding out the linear ABD for a parity column to contain the skip
sector and by creating gang ABD to contain the data and skip sector for
data columns.

The vdev_raidz_map_alloc() function now contains specific functions for
both reads and write to allocate the ABD's that will be issued down to
the VDEV chldren.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-By: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes openzfs#12333
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 22, 2022
In order to reduce contention on the vq_lock, optional skip sectors
for Raidz writes can be placed into a single IO request. This is done by
padding out the linear ABD for a parity column to contain the skip
sector and by creating gang ABD to contain the data and skip sector for
data columns.

The vdev_raidz_map_alloc() function now contains specific functions for
both reads and write to allocate the ABD's that will be issued down to
the VDEV chldren.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-By: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes openzfs#12333
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Oct 23, 2022
In order to reduce contention on the vq_lock, optional skip sectors
for Raidz writes can be placed into a single IO request. This is done by
padding out the linear ABD for a parity column to contain the skip
sector and by creating gang ABD to contain the data and skip sector for
data columns.

The vdev_raidz_map_alloc() function now contains specific functions for
both reads and write to allocate the ABD's that will be issued down to
the VDEV chldren.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-By: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes openzfs#12333
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants