Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ZVOL queue behavior #554

Closed
wants to merge 1 commit into from
Closed

Conversation

dechamps
Copy link
Contributor

@dechamps dechamps commented Feb 3, 2012

The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering.

Detailed rationale:

  • max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide.
  • max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide.
  • physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue Sequential file rewrite outside of block boundaries is dead slow #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow.
  • The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.

Note: #389 was the original pull request for this, but I made a new one since I seriously messed up the branch by doing unwanted merges. I'll try to be more careful this time.

The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue openzfs#361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.
@behlendorf
Copy link
Contributor

@etienne-dechamps-o Please review this behlendorf/zfs@ebb42f9 slightly reworked commit. It basically adds the needed autoconf checks and wrappers to map your tunings to the legacy versions of those function if they exist. If the concept being tuned doesn't exist in the older kernel that tuning is simply skipped.

If your happy with these slightly reworked zvol stack of changes I'll merge them in to master. They pass all my tests for 2.6.26 - 3.2 kernels. It'll be nice to get these lchanges merged sorry it took so long.

Also, wasn't there originally another tweak to change the default number of zvol threads? I didn't see that as a pull request, if you have another change for that which improves things I'm happy to take it. You've I'm sure done the most tuning for zvols so far.

@behlendorf
Copy link
Contributor

Merged in to master as commit 34037af

@behlendorf behlendorf closed this Feb 8, 2012
@dechamps
Copy link
Contributor Author

dechamps commented Feb 8, 2012

@behlendorf

Also, wasn't there originally another tweak to change the default number of zvol threads? I didn't see that as a pull request, if you have another change for that which improves things I'm happy to take it. You've I'm sure done the most tuning for zvols so far.

You're talking about #392. I could write a patch to set zvol_threads to 32 instead of the number of CPUs. It has a net positive effect on all systems I tested.

@behlendorf
Copy link
Contributor

Yes, that's the one. Sorry, I lost track of it. If you could open a new pull request with the change to 32 and a nice explanation of why (such as what's in #392) I'll get it in the tree. As you say in the bug, I don't see a downside to doing this even if 32 isn't perfectly optimal for all systems.

@dechamps
Copy link
Contributor Author

dechamps commented Feb 8, 2012

@behlendorf Done in #567.

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request May 21, 2018
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.

This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes openzfs#4692
Closes openzfs#554
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023
…aster

Merge remote-tracking branch '6.0/stage' into 'master'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants