-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve ZVOL queue behavior #554
Conversation
The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering. Detailed rationale: - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide. - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide. - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue openzfs#361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow. - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.
@etienne-dechamps-o Please review this behlendorf/zfs@ebb42f9 slightly reworked commit. It basically adds the needed autoconf checks and wrappers to map your tunings to the legacy versions of those function if they exist. If the concept being tuned doesn't exist in the older kernel that tuning is simply skipped. If your happy with these slightly reworked zvol stack of changes I'll merge them in to master. They pass all my tests for 2.6.26 - 3.2 kernels. It'll be nice to get these lchanges merged sorry it took so long. Also, wasn't there originally another tweak to change the default number of zvol threads? I didn't see that as a pull request, if you have another change for that which improves things I'm happy to take it. You've I'm sure done the most tuning for zvols so far. |
Merged in to master as commit 34037af |
You're talking about #392. I could write a patch to set |
Yes, that's the one. Sorry, I lost track of it. If you could open a new pull request with the change to 32 and a nice explanation of why (such as what's in #392) I'll get it in the tree. As you say in the bug, I don't see a downside to doing this even if 32 isn't perfectly optimal for all systems. |
@behlendorf Done in #567. |
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then does rw_enter(RW_READER) if it fails. This violate the assumption that rw_tryupgrade should be atomic and could cause extra contention or even lock inversion. This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we use cmpxchg on rwsem->count to change the value from single reader to single writer. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes openzfs#4692 Closes openzfs#554
…aster Merge remote-tracking branch '6.0/stage' into 'master'
The Linux block device queue subsystem exposes a number of configurable settings described in Linux
block/blk-settings.c
. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering.Detailed rationale:
max_hw_sectors
is set to unlimited (UINT_MAX
).zvol_write()
is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide.max_segments
andmax_segment_size
are set to unlimited.zvol_write()
will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide.physical_block_size
andio_opt
are set to the ZVOL's block size. This has the potential to somewhat alleviate issue Sequential file rewrite outside of block boundaries is dead slow #361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow.NONROT
flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.Note: #389 was the original pull request for this, but I made a new one since I seriously messed up the branch by doing unwanted merges. I'll try to be more careful this time.