SEEK_HOLE should not block on txg_wait_synced() #5962

dbavatar · 2017-04-03T21:08:36Z

Force flushing of txgs can be painfully slow when competing for disk I/O,
since this is a process meant to execute asynchronously. Optimize this path
by allowing data/hole seeking if the file is clean, and fall back to the old
logic if it is dirty. This is a compromise compared to disabling the feature entirely.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>

Motivation and Context

#4306

How Has This Been Tested?

Create bar.log larger than 32 kB by appending.
Generate disk I/O with any tool at a queue depth of at least 1 on all pool members. I used 1 outstanding random I/O per disk.
Test the speed of "touch bar.log; time strace grep foo bar.log". When TXGs must be synced, it takes seconds to minutes, hung up at "lseek(3, 32768, SEEK_HOLE)". With the fix it takes a few milliseconds (the normal time to grep).
The fix should not break correctness; however, it falls back to not providing hole data when the file is dirty. This should satisfy everyone's use cases.
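The stall described in the test steps can also be measured without strace. The following is a minimal sketch, assuming Linux; the helper name `timed_seek_hole` is hypothetical and is not part of this PR. It opens a file read-only, issues the same kind of `lseek(..., SEEK_HOLE)` call that grep makes, and reports how long that single call took.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Linux value, used only if the libc headers did not expose SEEK_HOLE. */
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: open `path`, seek to the first hole at or after
 * `start`, and store the wall-clock duration of the lseek() call in
 * `*elapsed_sec`. Returns the offset from lseek(), or -1 with errno set.
 */
static off_t
timed_seek_hole(const char *path, off_t start, double *elapsed_sec)
{
	struct timespec t0, t1;
	off_t pos;
	int fd, saved_errno;

	assert(path != NULL && elapsed_sec != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pos = lseek(fd, start, SEEK_HOLE);
	saved_errno = errno;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*elapsed_sec = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

	(void) close(fd);
	errno = saved_errno;	/* preserve the lseek() error, if any */
	return (pos);
}
```

Calling this on bar.log right after an append, while the pool is busy, should show the difference between the old and new behavior as a change in `*elapsed_sec`.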

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.
Change has been approved by a ZFS on Linux member.

mention-bot · 2017-04-03T21:08:40Z

@dbavatar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @ahrens and @tuxoko to be potential reviewers.

behlendorf · 2017-04-04T23:31:30Z

module/zfs/zfs_vnops.c

@@ -278,6 +278,10 @@ zfs_holey_common(struct inode *ip, int cmd, loff_t *off)
 	if (error == ESRCH)
 		return (SET_ERROR(ENXIO));

+	/* file was dirty, so fall back to using file_sz logic */
+	if (error == EBUSY)


Rather than suppress this error I think it would be better to return immediately and allow it to percolate up to zpl_llseek() and then handle it there.

As I said in the comment thread and I am adding here for completeness, I disagree. The correct place is in zfs_vnops.c because there is no zpl_* analogue on other platforms. We'd risk introducing inconsistent behavior between platforms when the patch is adopted if someone misses the key change in the zpl_ function. Even if they do, they will need to put it here anyway and then we will need to do another patch to bring the platforms in sync.

Yes, upon further reflection I'm OK with that and withdraw my objection. The existing patch leaves this check where it is.

ryao · 2017-04-08T16:25:16Z

If the txg sync takes minutes, something is wrong that needs to be addressed. Ideally, this should be bounded by 5 seconds, which is still bad, but on a lower order of magnitude.

I am not sure if I like this change because userland software might depend on the stricter behavior. Holes are a form of data after all. I suggest that we make this a module tunable instead of an unconditional change. That way we can introduce it in the next release and get feedback on the change while we decide whether to make it the default behavior.

@behlendorf The things we pass to zpl_llseek are supposed to be the return codes that the VFS can emit as zpl_ is a wrapper around the Illumos functions. Passing EBUSY to it will set a trap for other OpenZFS platforms when adopting this code where it would be easy to introduce an EBUSY return value into lseek. Various documentation does not permit lseek to emit EBUSY, so that would cause divergence in behavior and potentially confuse userland software:

http://man7.org/linux/man-pages/man2/lseek.2.html
https://docs.oracle.com/cd/E26502_01/html/E29032/lseek-2.html
https://illumos.org/man/2/lseek

dbavatar · 2017-04-08T18:36:51Z

> If the txg sync takes minutes, something is wrong that needs to be addressed. Ideally, this should be bounded by 5 seconds, which is still bad, but on a lower order of magnitude.

No, it does not mean anything is wrong. I investigated this for a while. Even if ZFS owned entire block devices and had complete control over block scheduling, this is not universally true, and is definitely not true in my workload. ZFS asynchronous writeback must compete with other disk I/O. I can change the ZFS I/O ramp and make it go reliably faster, but it is not down to the level we want (microseconds). If you look at these parameters, it's quite clear that even with ZFS owning all block scheduling, TXG writeback is still deprioritized by default. This is a good design decision; it prevents extra latency from the writeback.

> I am not sure if I like this change because userland software might depend on the stricter behavior. Holes are a form of data after all. I suggest that we make this a module tunable instead of an unconditional change. That way we can introduce it in the next release and get feedback on the change while we decide whether to make it the default behavior.

It should not negatively impact userland. In the dirty case the client would have had to wait for the txg_sync to complete anyway. I think the original code only helped for a very specific userland workload, but hurt everyone else. The original patch is an attempt to enable an accelerated path that didn't even exist in Linux a while ago. But we can make it tunable anyway.

> @behlendorf The things we pass to zpl_llseek are supposed to be the return codes that the VFS can emit as zpl_ is a wrapper around the Illumos functions. Passing EBUSY to it will set a trap for other OpenZFS platforms when adopting this code where it would be easy to introduce an EBUSY return value into lseek. Various documentation does not permit lseek to emit EBUSY, so that would cause divergence in behavior and potentially confuse userland software:

I don't believe that's what he was suggesting. EBUSY will not be returned to the user here. There is no functional change suggested, only code clarity. I am traveling at the moment, but I plan to address this when I get back.

ryao · 2017-04-08T18:58:40Z

@dbavatar If the txg commit takes minutes, dirty data can accumulate, leading to the throttle. It is not a great situation and it means that the storage is overworked. I agree that deprioritizing txg sync writeback is a good decision.

As for my remark to @behlendorf, the issue is how code sharing works between various platforms. Moving this into the zpl_* function could cause this to change semantics when ported to Illumos, which would be bad. Here, the zfs_* function is the Illumos VFS function and we are wrapping it.

behlendorf · 2017-04-09T18:29:59Z

> I am not sure if I like this change because userland software might depend on the stricter behavior.

There's a strong case to be made that any userland software which does depend on this behavior is broken. As long as we always return the correct data for a hole, which we do, how those holes are managed and reported is a filesystem specific detail. That said, I'm OK with adding a module option to disable this optimization.

Regarding code sharing, that's a good point. The other ZFS implementations don't have a zpl_* layer of functions; they register the zfs_* functions directly with their VFS. If they were to adopt this change, they might see this EBUSY. Although, presumably this has never been an issue for other platforms since their system tools don't look for holes. I'm OK leaving the error handling where it is as long as the offset is updated correctly.

dinatale2

This optimization looks fine to me. As @behlendorf suggested, a module option to disable the optimization could be added just in case the old behavior is necessary.

ryao · 2017-04-11T22:37:16Z

@behlendorf A module option to disable it sounds good to me. After mulling it over some more, I think software that relies on this behavior is broken, but adding an option to re-enable it will make things easier for that software to transition.

behlendorf · 2017-04-13T17:49:32Z

@dbavatar thanks for adding the module option. I've merged this to master.

behlendorf requested a review from don-brady April 3, 2017 21:25

dbavatar force-pushed the dbavatar/lseek branch 4 times, most recently from 87a3d13 to 6773f2f Compare April 4, 2017 19:32

behlendorf requested changes Apr 4, 2017

dinatale2 approved these changes Apr 11, 2017

dbavatar force-pushed the dbavatar/lseek branch from 6773f2f to 89a7851 Compare April 12, 2017 18:56

Add zfs_dmu_offset_next_sync

01de657

Makes previous optimization optional, in case there is a use case that breaks.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>

dbavatar force-pushed the dbavatar/lseek branch from 89a7851 to 01de657 Compare April 12, 2017 19:04

behlendorf approved these changes Apr 12, 2017

dinatale2 approved these changes Apr 13, 2017

gmelikov approved these changes Apr 13, 2017

behlendorf closed this in 66aca24 Apr 13, 2017
