
SEEK_HOLE should not block on txg_wait_synced() #5962

Closed
wants to merge 2 commits

Conversation

dbavatar
Contributor

@dbavatar dbavatar commented Apr 3, 2017

Force-flushing txgs can be painfully slow when competing for disk I/O,
since txg sync is a process meant to execute asynchronously. Optimize this
path by allowing data/hole seeking if the file is clean, but fall back
to the old logic if it is dirty. This is a compromise short of disabling the
feature entirely.

Signed-off-by: Debabrata Banerjee dbanerje@akamai.com

Motivation and Context

#4306

How Has This Been Tested?

Create bar.log > 32 kB via appending.
Create disk I/O with any tool at a queue depth of at least 1 on all pool members. I used 1 outstanding random I/O per disk.
Test the speed of "touch bar.log; time strace grep foo bar.log". When txgs must be synced, this takes seconds to minutes, hung at "lseek(3, 32768, SEEK_HOLE)". With the fix it takes a few milliseconds (the normal time to grep).
The fix should not break correctness; however, it falls back to not providing hole data when the file is dirty. This should satisfy everyone's use cases.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@mention-bot

@dbavatar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @ahrens and @tuxoko to be potential reviewers.

@behlendorf behlendorf requested a review from don-brady April 3, 2017 21:25
@dbavatar dbavatar force-pushed the dbavatar/lseek branch 4 times, most recently from 87a3d13 to 6773f2f on April 4, 2017 19:32
@@ -278,6 +278,10 @@ zfs_holey_common(struct inode *ip, int cmd, loff_t *off)
if (error == ESRCH)
return (SET_ERROR(ENXIO));

/* file was dirty, so fall back to using file_sz logic */
if (error == EBUSY)
Contributor

Rather than suppress this error I think it would be better to return immediately and allow it to percolate up to zpl_llseek() and then handle it there.

Contributor

@ryao ryao Apr 12, 2017

As I said in the comment thread and I am adding here for completeness, I disagree. The correct place is in zfs_vnops.c because there is no zpl_* analogue on other platforms. We'd risk introducing inconsistent behavior between platforms when the patch is adopted if someone misses the key change in the zpl_ function. Even if they do, they will need to put it here anyway and then we will need to do another patch to bring the platforms in sync.

Contributor

Yes, upon further reflection I'm OK with that and withdraw my objection. The existing patch leaves this check where it is.

@ryao
Contributor

ryao commented Apr 8, 2017

If the txg sync takes minutes, something is wrong that needs to be addressed. Ideally, this should be bounded by 5 seconds, which is still bad, but an order of magnitude lower.

I am not sure if I like this change because userland software might depend on the stricter behavior. Holes are a form of data after all. I suggest that we make this a module tunable instead of an unconditional change. That way we can introduce it in the next release and get feedback on the change while we decide whether to make it the default behavior.

@behlendorf The values we pass to zpl_llseek are supposed to be return codes that the VFS can emit, since zpl_* is a wrapper around the Illumos functions. Passing EBUSY to it sets a trap for other OpenZFS platforms adopting this code, where it would be easy to let an EBUSY return value leak into lseek. The documentation does not permit lseek to emit EBUSY, so that would cause divergence in behavior and potentially confuse userland software:

http://man7.org/linux/man-pages/man2/lseek.2.html
https://docs.oracle.com/cd/E26502_01/html/E29032/lseek-2.html
https://illumos.org/man/2/lseek

@dbavatar
Contributor Author

dbavatar commented Apr 8, 2017

If the txg sync takes minutes, something is wrong that needs to be addressed. Ideally, this should be bounded by 5 seconds, which is still bad, but on a lower order of magnitude.

No, it does not mean anything is wrong. I investigated this for a while. Even if ZFS owned entire block devices and had complete control over block scheduling (which is not universally true, and is definitely not true in my workload), ZFS asynchronous writeback must compete with other disk I/O. I can change the ZFS I/O ramp and make it go reliably faster, but not down to the level we want (microseconds). If you look at these parameters, it's quite clear that even with ZFS owning all block scheduling, txg writeback is still deprioritized by default. This is a good design decision; it prevents the writeback from adding extra latency.

I am not sure if I like this change because userland software might depend on the stricter behavior. Holes are a form of data after all. I suggest that we make this a module tunable instead of an unconditional change. That way we can introduce it in the next release and get feedback on the change while we decide whether to make it the default behavior.

It should not negatively impact userland. In the dirty case the client would have had to wait for the txg sync to complete anyway. I think the original code only helped one very specific userland workload but hurt everyone else. The original patch was an attempt to enable an accelerated path that didn't even exist in Linux until relatively recently. But we can make it tunable anyway.

@behlendorf The things we pass to zpl_llseek are supposed to be the return codes that the VFS can emit as zpl_ is a wrapper around the Illumos functions. Passing EBUSY to it will set a trap for other OpenZFS platforms when adopting this code where it would be easy to introduce an EBUSY return value into lseek. Various documentation does not permit lseek to emit EBUSY, so that would cause divergence in behavior and potentially confuse userland software:

I don't believe that's what he was suggesting. EBUSY will not be returned to userland here. There is no functional change suggested, only code clarity. I am traveling at the moment, but I plan to address this when I get back.

@ryao
Contributor

ryao commented Apr 8, 2017

@dbavatar If the txg commit takes minutes, dirty data can accumulate, leading to the write throttle. It is not a great situation, and it means the storage is overworked. I agree that deprioritizing txg sync writeback is a good decision.

As for my remark to @behlendorf, the issue is how code sharing works between various platforms. Moving this into the zpl_* function could cause this to change semantics when ported to Illumos, which would be bad. Here, the zfs_* function is the Illumos VFS function and we are wrapping it.

@behlendorf
Contributor

I am not sure if I like this change because userland software might depend on the stricter behavior.

There's a strong case to be made that any userland software which does depend on this behavior is broken. As long as we always return the correct data for a hole, which we do, how those holes are managed and reported is a filesystem specific detail. That said, I'm OK with adding a module option to disable this optimization.

Regarding code sharing, that's a good point. The other ZFS implementations don't have a zpl_* layer; they register the zfs_* functions directly with their VFS. If they were to adopt this change, they might see this EBUSY. Although presumably this has never been an issue for other platforms, since their system tools don't look for holes. I'm OK leaving the error handling where it is as long as the offset is updated correctly.

Contributor

@dinatale2 dinatale2 left a comment

This optimization looks fine to me. As @behlendorf suggested, a module option to disable the optimization could be added just in case the old behavior is necessary.

@ryao
Contributor

ryao commented Apr 11, 2017

@behlendorf A module option to disable it sounds good to me. After mulling it over some more, I think software that relies on this behavior is broken, but adding an option to re-enable it will make things easier for that software to transition.

Force flushing of txgs can be painfully slow when competing for disk
io, since this is a process meant to execute asynchronously. Optimize
this path by allowing data/hole seeking if the file is clean, but if
dirty fall back to the old logic. This is a compromise short of
disabling the feature entirely.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Makes the previous optimization optional, in case there is a use case
that breaks

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
@behlendorf
Contributor

@dbavatar thanks for adding the module option. I've merged this to master.
