SEEK_HOLE should not block on txg_wait_synced() #5962
@dbavatar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @ahrens and @tuxoko to be potential reviewers.
87a3d13 to 6773f2f
@@ -278,6 +278,10 @@ zfs_holey_common(struct inode *ip, int cmd, loff_t *off)
	if (error == ESRCH)
		return (SET_ERROR(ENXIO));

	/* file was dirty, so fall back to using file_sz logic */
	if (error == EBUSY)
Rather than suppress this error I think it would be better to return immediately and allow it to percolate up to zpl_llseek() and then handle it there.
As I said in the comment thread, and am repeating here for completeness, I disagree. The correct place is in zfs_vnops.c because there is no zpl_* analogue on other platforms. We'd risk introducing inconsistent behavior between platforms when the patch is adopted if someone misses the key change in the zpl_ function. Even if they do, they will need to put it here anyway and then we will need to do another patch to bring the platforms in sync.
Yes, upon further reflection I'm OK with that and withdraw my objection. The existing patch leaves this check where it is.
If the txg sync takes minutes, something is wrong that needs to be addressed. Ideally, this should be bounded by 5 seconds, which is still bad, but on a lower order of magnitude.
I am not sure if I like this change because userland software might depend on the stricter behavior. Holes are a form of data after all. I suggest that we make this a module tunable instead of an unconditional change. That way we can introduce it in the next release and get feedback on the change while we decide whether to make it the default behavior.
@behlendorf The things we pass to zpl_llseek are supposed to be the return codes that the VFS can emit, as zpl_ is a wrapper around the Illumos functions. Passing EBUSY to it will set a trap for other OpenZFS platforms when adopting this code, where it would be easy to introduce an EBUSY return value into lseek. Various documentation does not permit lseek to emit EBUSY, so that would cause divergence in behavior and potentially confuse userland software: http://man7.org/linux/man-pages/man2/lseek.2.html
No, it does not mean anything is wrong. I investigated this for a while. Even if ZFS owned entire block devices and had complete control over block scheduling, this is not universally true, and it is definitely not true in my workload. ZFS asynchronous writeback must compete with other disk I/O. I can change the ZFS I/O ramp and make it go reliably faster, but it is not down to the level we want (microseconds). If you look at these parameters, it's quite clear that even with ZFS owning all block scheduling, TXG writeback is still deprioritized by default. This is a good design decision; it prevents extra latency from the writeback.
It should not negatively impact userland. In the dirty case the client would have had to wait for the txg_sync to complete anyway. I think the original code only helped for a very specific userland workload, but hurt everyone else. The original patch is an attempt to enable an accelerated path that didn't even exist in Linux a while ago. But we can make it tunable anyway.
I don't believe that's what he was suggesting. EBUSY will not be returned to the user here. There is no functional change suggested, only code clarity. I am traveling at the moment, but I plan to address this when I get back.
@dbavatar If the txg commit takes minutes, dirty data can accumulate, leading to the throttle. It is not a great situation and it means that the storage is overworked. I agree that deprioritizing txg sync writeback is a good decision. As for my remark to @behlendorf, the issue is how code sharing works between various platforms. Moving this into the zpl_* function could cause this to change semantics when ported to Illumos, which would be bad. Here, the zfs_* function is the Illumos VFS function and we are wrapping it.
There's a strong case to be made that any userland software which does depend on this behavior is broken. As long as we always return the correct data for a hole, which we do, how those holes are managed and reported is a filesystem specific detail. That said, I'm OK with adding a module option to disable this optimization. Regarding code sharing that's a good point. The other ZFS implementations don't have a
This optimization looks fine to me. As @behlendorf suggested, a module option to disable the optimization could be added just in case the old behavior is necessary.
@behlendorf A module option to disable it sounds good to me. After mulling it over some more, I think software that relies on this behavior is broken, but adding an option to re-enable it will make things easier for that software to transition.
Force flushing of txgs can be painfully slow when competing for disk I/O, since this is a process meant to execute asynchronously. Optimize this path by allowing data/hole seeking if the file is clean, but if it is dirty fall back to the old logic. This is a compromise compared with disabling the feature entirely.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
6773f2f to 89a7851
Makes the previous optimization optional, in case there is a use case that breaks.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
89a7851 to 01de657
@dbavatar thanks for adding the module option. I've merged this to master.
Force flushing of txgs can be painfully slow when competing for disk I/O,
since this is a process meant to execute asynchronously. Optimize this path
by allowing data/hole seeking if the file is clean, but if it is dirty fall back
to the old logic. This is a compromise compared with disabling the feature entirely.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Motivation and Context
#4306
How Has This Been Tested?
Create bar.log > 32kB by appending.
Generate disk I/O with any tool, at a queue depth of at least 1, on all pool members. I used 1 outstanding random I/O per disk.
Test the speed of "touch bar.log; time strace grep foo bar.log". When TXGs must be synced, this takes seconds to minutes, hanging at "lseek(3, 32768, SEEK_HOLE)". With the fix it takes a few milliseconds (the normal time to grep).
The fix should not break correctness; however, it falls back to not providing hole data when the file is dirty. This should satisfy everyone's use cases.
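The stall can also be observed without strace by timing the seek directly. The helper below is a sketch under stated assumptions: `timed_seek_hole()` is a hypothetical name, the fallback `SEEK_HOLE` value is the Linux one and only applies if the headers do not define it, and the function only measures the single lseek call that the strace output above shows hanging.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes SEEK_HOLE in glibc's <unistd.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#ifndef SEEK_HOLE
#define SEEK_HOLE 4 /* assumption: Linux value, used only as a fallback */
#endif

/*
 * Seek to the next hole at or after `start` in `path` and report how long
 * the lseek call took. Returns the hole offset, or -1 with errno set.
 */
off_t
timed_seek_hole(const char *path, off_t start, double *elapsed)
{
	struct timespec t0, t1;
	off_t hole;
	int fd, saved;

	assert(path != NULL && elapsed != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	hole = lseek(fd, start, SEEK_HOLE);
	saved = errno; /* keep lseek's errno across the calls below */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fd);
	errno = saved;

	*elapsed = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return (hole);
}
```

Calling `timed_seek_hole("bar.log", 32768, &secs)` right after touching the file, while the pool is busy, should show the difference between the old and new behavior in `secs`.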