Periodic error with SEEK_HOLE results (likely timing-related) in recent git version(s) #6050
Comments
Related to 66aca24? @siebenmann can you try and reproduce this same behaviour you're describing on a machine with the tunable |
This doesn't seem to reproduce at all with |
I've been chasing this bug in circles all day. At first I thought it was a problem with Gentoo, then a problem with grep, then libtool... Setting edit: this also appears to fix my inability to compile gcc 5.4.0-r3.
|
I just checked and the filesystems that this happens most readily on for me have the default of As a test of this I temporarily set |
FWIW, all my pools run with both atime and relatime turned off. relatime gets turned on "temporarily" on the bootfs dataset at boot, which I turn off, and /var/tmp/portage resides on another dataset with both options turned off. |
Now that I look at the code in I suspect that the correct fix is something like the following change for
This explicitly sets the offset that will be returned to the end of the file for the dnode-is-dirty case, basically reporting the dnode as having no holes per the comment before |
@siebenmann right, in the EBUSY case we can't rely on |
@siebenmann You're right, it's doing the wrong thing in that case. I will submit a change. |
@siebenmann are you able to test dbavatar@4f21a00 ? |
I'm afraid I won't be able to test this in the near future, although it looks good to me. (Reproduction and testing should be relatively simple in a test environment, since I believe that all you need is a filesystem with atime on and relatime off, then you can use eg the Python example in my earlier comments to test the |
@dbavatar when you get a chance could you open a PR with the proposed fix. It'd be great to add the test case as well to the test suite. |
PR is in, but haven't been successful in hitting that path yet. |
@bunder2015 Would you test dbavatar/zfs@4f21a00? @dbavatar You could download a Gentoo stage3, put it in a chroot and try |
Looks good to me now, thanks |
Thanks everyone. The fix in #6053 has been merged. |
It looks like this is back in zfs master 0.7.0-rc4_12_g3d6da72. Using zfs_dmu_offset_next_sync=1 restores regular behaviour.
|
Ugh. Well I guess I will have to figure out how to reproduce. I wonder if this is a problem with the new algorithm or that code would fail anyway if this functionality was disabled entirely, because it really should not in that case. |
Are we sure this is the same issue? We should try and write a little torture test for the ZTS. It could create a bunch of threads which occasionally |
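As a rough illustration of the torture-test idea, here is a minimal Python sketch. It is a stand-in, not a ZTS test: the function name `torture`, its parameters, and the choice of what each thread does (rewrite a file, read it back, then call SEEK_HOLE) are all assumptions made for this example, since the comment above does not spell out the thread behaviour.

```python
import os
import threading


def torture(directory, threads=4, iterations=50, length=33005):
    """Hypothetical stress loop: each thread rewrites, reads, and SEEK_HOLEs
    its own file inside `directory`.

    Returns a list of (thread_index, iteration, hole_offset) for every
    lseek(SEEK_HOLE) result that was not the end of the file.
    """
    failures = []
    lock = threading.Lock()

    def worker(idx):
        path = os.path.join(directory, "seekhole-%d" % idx)
        for i in range(iterations):
            # Write a fully populated file, so it contains no real holes.
            with open(path, "wb") as f:
                f.write(b"x" * length)
            fd = os.open(path, os.O_RDONLY)
            try:
                # Read the file back before seeking, as in the report.
                while os.read(fd, 65536):
                    pass
                hole = os.lseek(fd, 0, os.SEEK_HOLE)
            finally:
                os.close(fd)
            # With no holes, the only "hole" is the implicit one at EOF.
            if hole != length:
                with lock:
                    failures.append((idx, i, hole))

    pool = [threading.Thread(target=worker, args=(n,)) for n in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return failures
```

An empty list means no bogus hole was seen on that run. Given how timing-dependent the problem appears to be, an empty result would not prove its absence.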
I set the tunable back to 0 yesterday and ran some more updates, I can't reproduce it now. I'll keep an eye on it for now. Sorry for the noise. edit: I'd be willing to test anything you can pass along as well. |
System information
Describe the problem you're observing
Sometimes a SEEK_HOLE lseek() will report that a file has a hole when it doesn't. The direct manifestation of the problem is that this causes grep to believe that a text file is a binary file and so report just 'Binary file WHATEVER matches'. This appears to be quite dependent on the specific operations involved and the exact timing (when I ran grep under gdb and was stepping through the code in question it didn't reproduce, although I could see it in strace), and even on the same files it doesn't always happen.
For an affected file, I can reproduce this in the Python repl:
This file doesn't have any holes, and if you don't read() or add enough of a pause the results are different:
Inserting a three second sleep between the read and the lseek will sometimes make this manifest and other times not. Four seconds seems to be a sufficient delay to make it reliably return the correct results. Short waits reliably make it return the wrong results.
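The REPL sessions themselves are not preserved in this copy of the report. As a minimal sketch of the kind of check described (read the file, optionally pause, then ask for the first hole), something like the following could be used. The helper name `first_hole_after_read` and its `delay` parameter are assumptions for illustration, not the original commands; only `os.lseek` and `os.SEEK_HOLE` are standard.

```python
import os
import time


def first_hole_after_read(path, delay=0.0):
    """Read the whole file, optionally sleep, then return (hole_offset, size).

    For a file with no holes, SEEK_HOLE should report the implicit hole at
    end of file, so hole_offset should equal size.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Read everything first; the report only misbehaves after a read().
        while os.read(fd, 65536):
            pass
        if delay:
            # Hypothetical knob to try the short and long waits described.
            time.sleep(delay)
        hole = os.lseek(fd, 0, os.SEEK_HOLE)
        return hole, size
    finally:
        os.close(fd)
```

On an affected file, the behaviour described above would show up as `hole` being smaller than `size` for short delays and equal to it once the delay is long enough.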
This doesn't appear tied to the actual file object in question. On the most consistent reproducer, I can copy the file (including with cat) and the copy has the problem as well. On some files this seems extremely consistent; on others it comes and goes; I can run a test command and on consecutive runs it will report:
This may have something to do with the file size. I can readily reproduce this with a completely synthetic file of length 33005, created in Python's REPL with:
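The creation command is not preserved in this copy of the report. The sketch below shows one way to build a hole-free file of that length and count how often a read followed by SEEK_HOLE disagrees with the file size. The file contents, the `fsync` call, and the helper names `make_synthetic_file` and `count_bogus_holes` are assumptions for illustration only.

```python
import os


def make_synthetic_file(path, length=33005):
    """Write `length` bytes of plain ASCII text (no holes) to `path`."""
    line = b"abcdefghijklmnopqrstuvwxyz0123456789\n"
    data = (line * (length // len(line) + 1))[:length]
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        # Assumption: flush to disk so pending writes are not a factor.
        os.fsync(f.fileno())
    return len(data)


def count_bogus_holes(path, runs=10):
    """Read then SEEK_HOLE `runs` times; count results short of the size."""
    bogus = 0
    for _ in range(runs):
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            while os.read(fd, 65536):
                pass
            if os.lseek(fd, 0, os.SEEK_HOLE) != size:
                bogus += 1
        finally:
            os.close(fd)
    return bogus
```

A count that varies between consecutive runs on the same file would match the "comes and goes" behaviour described above.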
This reproduces across two different Fedora machines, although they have the same git near-tip ZFS versions.