Move to ZFS volume creates correctly sized files filled with \0 #8816
I suspect the origin files have to be very fresh to reproduce this bug. The most commonly affected files are those which were touched last.
Reported upstream to Gentoo: https://bugs.gentoo.org/635002#c18
Pool features: I did turn the new features on.
I have plenty of systems running 0.8.0 on Gentoo, but most of them are running non-upgraded pools. I've been hit by a previous iteration of this, but can't observe it with 0.8.0 yet. I've run a manual trim a couple of times, but no autotrim yet.
I have managed to reproduce this without installing packages (though it still requires Portage's Python libraries to be installed). Python 3 script here: https://gist.github.com/abrasive/dd85c7bb4686200927df660a2b4f1d93
It also reproduces the issue when copying within the same dataset; the cross-dataset thing is a red herring. The repro works if a scrub is running on the pool at the time. I've tried loading the pool in other ways (e.g. reading all the files on the FS sequentially in the background) but have not had a measurable repro rate with those methods.
Does it still do it if you disable native-extensions for the build?
@beren12 no, with native-extensions disabled the problem does not occur.
I have managed to observe the issue at the syscall level. The problem here is that the copy routine is sparsity-aware, and so it looks for the first data chunk by doing an lseek(.., SEEK_DATA). For example, here is a trace of a successful copy operation:

And here is the immediately following copy, which failed:
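For readers less familiar with sparse-aware copying: the pattern referred to above alternates lseek(.., SEEK_DATA) and lseek(.., SEEK_HOLE) to locate data extents and copies only those. The sketch below is purely illustrative (it is not Portage's implementation, and the helper name is made up); it shows why a spuriously returned ENXIO makes the copier treat the whole file as a hole and produce a correctly sized but all-zero destination.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative sparse-aware copy loop (not Portage's actual code).
 * If the very first lseek(.., SEEK_DATA) spuriously returns ENXIO, the
 * loop believes the entire file is a hole and copies nothing, leaving a
 * correctly sized, all-zero destination. Error-path cleanup abbreviated. */
int sparse_copy(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;

    if (in < 0 || out < 0 || fstat(in, &st) < 0)
        return -1;

    /* Give the destination the full apparent size up front. */
    if (ftruncate(out, st.st_size) < 0)
        return -1;

    off_t pos = 0;
    while (pos < st.st_size) {
        off_t data = lseek(in, pos, SEEK_DATA);
        if (data < 0) {
            if (errno == ENXIO)   /* "no data past pos": rest is a hole */
                break;            /* wrong conclusion if ENXIO was spurious */
            return -1;
        }
        off_t hole = lseek(in, data, SEEK_HOLE);
        if (hole < 0)
            return -1;

        /* Copy the data extent [data, hole). */
        char buf[65536];
        for (off_t off = data; off < hole; ) {
            size_t want = (size_t)(hole - off);
            if (want > sizeof(buf))
                want = sizeof(buf);
            ssize_t n = pread(in, buf, want, off);
            if (n <= 0 || pwrite(out, buf, (size_t)n, off) != n)
                return -1;
            off += n;
        }
        pos = hole;
    }

    close(in);
    close(out);
    return 0;
}
```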
I should note, for the record, that the test script in question creates 1000 files in a directory sequentially, then copies them all, so they've all been closed by the time they are copied.
Here is a pure C repro of the issue: https://gist.github.com/abrasive/f062f967cc41c797bd08ec11f30f8fbf. As before, I find I need to be loading the pool to reliably trigger the bug; scrubbing remains particularly effective.
Tested also on: Gentoo, Raspbian.
What happens if you change zfs_dmu_offset_next_sync (https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_dmu_offset_next_sync)?
@richardelling I see no change; the bug still occurs.
Issue: openzfs/zfs#8816
Issue: openzfs/zfs#8778
Bug: https://bugs.gentoo.org/635002
Package-Manager: Portage-2.3.67, Repoman-2.3.12
Signed-off-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Thanks @behlendorf! Reverting ec4f9b8 works for me. The test code is all yours to use as you wish.

How quickly did it tend to reproduce for you? On my NVMe system it would sometimes take 20+ runs, which misled me when I tried doing a bit of bisection myself.
@abrasive thanks. Surprisingly quickly in my test environment; I was able to reproduce the issue usually in under a minute, but it would take multiple runs.
Do I have to recreate the pools afterwards, or can I leave them like this? I had just reinstalled with 0.8.0 before I saw this. Thanks...
@misterhsp this issue would not cause any damage to your pool. What could happen, is that immediately after performing a write there was a race where a call to lseek(.., SEEK_DATA) could incorrectly return ENXIO. This indicated that the file offset was past the end of the file and could result in data not being copied. Gentoo's portage was particularly good at hitting this, and they've already patched their packages.
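To make the race concrete, here is a simplified, hypothetical illustration of the sequence just described (it is not the reproducer from the gist linked earlier): write a file, close it, reopen it, and immediately ask SEEK_DATA where the data begins. On a fixed system the answer is always offset 0; on an affected system under load (e.g. while a scrub is running) the call can spuriously fail with ENXIO.

```c
/* Simplified illustration of the race described above; not the gist's
 * reproducer. Each iteration writes a small file, closes it, reopens it,
 * and immediately asks where its data starts. On an affected kernel the
 * lseek(.., SEEK_DATA) can spuriously fail with ENXIO, as if the freshly
 * written file were one big hole. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char payload[] = "definitely not a hole\n";

    for (int i = 0; i < 1000; i++) {
        char name[64];
        snprintf(name, sizeof(name), "seekdata-race.%d", i);

        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        if (write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload))
            return 1;
        close(fd);

        /* The file was just written, so SEEK_DATA from offset 0 should
         * return 0. ENXIO here claims offset 0 is beyond the data. */
        fd = open(name, O_RDONLY);
        if (fd < 0)
            return 1;
        if (lseek(fd, 0, SEEK_DATA) < 0 && errno == ENXIO)
            printf("spurious ENXIO on %s\n", name);
        close(fd);
        unlink(name);
    }
    return 0;
}
```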
On 01.06.19 at 03:41, Brian Behlendorf wrote:

> @misterhsp this issue would not cause any damage to your pool. What could happen, is that immediately after performing a write there was a race where a call to lseek(.., SEEK_DATA) could incorrectly return ENXIO. This indicated that the file offset was past the end of the file and could result in data not being copied. Gentoo's portage was particularly good at hitting this, and they've already patched their packages.
Debian already has this patch:

* Revert ec4f9b8 to avoid potential dataloss. (See: Github#8816)

If I understood everything correctly, can I leave my pools like this even though they were created with the Debian packages that didn't have this patch yet?
@misterhsp Yes, your pools should be left alone. The effect of this defect is that, when using sparse-aware tools to copy/move newly created files from a ZFS dataset, the copied files in the destination might be empty. This combination of circumstances is uncommon, but if it did trigger, then those destination files are just regular empty files. There is no corruption of the pool data structures.
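For anyone wanting to audit existing copies, the symptom is a destination file whose size looks right but whose contents are nothing but zero bytes. A rough sketch of such a check (a hypothetical helper, not part of ZFS or Portage) might look like the following; note that files which are legitimately all zeros will also be flagged, so any hits still need to be compared against the original source or package contents.

```c
/* Rough check for the symptom described above: a file with non-zero size
 * whose contents read back as only zero bytes. Hypothetical helper, not
 * part of ZFS or Portage; legitimate all-zero files will also match. */
#include <stdio.h>

static int is_all_zero(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    unsigned char buf[65536];
    size_t n;
    long long total = 0;
    int all_zero = 1;

    while (all_zero && (n = fread(buf, 1, sizeof(buf), f)) > 0) {
        total += (long long)n;
        for (size_t i = 0; i < n; i++) {
            if (buf[i] != 0) {
                all_zero = 0;
                break;
            }
        }
    }
    fclose(f);

    /* Only non-empty, entirely zero files match the symptom. */
    return total > 0 && all_zero;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        if (is_all_zero(argv[i]) == 1)
            printf("%s looks null-filled\n", argv[i]);
    return 0;
}
```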
On 01.06.19 at 12:10, James Laird-Wah wrote:

> @misterhsp Yes, your pools should be left alone. The effect of this defect is that, when using sparse-aware tools to copy/move newly created files from a ZFS dataset, the copied files in the destination might be empty. This combination of circumstances is uncommon, but if it did trigger, then those destination files are just regular empty files. There is no corruption of the pool data structures.
Thank you, now I get it.
...
This reverts commit ec4f9b8, which introduced a narrow race which can lead to lseek(.., SEEK_DATA) incorrectly returning ENXIO. Resolve the issue by reverting this change to restore the previous behavior, which depends solely on checking the dirty list.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8816
Closes #8834
Only slightly related comment regarding nopwrite: NOP write needs a strong checksum.
System information
Describe the problem you're observing
Since upgrading to 0.8.0 and attempting to upgrade system packages, I began to find newly installed files filled with nulls, but with sane sizes.
On this system, packages are compiled on one dataset (compression=lz4), before being installed to other datasets (compression=lz4 and compression=gzip).

The pattern of nulling is not consistent, but it is bursty, i.e. several related files with similar inode numbers, which were copied sequentially, tend to be affected.
The behaviour is similar to that observed in #3125. Setting zfs_nopwrite_enabled=0 appears to have no effect. dedup=off, checksum=on, sync=standard on all datasets.

This problem was not observed on 0.7.13, which I had been running for quite some time. Portage has not changed, so I think this is triggered by a ZFS change. Unfortunately, like an optimist, I upgraded the pool already, and since feature@spacemap_v2=active I can't go back to test v0.7.13 again.

Disabling portage's native-extensions USE flag, preventing portage from using fancy kernel ops for faster file copying, prevents the problem from occurring.

Examining a nulled file with zdb -ddddd:

Describe how to reproduce the problem

Choosing iasl as a small package which has exhibited the problem, we can repeatedly reinstall it without rebuilding it every time until the problem occurs:

This will typically cause a failure within 2-3 cycles.
I have attempted to write a test script to copy files using Portage's native extension, but have been unable to reproduce the issue (recursively copying a directory of files from one dataset to another, or generating a single directory of files and then copying those).
Include any warning/errors/backtraces from the system logs
No warnings or errors observed.