Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BRT: Linux FICLONE truncates large files with dirty blocks #15728

Closed
rrevans opened this issue Jan 2, 2024 · 9 comments
Closed

BRT: Linux FICLONE truncates large files with dirty blocks #15728

rrevans opened this issue Jan 2, 2024 · 9 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rrevans
Copy link
Contributor

rrevans commented Jan 2, 2024

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 39
Kernel Version 6.6.8-200.fc39
Architecture x86_64
OpenZFS Version zfs-2.2.99-281_g07e95b467 (built from source)

Describe the problem you're observing

cp --reflink=always succeeds but the output file is truncated for previously synced files that are recently modified at a large enough offset such that the range clone partially succeeds.

In this case clone sees one or more transaction's worth of synced block pointers that can be cloned followed by dirty blocks that are unable to be cloned yet. ZFS only makes a partial clone but cp later reports success.

From a quick read of coreutils, linux, and zfs code:

  1. coreutils calls ioctl(int dest_fd, FICLONE, int src_fd)
  2. the kernel calls zpl_remap_file_range(src, 0, dst, 0, 0, 0)
  3. __zpl_clone_file_range calls zfs_clone_range
  4. which stops at the first dirty block and returns the number of bytes cloned
  5. this propagates to ioctl_file_clone which only fails if negative when len == 0

TL;DR the kernel seems to expect that length == 0 implies the whole file cloning operation must succeed or fail atomically. ZFS should return -EINVAL when len == 0 and the operation fails to clone all blocks of the file. IOW, it seems that the kernel API makes it the responsibility of the vfs implementation to check, though the documentation does not spell that out.)

Describe how to reproduce the problem

  1. set recordsize=128k
  2. create a file of 1022 blocks
  3. wait for sync
  4. modify the file to add another block at offset 1022
  5. use cp --reflink=always to copy the file

Expected result: target file is the same as source (or cp fails to clone)

Actual result: target file is a truncated copy of the source file

Example script:

fill() {
  dd if=/dev/zero | tr '\0' 'Z'
}

rm -f x y

dd if=<(fill) iflag=fullblock bs=128k count=1022 of=x
zpool sync
dd if=<(fill) iflag=fullblock bs=128k seek=1022 count=1 conv=notrunc of=x
cp --reflink=always x y

ls -l x y

Example output:

1022+0 records in
1022+0 records out
133955584 bytes (134 MB, 128 MiB) copied, 0.221042 s, 606 MB/s
1+0 records in
1+0 records out
131072 bytes (131 kB, 128 KiB) copied, 0.00131304 s, 99.8 MB/s
-rw-r--r--. 1 vmuser vmuser 134086656 Jan  2 07:05 x
-rw-r--r--. 1 vmuser vmuser 133955584 Jan  2 07:05 y

I have also reproduced this same issue using zfs-2.2.0 release (95785196f2) on kernel 6.5.12-100.fc37.x86_64 (Fedora 37).

This issue also occurs with sparse files - replacing the initial count=1022 with count=2 above has the same result.

Note that reproducing this is dependent on recordsize, but can be reproduced with other sizes by changing the bs= argument accordingly.

Also YMMV reproducing if the ZIL buffer sizes are different (and thus fit more than or fewer than 1022 block pointers per transaction).

Include any warning/errors/backtraces from the system logs

No warnings or errors in dmesg, /proc/spl/kstat/zfs/dbgmsg (same for debug build), or cp.

@rrevans rrevans added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jan 2, 2024
@rrevans rrevans changed the title cp --reflink=always truncates previously synced files with dirty blocks BRT: Linux FICLONE truncates large files with dirty blocks Jan 18, 2024
@rrevans
Copy link
Contributor Author

rrevans commented Jan 18, 2024

Polite bump and FYI @robn @amotin

This looks to me like a blocker for reenabling block cloning in any upcoming release (at least for OpenZFS on Linux).

I don't know if FreeBSD is similarly affected.

@Vlad1mir-D
Copy link

Sadly, this issue and #15715 still prevents me from enabling BRT on prod.

@pjd
Copy link
Contributor

pjd commented Jan 19, 2024

If FICLONE expects the clone to always be atomic, I can see two possible work-arounds (if we can recognize this is FICLONE what is calling zfs_clone_range()):

  1. Try to squeeze more blocks into a single transaction, but we need to have a limit. If FICLONE asks to clone a file larger than the limit, we simply return an error.
  2. If dmu_read_l0_bps() returns EAGAIN (block is not yet on disk), we wait for the current transaction group to be committed and we repeat the read.
    The latter is less reliable.
    Also, we could always return an error if this is FICLONE...

@pjd
Copy link
Contributor

pjd commented Jan 19, 2024

Sadly, this issue and #15715 still prevents me from enabling BRT on prod.

I'm confused. Is #15715 related to BRT?

@rrevans
Copy link
Contributor Author

rrevans commented Jan 19, 2024

@pjd I investigated more here and upon a closer look FICLONE does not need to be atomic, but it does need to report success or failure for the entire file since FICLONE does not accept ranges.

So I believe all that is missing is a check for result < file size after the call to zfs_clone_range stops at the first dirty block.

@behlendorf
Copy link
Contributor

Are we sure it doesn't need to be atomic? The documentation doesn't really say. If that's truly the case then we could wait for as many transactions as it takes to clone the entire file.

@rrevans
Copy link
Contributor Author

rrevans commented Jan 20, 2024

Are we sure it doesn't need to be atomic?

Well I'm not sure, though atomic semantics are pretty rare in filesystems. Non-atomic behavior would be the same as write or copy_file_range partial write, but there's no way to return a length.

The lack of documented semantics for the output upon error implies it does not need to be atomic especially since the atomicity of the contents is documented:

ioctrl_fioclonerange(2):

Clones are atomic with regards to concurrent writes, so no locks need
to be taken to obtain a consistent cloned copy.

Allowing partial cloning would defeat this since a filesystem that can only partially clone cannot consistently clone an arbitrary range or entire file.

Meanwhile none of the defined error codes describe a partial clone outcome...

Documentation/filesystems/locking.rst repeats the atomicity of contents copying without describing errors:

->copy_file_range and ->remap_file_range implementations need to serialize
against modifications of file data while the operation is running.

Documentation/filesystems/vfs.rst seems to say both that it can return a partial result ("any bytes") and also cannot be fewer unless REMAP_FILE_CAN_SHORTEN is given:

The return value should the number of bytes
remapped, or the usual negative error code if errors occurred
before any bytes were remapped.
...
If REMAP_FILE_CAN_SHORTEN is set, the caller is
ok with the implementation shortening the request length to
satisfy alignment or EOF requirements (or any other reason)

Reconciling this, it seems the vfs should not shorten the cloned range absent REMAP_FILE_CAN_SHORTEN to respect the caller trying to obtain an atomic clone. But, in the event of an unavoidable error, the vfs should report it by returning a partial length.

It seems allowable to treat dirty blocks as uncloneable and thus unavoidable, though it makes the feature less useful.

Nothing at all is said about the effect on the destination file upon error. Lacking a way to return a length, it seems reasonable that the output file state is undefined upon error i.e. not atomically overwritten. (It would be surprising to clobber the output in arbitrary ways, so maybe only states consistent with attempting clone should be allowed.)

Given all this, it seems adding a simple length check is sufficient, though a better implementation could do more. (Ideally recording pending clones in dirty dbufs - not forcing frequent performance-harming txg syncs.)

behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 1, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifing that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Issue openzfs#15728
@behlendorf
Copy link
Contributor

@rrevans @pjd @robn can you please take a look at the proposed fix in PR #15842. I'd appreciate the feedback.

Where I ended up on this is a version of 2. from Pawel's comment above.

If dmu_read_l0_bps() returns EAGAIN (block is not yet on disk), we wait for the current transaction group to be committed and we repeat the read.
The latter is less reliable.
Also, we could always return an error if this is FICLONE...

My reasoning is that with FICLONE / FICLONERANGE we want to make every reasonable effort to clone the bulks. Particularly given our inability to return any useful progress information to the caller. Waiting on the transaction groups when there's outstanding dirty records may be slow but it's what I'd personally expect in this case. We could consider adding yet-another-module-option if there's a scenario where we think this is less than ideal.

Nothing at all is said about the effect on the destination file upon error. Lacking a way to return a length, it seems reasonable that the output file state is undefined upon error i.e. not atomically overwritten. (It would be surprising to clobber the output in arbitrary ways, so maybe only states consistent with attempting clone should be allowed.)

As I mentioned in the PR this implementation doesn't provide strict atomicity but I think it comes close enough. It is possible to leave the destination file range in an undefined state if there's an error part way through the clone. Or if the node crashes during the system call. I considered zeroing that portion of the destination file on error but in the end decided that just added more complexity and it wasn't helpful anyway.

behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 1, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Issue openzfs#15728
@rrevans
Copy link
Contributor Author

rrevans commented Feb 1, 2024

Thanks @behlendorf for the fix. I'll rerun my reproducer and post results here in #15842.

I'm still concerned that forcing syncs is prohibitively expensive for some workloads like builds that may write and immediately copy various files, though having the feature work without regression is certainly strictly better.

behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 2, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Issue openzfs#15728
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 2, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Issue openzfs#15728
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 2, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Issue openzfs#15728
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 5, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file rangelock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Issue openzfs#15728
tonyhutter added a commit that referenced this issue Feb 22, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15728
Closes #15842
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 13, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#15728
Closes openzfs#15842
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 13, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#15728
Closes openzfs#15842
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 13, 2024
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#15728
Closes openzfs#15842
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

4 participants