BRT: Linux FICLONE truncates large files with dirty blocks #15728

rrevans · 2024-01-02T14:03:25Z

System information

Type	Version/Name
Distribution Name	Fedora
Distribution Version	39
Kernel Version	6.6.8-200.fc39
Architecture	x86_64
OpenZFS Version	zfs-2.2.99-281_g07e95b467 (built from source)

Describe the problem you're observing

cp --reflink=always succeeds but the output file is truncated for previously synced files that are recently modified at a large enough offset such that the range clone partially succeeds.

In this case clone sees one or more transaction's worth of synced block pointers that can be cloned followed by dirty blocks that are unable to be cloned yet. ZFS only makes a partial clone but cp later reports success.

From a quick read of coreutils, linux, and zfs code:

coreutils calls ioctl(int dest_fd, FICLONE, int src_fd)
the kernel calls zpl_remap_file_range(src, 0, dst, 0, 0, 0)
__zpl_clone_file_range calls zfs_clone_range
which stops at the first dirty block and returns the number of bytes cloned
this propagates to ioctl_file_clone which only fails if negative when len == 0

TL;DR the kernel seems to expect that length == 0 implies the whole file cloning operation must succeed or fail atomically. ZFS should return -EINVAL when len == 0 and the operation fails to clone all blocks of the file. IOW, it seems that the kernel API makes it the responsibility of the vfs implementation to check, though the documentation does not spell that out.)

Describe how to reproduce the problem

set recordsize=128k
create a file of 1022 blocks
wait for sync
modify the file to add another block at offset 1022
use cp --reflink=always to copy the file

Expected result: target file is the same as source (or cp fails to clone)

Actual result: target file is a truncated copy of the source file

Example script:

fill() {
  dd if=/dev/zero | tr '\0' 'Z'
}

rm -f x y

dd if=<(fill) iflag=fullblock bs=128k count=1022 of=x
zpool sync
dd if=<(fill) iflag=fullblock bs=128k seek=1022 count=1 conv=notrunc of=x
cp --reflink=always x y

ls -l x y

Example output:

1022+0 records in
1022+0 records out
133955584 bytes (134 MB, 128 MiB) copied, 0.221042 s, 606 MB/s
1+0 records in
1+0 records out
131072 bytes (131 kB, 128 KiB) copied, 0.00131304 s, 99.8 MB/s
-rw-r--r--. 1 vmuser vmuser 134086656 Jan  2 07:05 x
-rw-r--r--. 1 vmuser vmuser 133955584 Jan  2 07:05 y

I have also reproduced this same issue using zfs-2.2.0 release (95785196f2) on kernel 6.5.12-100.fc37.x86_64 (Fedora 37).

This issue also occurs with sparse files - replacing the initial count=1022 with count=2 above has the same result.

Note that reproducing this is dependent on recordsize, but can be reproduced with other sizes by changing the bs= argument accordingly.

Also YMMV reproducing if the ZIL buffer sizes are different (and thus fit more than or fewer than 1022 block pointers per transaction).

Include any warning/errors/backtraces from the system logs

No warnings or errors in dmesg, /proc/spl/kstat/zfs/dbgmsg (same for debug build), or cp.

The text was updated successfully, but these errors were encountered:

rrevans · 2024-01-18T05:20:20Z

Polite bump and FYI @robn @amotin

This looks to me like a blocker for reenabling block cloning in any upcoming release (at least for OpenZFS on Linux).

I don't know if FreeBSD is similarly affected.

Vlad1mir-D · 2024-01-19T20:45:16Z

Sadly, this issue and #15715 still prevents me from enabling BRT on prod.

pjd · 2024-01-19T22:03:51Z

If FICLONE expects the clone to always be atomic, I can see two possible work-arounds (if we can recognize this is FICLONE what is calling zfs_clone_range()):

Try to squeeze more blocks into a single transaction, but we need to have a limit. If FICLONE asks to clone a file larger than the limit, we simply return an error.
If dmu_read_l0_bps() returns EAGAIN (block is not yet on disk), we wait for the current transaction group to be committed and we repeat the read.
The latter is less reliable.
Also, we could always return an error if this is FICLONE...

pjd · 2024-01-19T22:06:10Z

Sadly, this issue and #15715 still prevents me from enabling BRT on prod.

I'm confused. Is #15715 related to BRT?

rrevans · 2024-01-19T22:26:12Z

@pjd I investigated more here and upon a closer look FICLONE does not need to be atomic, but it does need to report success or failure for the entire file since FICLONE does not accept ranges.

So I believe all that is missing is a check for result < file size after the call to zfs_clone_range stops at the first dirty block.

behlendorf · 2024-01-19T22:58:05Z

Are we sure it doesn't need to be atomic? The documentation doesn't really say. If that's truly the case then we could wait for as many transactions as it takes to clone the entire file.

rrevans · 2024-01-20T03:53:14Z

Are we sure it doesn't need to be atomic?

Well I'm not sure, though atomic semantics are pretty rare in filesystems. Non-atomic behavior would be the same as write or copy_file_range partial write, but there's no way to return a length.

The lack of documented semantics for the output upon error implies it does not need to be atomic especially since the atomicity of the contents is documented:

ioctrl_fioclonerange(2):

Clones are atomic with regards to concurrent writes, so no locks need
to be taken to obtain a consistent cloned copy.

Allowing partial cloning would defeat this since a filesystem that can only partially clone cannot consistently clone an arbitrary range or entire file.

Meanwhile none of the defined error codes describe a partial clone outcome...

Documentation/filesystems/locking.rst repeats the atomicity of contents copying without describing errors:

->copy_file_range and ->remap_file_range implementations need to serialize
against modifications of file data while the operation is running.

Documentation/filesystems/vfs.rst seems to say both that it can return a partial result ("any bytes") and also cannot be fewer unless REMAP_FILE_CAN_SHORTEN is given:

The return value should the number of bytes
remapped, or the usual negative error code if errors occurred
before any bytes were remapped.
...
If REMAP_FILE_CAN_SHORTEN is set, the caller is
ok with the implementation shortening the request length to
satisfy alignment or EOF requirements (or any other reason)

Reconciling this, it seems the vfs should not shorten the cloned range absent REMAP_FILE_CAN_SHORTEN to respect the caller trying to obtain an atomic clone. But, in the event of an unavoidable error, the vfs should report it by returning a partial length.

It seems allowable to treat dirty blocks as uncloneable and thus unavoidable, though it makes the feature less useful.

Nothing at all is said about the effect on the destination file upon error. Lacking a way to return a length, it seems reasonable that the output file state is undefined upon error i.e. not atomically overwritten. (It would be surprising to clobber the output in arbitrary ways, so maybe only states consistent with attempting clone should be allowed.)

Given all this, it seems adding a simple length check is sufficient, though a better implementation could do more. (Ideally recording pending clones in dirty dbufs - not forcing frequent performance-harming txg syncs.)

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file rangelock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifing that `cp --reflink=never|auto|always` works correctly. Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov> Issue openzfs#15728

behlendorf · 2024-02-01T00:33:26Z

@rrevans @pjd @robn can you please take a look at the proposed fix in PR #15842. I'd appreciate the feedback.

Where I ended up on this is a version of 2. from Pawel's comment above.

If dmu_read_l0_bps() returns EAGAIN (block is not yet on disk), we wait for the current transaction group to be committed and we repeat the read.
The latter is less reliable.
Also, we could always return an error if this is FICLONE...

My reasoning is that with FICLONE / FICLONERANGE we want to make every reasonable effort to clone the bulks. Particularly given our inability to return any useful progress information to the caller. Waiting on the transaction groups when there's outstanding dirty records may be slow but it's what I'd personally expect in this case. We could consider adding yet-another-module-option if there's a scenario where we think this is less than ideal.

Nothing at all is said about the effect on the destination file upon error. Lacking a way to return a length, it seems reasonable that the output file state is undefined upon error i.e. not atomically overwritten. (It would be surprising to clobber the output in arbitrary ways, so maybe only states consistent with attempting clone should be allowed.)

As I mentioned in the PR this implementation doesn't provide strict atomicity but I think it comes close enough. It is possible to leave the destination file range in an undefined state if there's an error part way through the clone. Or if the node crashes during the system call. I considered zeroing that portion of the destination file on error but in the end decided that just added more complexity and it wasn't helpful anyway.

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file rangelock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov> Issue openzfs#15728

rrevans · 2024-02-01T14:21:46Z

Thanks @behlendorf for the fix. I'll rerun my reproducer and post results ~~here~~ in #15842.

I'm still concerned that forcing syncs is prohibitively expensive for some workloads like builds that may write and immediately copy various files, though having the feature work without regression is certainly strictly better.

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file rangelock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov> Issue openzfs#15728

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15728 Closes #15842

On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#15728 Closes openzfs#15842

rrevans added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jan 2, 2024

rrevans changed the title ~~cp --reflink=always truncates previously synced files with dirty blocks~~ BRT: Linux FICLONE truncates large files with dirty blocks Jan 18, 2024

behlendorf mentioned this issue Feb 1, 2024

BRT: Fix FICLONE/FICLONERANGE shortened copy #15842

Merged

13 tasks

behlendorf closed this as completed in 6dccdf5 Feb 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BRT: Linux FICLONE truncates large files with dirty blocks #15728

BRT: Linux FICLONE truncates large files with dirty blocks #15728

rrevans commented Jan 2, 2024

rrevans commented Jan 18, 2024

Vlad1mir-D commented Jan 19, 2024

pjd commented Jan 19, 2024 •

edited

Loading

pjd commented Jan 19, 2024

rrevans commented Jan 19, 2024

behlendorf commented Jan 19, 2024

rrevans commented Jan 20, 2024

behlendorf commented Feb 1, 2024

rrevans commented Feb 1, 2024 •

edited

Loading

BRT: Linux FICLONE truncates large files with dirty blocks #15728

BRT: Linux FICLONE truncates large files with dirty blocks #15728

Comments

rrevans commented Jan 2, 2024

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

rrevans commented Jan 18, 2024

Vlad1mir-D commented Jan 19, 2024

pjd commented Jan 19, 2024 • edited Loading

pjd commented Jan 19, 2024

rrevans commented Jan 19, 2024

behlendorf commented Jan 19, 2024

rrevans commented Jan 20, 2024

behlendorf commented Feb 1, 2024

rrevans commented Feb 1, 2024 • edited Loading

pjd commented Jan 19, 2024 •

edited

Loading

rrevans commented Feb 1, 2024 •

edited

Loading