
SEEK_DATA / SEEK_HOLE do not work as expected with RO if there are pending atime updates #6958

Closed
cwedgwood opened this issue Dec 14, 2017 · 14 comments

@cwedgwood
Contributor

cwedgwood commented Dec 14, 2017

Type | Version/Name
--- | ---
Distribution Name |
Distribution Version |
Linux Kernel | 4.14.x
Architecture | amd64
ZFS Version | ba88ef6bb
SPL Version | ed19bcc

Describe the problem you're observing

When using SEEK_DATA & SEEK_HOLE to map a file, it works robustly if the file is not being updated (open elsewhere for write).

When it is open elsewhere for write and being updated, it's common to get wrong results, often a single data segment covering [0, len].

Describe how to reproduce the problem

Create a sparse file, verify SEEK_DATA/SEEK_HOLE are working...

Read the file (anywhere) and SEEK_DATA/SEEK_HOLE no longer function until the atime update from the previous access has been flushed.

There is a (weak) argument for this behaviour in the case of pending file content updates, but i feel it's even weaker when it's just caused by a pending atime update from RO access.
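
For reference, a minimal sketch (Python 3, Linux-only) of the kind of SEEK_DATA/SEEK_HOLE mapping loop described above; this is not the reporter's actual seeker tool, just an illustration of the mapping being tested:

```python
#!/usr/bin/env python3
# Sketch: walk a file with SEEK_DATA/SEEK_HOLE and print its segments.
import errno
import os
import sys

def map_segments(path):
    """Return a list of ("DATA"|"HOLE", start, end) tuples for path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        segments = []
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:        # only a hole remains
                    segments.append(("HOLE", offset, size))
                    break
                raise
            if data > offset:                      # hole before the data run
                segments.append(("HOLE", offset, data))
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            segments.append(("DATA", data, hole))
            offset = hole
        return segments
    finally:
        os.close(fd)

if __name__ == "__main__":
    segs = map_segments(sys.argv[1])
    for kind, start, end in segs:
        print(f"{kind:4s} {start:12d} {end:12d} ({end - start:12d})")
    data_total = sum(e - s for k, s, e in segs if k == "DATA")
    size = os.path.getsize(sys.argv[1]) or 1
    print(f"{100.0 * data_total / size:.1f}% occupied in "
          f"{sum(1 for k, _, _ in segs if k == 'DATA')} segments")
```

Against a freshly created sparse file this reports the expected holes; run it again immediately after reading the file and, as described above, the whole file collapses into a single data segment until the pending update has been synced.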

@bunder2015
Contributor

bunder2015 commented Dec 14, 2017

@cwedgwood
Contributor Author

@bunder2015 no, not gentoo; zfs from the repo as per the commit ids above

the release i'm using contains the above-referenced fix (454365b)

@behlendorf
Contributor

@cwedgwood this behavior is an intentional design compromise, see #4306.

The problem is that when a file is open and contains dirty blocks it's difficult to determine exactly how those blocks will be written prior to actually writing them. New holes may be created, other holes filled, etc. In order to handle this dirty case gracefully ZFS falls back to the allowed safe behavior as described in the lseek(2) man page.

  For the purposes of these operations, a hole is a sequence of zeros that
  (normally) has not been allocated in the underlying file storage. However,
  a file system is not obliged to report holes, so these operations are not
  a guaranteed mechanism for mapping the storage space actually allocated to
  a file. (Furthermore, a sequence of zeros that actually has been written
  to the underlying storage may not be reported as a hole.) In the simplest
  implementation, a file system can support the operations by making
  SEEK_HOLE always return the offset of the end of the file, and making
  SEEK_DATA always return offset (i.e., even if the location referred to by
  offset is a hole, it can be considered to consist of data that is a
  sequence of zeros).

When the file isn't dirty and it's possible for the filesystem to give an authoritative answer it does so. It will never report data as a hole.

If you absolutely always need to have holes reported, you can set the zfs_dmu_offset_next_sync module option to 1. This forces the file to be synced to disk so correct hole information can be returned, at a potential performance penalty depending on your exact workload.
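
For anyone wanting to try that, a hedged sketch of flipping the option at runtime; it assumes the usual Linux sysfs location for ZFS module parameters (a persistent setting would normally go in /etc/modprobe.d instead):

```python
# Sketch: toggle zfs_dmu_offset_next_sync at runtime (needs root).
# Assumes the standard /sys/module/zfs/parameters/ location.
from pathlib import Path

PARAM = Path("/sys/module/zfs/parameters/zfs_dmu_offset_next_sync")

def set_strict_hole_reporting(enabled: bool) -> None:
    # 1 = strict: force a sync so hole information is accurate (slower)
    # 0 = fast:   dirty files fall back to "one big data segment"
    PARAM.write_text("1\n" if enabled else "0\n")

if __name__ == "__main__":
    set_strict_hole_reporting(True)
    print(PARAM.read_text().strip())
```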

@cwedgwood
Contributor Author

@behlendorf it's always going to be vulnerable to races, so i think it's reasonable for it to give some 'best effort' approximation

on testing, i'm not entirely sure it's a case of dirty data ... it looks like it happens with recent reads (no writes) inside of a few seconds

i will try and get example code showing this (what I have now needs to be pared back)

@behlendorf
Contributor

I think it's reasonable for it to give some 'best effort' approximation

Yes, this could probably be further optimized to, say, only apply to the dirty regions of a given file. The thing is, we need to be absolutely certain that a hole is never reported where data is about to be written. Doing so can confuse some of the system tools like cp, with unfortunate results, so we opted for the more conservative approach initially.

it looks like it happens with recent reads (no writes) inside of a few seconds

The window here is a full txg sync, so up to about 5s without writes is needed.
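
Given that window, one hypothetical userspace workaround (not something suggested verbatim in this thread) is to re-map after a txg-sized delay and only accept the result once two passes agree; map_segments() here is the helper from the earlier sketch:

```python
# Hypothetical best-effort workaround: re-map until two passes agree,
# waiting roughly one txg sync (~5 s, per the comment above) between passes.
# map_segments() is the helper from the earlier sketch.
import time

def map_when_stable(path, delay=5.0, attempts=3):
    prev = map_segments(path)
    for _ in range(attempts):
        time.sleep(delay)
        cur = map_segments(path)
        if cur == prev:          # unchanged across a txg window
            return cur
        prev = cur
    return prev                  # give up and return the latest view
```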

@cwedgwood
Contributor Author

@behlendorf you cannot prevent that in all cases for zfs or any other fs ... why pretend? updates/changes can always race with cp; if tools are to optimize in these cases they need to map afterwards, look at mtime/ctime, retry, error, cry or self-destruct as appropriate

ignoring rw/updates for now though

it happens on read-only access; here i'm using file(1) to read the head of the file after counting the segments

$ ./seeker | grep segments
79.4% occupied in 427 segments

$ file /tank0/imagenet/ILSVRC2012_img_train.tar
/tank0/imagenet/ILSVRC2012_img_train.tar: POSIX tar archive (GNU)

$ ./seeker | grep segments
100.0% occupied in 1 segments

$ # delay a few seconds

$ ./seeker | grep segments
79.4% occupied in 427 segments

are you sure there isn't some lifetime issue between state in the linux vfs and the underlying zpl, such that after a read (seeking doesn't seem to matter) we end up with something that confuses things immediately afterwards?

@cwedgwood
Contributor Author

@behlendorf btw, i'm aware i haven't shown the code that produces this; it's on me to make something smaller showing the issue

@behlendorf
Contributor

updates/changes can always race with cp;

Sure, no argument there. The best we can do is make sure the system call is handled atomically.

ending up with something that confuses things

Can you try disabling atime updates? Those could be dirtying the dnode on a read, which might explain what you're seeing.
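
One way to test that theory from the reading side, sketched below: disable atime on the dataset (`zfs set atime=off <dataset>`), or open the file with O_NOATIME so the read itself never schedules an atime update:

```python
# Sketch: read without scheduling an atime update by opening with O_NOATIME.
# Linux-only; the caller must own the file (or have CAP_FOWNER) or open()
# fails with EPERM.
import os

def read_head_noatime(path, nbytes=4096):
    fd = os.open(path, os.O_RDONLY | os.O_NOATIME)
    try:
        return os.read(fd, nbytes)
    finally:
        os.close(fd)
```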

@ryao
Contributor

ryao commented Dec 15, 2017

This sounds like a consequence of the optimization that made it into 0.7, which improved performance by not waiting on the txg commit. It was not clear to me at the time how to avoid that while maintaining a correct view, and it still isn't. Maybe we could let the user decide via a kernel module parameter or dataset property. The behavior can be either fast or strict. My vote is for a module parameter unless the other platforms feel it should be a dataset option.

@behlendorf
Contributor

@ryao the zfs_dmu_offset_next_sync module option does exactly this: 0=fast, 1=strict.

@ryao
Contributor

ryao commented Dec 15, 2017

@behlendorf I forgot and neglected to read through the entire thread. Thanks for refreshing my memory.

@cwedgwood Did you try setting zfs_dmu_offset_next_sync=1 on the module? Your reply does not make it clear whether you did or did not. As for updates racing with cp, you might want to consider doing a snapshot of the dataset and copying from the snapshot directory in .zfs.
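
A sketch of the snapshot approach, for what it's worth; the snapshot name and the dataset/mountpoint arguments are made-up examples, and map_segments() is the helper from the earlier sketch:

```python
# Sketch: map holes from an immutable snapshot view under .zfs/snapshot/,
# so there is nothing dirty to race against.  Names below are examples only.
import subprocess
from pathlib import Path

def map_via_snapshot(dataset, mountpoint, relpath, snapname="holemap"):
    subprocess.run(["zfs", "snapshot", f"{dataset}@{snapname}"], check=True)
    try:
        snap_file = Path(mountpoint, ".zfs", "snapshot", snapname, relpath)
        return map_segments(snap_file)     # helper from the earlier sketch
    finally:
        subprocess.run(["zfs", "destroy", f"{dataset}@{snapname}"], check=True)

# e.g. map_via_snapshot("tank0/imagenet", "/tank0/imagenet",
#                       "ILSVRC2012_img_train.tar")
```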

@cwedgwood
Contributor Author

@behlendorf

it looks like (pending) atime updates are causing the file to be dirty, and SEEK_DATA/SEEK_HOLE therefore don't work as i expected; i will edit the issue to reflect this

code showing the problem: https://github.com/cwedgwood/unholey

i have not tested setting zfs_dmu_offset_next_sync=1 (i will)

@cwedgwood changed the title from "SEEK_DATA / SEEK_HOLE doesn't work on active files" to "SEEK_DATA / SEEK_HOLE do not work as expected with RO if there are pending atime updates" on Dec 15, 2017
@cwedgwood
Contributor Author

@ryao fwiw i'm not using cp, i suspect it was mentioned as a relevant example of where a race might occur and cause problems

using a snapshot isn't practical to 'map' a file for holes; the best i can do at present is to put in logic to detect cases where zfs is involved and check the atime as well as the mtime (i didn't test ctime, for example in the case of a rename; i suspect that might be a problem too)
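
A rough sketch of that detection logic (hypothetical, reusing map_segments() from the earlier sketch): map the file, then re-check atime/mtime/ctime and retry if anything changed while the mapping was being taken:

```python
# Hypothetical sketch: treat the mapping as suspect if any timestamp
# (atime included) changed while it was being taken, and retry.
import os

def map_with_timestamp_check(path, retries=3):
    for _ in range(retries):
        before = os.stat(path)
        segments = map_segments(path)     # helper from the earlier sketch
        after = os.stat(path)
        if (before.st_atime_ns, before.st_mtime_ns, before.st_ctime_ns) == \
           (after.st_atime_ns, after.st_mtime_ns, after.st_ctime_ns):
            return segments
    raise RuntimeError("file kept changing while being mapped")
```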

@behlendorf added this to the 0.8.0 milestone on Dec 15, 2017
@Toasterson

@cwedgwood I assume it was you who asked yesterday if somebody could run this on Illumos. Anyway, the code itself does not run as-is on Illumos; there is no NO_ATIME constant available in the OS.

The output when removing the constant is:

newly created
-------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

after a delay
-------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

immediately after a read
------------------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

delay...
--------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

immediately after a read using NOATIME
--------------------------------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

Hope this helps
Greetings
Toasterson

behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 15, 2019
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime), holes are still reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6958
c0d3z3r0 pushed a commit to c0d3z3r0/zfs that referenced this issue Mar 16, 2019
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime), holes are still reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6958
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2019
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime), holes are still reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6958