
SEEK_DATA / SEEK_HOLE do not work as expected with RO if there are pending atime updates #6958

Closed
cwedgwood opened this issue Dec 14, 2017 · 14 comments

@cwedgwood
Contributor

cwedgwood commented Dec 14, 2017

Type | Version/Name
--- | ---
Distribution Name |
Distribution Version |
Linux Kernel | 4.14.x
Architecture | amd64
ZFS Version | ba88ef6bb
SPL Version | ed19bcc

Describe the problem you're observing

When using SEEK_DATA & SEEK_HOLE to map a file, it works robustly if the file is not being updated (open elsewhere for write).

When it is open elsewhere for write and being updated, it's common to get wrong results, often a single data segment covering [0, len].

Describe how to reproduce the problem

Create a sparse file, verify SEEK_DATA/SEEK_HOLE are working...

Read the file (anywhere) and SEEK_DATA/SEEK_HOLE no longer function until the atime update from the previous access has been flushed.

There is a (weak) argument for this behaviour in the case of pending file content updates, but i feel it's even weaker when it's just caused by a pending atime update from RO access.
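
For reference, a minimal sketch (Python 3, Linux-only) of the kind of SEEK_DATA/SEEK_HOLE mapping loop described above; this is not the reporter's actual seeker tool, just an illustration of the mapping being tested:

```python
#!/usr/bin/env python3
# Sketch: walk a file with SEEK_DATA/SEEK_HOLE and print its segments.
import errno
import os
import sys

def map_segments(path):
    """Return a list of ("DATA"|"HOLE", start, end) tuples for path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        segments = []
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:        # only a hole remains
                    segments.append(("HOLE", offset, size))
                    break
                raise
            if data > offset:                      # hole before the data run
                segments.append(("HOLE", offset, data))
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            segments.append(("DATA", data, hole))
            offset = hole
        return segments
    finally:
        os.close(fd)

if __name__ == "__main__":
    segs = map_segments(sys.argv[1])
    for kind, start, end in segs:
        print(f"{kind:4s} {start:12d} {end:12d} ({end - start:12d})")
    data_total = sum(e - s for k, s, e in segs if k == "DATA")
    size = os.path.getsize(sys.argv[1]) or 1
    print(f"{100.0 * data_total / size:.1f}% occupied in "
          f"{sum(1 for k, _, _ in segs if k == 'DATA')} segments")
```

Against a freshly created sparse file this reports the expected holes; run it again immediately after reading the file and, as described above, the whole file collapses into a single data segment until the pending update has been synced.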

@bunder2015
Contributor

bunder2015 commented Dec 14, 2017

@cwedgwood
Contributor Author

@bunder2015 no, not gentoo; zfs from the repo as per the commit ids above

the release i'm using contains the above-referenced fix (454365b)

@behlendorf
Contributor

@cwedgwood this behavior is an intentional design compromise, see #4306.

The problem is that when a file is open and contains dirty blocks it's difficult to determine exactly how those blocks will be written prior to actually writing them. New holes may be created, other holes filled, etc. In order to handle this dirty case gracefully ZFS falls back to the allowed safe behavior as described in the lseek(2) man page.

  For the purposes of these operations, a hole is a sequence of zeros that
  (normally) has not been allocated in the underlying file storage. However,
  a file system is not obliged to report holes, so these operations are not
  a guaranteed mechanism for mapping the storage space actually allocated to
  a file. (Furthermore, a sequence of zeros that actually has been written
  to the underlying storage may not be reported as a hole.) In the simplest
  implementation, a file system can support the operations by making
  SEEK_HOLE always return the offset of the end of the file, and making
  SEEK_DATA always return offset (i.e., even if the location referred to by
  offset is a hole, it can be considered to consist of data that is a
  sequence of zeros).

When the file isn't dirty and it's possible for the filesystem to give an authoritative answer it does so. It will never report data as a hole.

If you absolutely always need to have holes reported, you can set the zfs_dmu_offset_next_sync module option to 1. This forces the file to be synced to disk so correct hole information can be returned, at a potential performance penalty depending on your exact workload.
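
For anyone wanting to try that, a hedged sketch of flipping the option at runtime; it assumes the usual Linux sysfs location for ZFS module parameters (a persistent setting would normally go in /etc/modprobe.d instead):

```python
# Sketch: toggle zfs_dmu_offset_next_sync at runtime (needs root).
# Assumes the standard /sys/module/zfs/parameters/ location.
from pathlib import Path

PARAM = Path("/sys/module/zfs/parameters/zfs_dmu_offset_next_sync")

def set_strict_hole_reporting(enabled: bool) -> None:
    # 1 = strict: force a sync so hole information is accurate (slower)
    # 0 = fast:   dirty files fall back to "one big data segment"
    PARAM.write_text("1\n" if enabled else "0\n")

if __name__ == "__main__":
    set_strict_hole_reporting(True)
    print(PARAM.read_text().strip())
```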

@cwedgwood
Contributor Author

@behlendorf it's always going to be vulnerable to races, so i think it's reasonable for it to give some 'best effort' approximation

on testing, i'm not entirely sure it's a case of dirty data ... it looks like it happens with recent reads (no writes) inside of a few seconds

i will try and get example code showing this (what I have now needs to be pared back)

@behlendorf
Contributor

I think it's reasonable for it to give some 'best effort' approximation

Yes, this could probably be further optimized to, say, only apply to the dirty regions of a given file. The thing is, we need to be absolutely certain that a hole is never reported where data is about to be written. Doing so can confuse some of the system tools like cp, with unfortunate results, so we opted for the more conservative approach initially.

it looks like it happens with recent reads (no writes) inside of a few seconds

The window here is a full txg sync, so up to about 5s without writes is needed.
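
Given that window, one hypothetical userspace workaround (not something suggested verbatim in this thread) is to re-map after a txg-sized delay and only accept the result once two passes agree; map_segments() here is the helper from the earlier sketch:

```python
# Hypothetical best-effort workaround: re-map until two passes agree,
# waiting roughly one txg sync (~5 s, per the comment above) between passes.
# map_segments() is the helper from the earlier sketch.
import time

def map_when_stable(path, delay=5.0, attempts=3):
    prev = map_segments(path)
    for _ in range(attempts):
        time.sleep(delay)
        cur = map_segments(path)
        if cur == prev:          # unchanged across a txg window
            return cur
        prev = cur
    return prev                  # give up and return the latest view
```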

@cwedgwood
Contributor Author

@behlendorf you cannot prevent that in all cases for zfs or any other fs ... why pretend? updates/changes can always race with cp; if tools are to optimize in these cases they need to map afterwards, look at mtime/ctime, retry, error, cry or self-destruct as appropriate

ignoring rw/updates for now though

it happens on read-only access; here i'm using file(1) to read the head of the file after counting the segments

$ ./seeker | grep segments
79.4% occupied in 427 segments

$ file /tank0/imagenet/ILSVRC2012_img_train.tar
/tank0/imagenet/ILSVRC2012_img_train.tar: POSIX tar archive (GNU)

$ ./seeker | grep segments
100.0% occupied in 1 segments

$ # delay a few seconds

$ ./seeker | grep segments
79.4% occupied in 427 segments

are you sure there isn't some lifetime issue between state in the linux vfs and the underlying zpl, such that after a read (seeking doesn't seem to matter) we end up with something that confuses things immediately afterwards?

@cwedgwood
Contributor Author

@behlendorf btw, i'm aware i haven't shown the code that produces this; it's on me to make something smaller showing the issue

@behlendorf
Contributor

updates/changes can always race with cp;

Sure, no argument there. The best we can do is make sure the system call is handled atomically.

ending up with something that confuses things

Can you try disabling atime updates? Those could be dirtying the dnode on a read, which might explain what you're seeing.
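
One way to test that theory from the reading side, sketched below: disable atime on the dataset (`zfs set atime=off <dataset>`), or open the file with O_NOATIME so the read itself never schedules an atime update:

```python
# Sketch: read without scheduling an atime update by opening with O_NOATIME.
# Linux-only; the caller must own the file (or have CAP_FOWNER) or open()
# fails with EPERM.
import os

def read_head_noatime(path, nbytes=4096):
    fd = os.open(path, os.O_RDONLY | os.O_NOATIME)
    try:
        return os.read(fd, nbytes)
    finally:
        os.close(fd)
```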

@ryao
Contributor

ryao commented Dec 15, 2017

This sounds like a consequence of the optimization that made it into 0.7, which improved performance by not waiting on the txg commit. It was not clear to me at the time how to avoid that while maintaining a correct view, and it still isn't. Maybe we could let the user decide via a kernel module parameter or dataset property. The behavior can be either fast or strict. My vote is for a module parameter unless the other platforms feel it should be a dataset option.

@behlendorf
Contributor

@ryao the zfs_dmu_offset_next_sync module option does exactly this: 0=fast, 1=strict.

@ryao
Contributor

ryao commented Dec 15, 2017

@behlendorf I forgot and neglected to read through the entire thread. Thanks for refreshing my memory.

@cwedgwood Did you try setting zfs_dmu_offset_next_sync=1 on the module? Your reply does not make it clear whether you did or did not. As for updates racing with cp, you might want to consider doing a snapshot of the dataset and copying from the snapshot directory in .zfs.
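
A sketch of the snapshot approach, for what it's worth; the snapshot name and the dataset/mountpoint arguments are made-up examples, and map_segments() is the helper from the earlier sketch:

```python
# Sketch: map holes from an immutable snapshot view under .zfs/snapshot/,
# so there is nothing dirty to race against.  Names below are examples only.
import subprocess
from pathlib import Path

def map_via_snapshot(dataset, mountpoint, relpath, snapname="holemap"):
    subprocess.run(["zfs", "snapshot", f"{dataset}@{snapname}"], check=True)
    try:
        snap_file = Path(mountpoint, ".zfs", "snapshot", snapname, relpath)
        return map_segments(snap_file)     # helper from the earlier sketch
    finally:
        subprocess.run(["zfs", "destroy", f"{dataset}@{snapname}"], check=True)

# e.g. map_via_snapshot("tank0/imagenet", "/tank0/imagenet",
#                       "ILSVRC2012_img_train.tar")
```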

@cwedgwood
Contributor Author

@behlendorf

it looks like (pending) atime updates are causing the file to be dirty, and SEEK_DATA/SEEK_HOLE therefore don't work as i expected; i will edit the issue to reflect this

code showing the problem: https://github.com/cwedgwood/unholey

i have not tested setting zfs_dmu_offset_next_sync=1 (i will)

@cwedgwood changed the title from "SEEK_DATA / SEEK_HOLE doesn't work on active files" to "SEEK_DATA / SEEK_HOLE do not work as expected with RO if there are pending atime updates" on Dec 15, 2017
@cwedgwood
Contributor Author

@ryao fwiw i'm not using cp, i suspect it was mentioned as a relevant example of where a race might occur and cause problems

using a snapshot isn't practical to 'map' a file for holes; the best i can do at present is to put in logic to detect cases where zfs is involved and check the atime as well as the mtime (i didn't test ctime, for example in the case of a rename; i suspect that might be a problem too)
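
A rough sketch of that detection logic (hypothetical, reusing map_segments() from the earlier sketch): map the file, then re-check atime/mtime/ctime and retry if anything changed while the mapping was being taken:

```python
# Hypothetical sketch: treat the mapping as suspect if any timestamp
# (atime included) changed while it was being taken, and retry.
import os

def map_with_timestamp_check(path, retries=3):
    for _ in range(retries):
        before = os.stat(path)
        segments = map_segments(path)     # helper from the earlier sketch
        after = os.stat(path)
        if (before.st_atime_ns, before.st_mtime_ns, before.st_ctime_ns) == \
           (after.st_atime_ns, after.st_mtime_ns, after.st_ctime_ns):
            return segments
    raise RuntimeError("file kept changing while being mapped")
```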

@behlendorf added this to the 0.8.0 milestone on Dec 15, 2017
@Toasterson

@cwedgwood I assume it was you who asked yesterday if somebody could run this on Illumos. Anyway, the code itself does not run as-is on Illumos; there is no NO_ATIME constant available in the OS.

The output when removing the constant is:

newly created
-------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

after a delay
-------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

immediately after a read
------------------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

delay...
--------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

immediately after a read using NOATIME
--------------------------------------
HOLE            0      1048576 (     1048576)
DATA      1048576      1179648 (      131072)
HOLE      1179648      2097152 (      917504)
DATA      2097152      2228224 (      131072)           4 segments

Hope this helps
Greetings
Toasterson

behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 15, 2019
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime), holes are still reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6958
c0d3z3r0 pushed a commit to c0d3z3r0/zfs that referenced this issue Mar 16, 2019
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime), holes are still reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6958
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2019
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime), holes are still reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6958