handle sparse files #256

Open
ThomasWaldmann opened this issue Mar 31, 2015 · 18 comments

Comments

@ThomasWaldmann
Contributor

Maybe sparse files could be dealt with intelligently (not just back up holes as lots of zeros, and not extract holes as lots of zeros, but as sparse files).

http://librelist.com/browser//attic/2014/11/28/handling-of-sparse-files/#0aa400e0ada2cc4ec8656310cff938d1

@ThomasWaldmann
Contributor Author

http://git.liw.fi/cgi-bin/cgit/cgit.cgi/obnam/commit/?id=323a26378dbe04eee35eb5bfa856fb9d3d03a5c9

Extract: Like that, but preferably without creating lots of zero-string garbage. Looks like this will always create sparse files if it detects whole-zero chunks. Is this a problem?

Create: zeros will be deduplicated and compressed, so maybe no special handling needed.

@horihel

horihel commented Apr 1, 2015

Scanning, deduplicating and compressing gigabytes of zeroes is slow. Special handling would greatly speed up processing of, for example, VM images.

@kamalmarhubi

> Looks like this will always create sparse files if it detects whole-zero chunks. Is this a problem?

This could be a problem if it's important that the file be contiguous. Ideally the sparseness is independent of file contents.

@ThomasWaldmann
Contributor Author

@kamalmarhubi ah, right. If you intentionally create a non-sparse vm raw disk image, you don't want to have it made sparse by your backup/restore. So maybe the obnam way is not quite right in that respect.

@horihel it's not that slow, at least not with lz4. :)

BTW, I do not yet know how to detect the holes and that would be needed to differentiate hole zeros from real, on-disk zeros.

@kamalmarhubi

> BTW, I do not yet know how to detect the holes and that would be needed to differentiate hole zeros from real, on-disk zeros.

I'm not sure either, and it would likely be at least OS-dependent, and perhaps FS-dependent. I did come across this discussion on StackOverflow which points out some approaches:
http://stackoverflow.com/questions/21499451/c-linux-sparse-file-how-to-check-if-file-is-sparse-and-print-0-filled-disk-bl

@kamalmarhubi

Ah, the stat field names are the same on Linux and FreeBSD (st_size, st_blksize, and st_blocks), so that's promising:

http://linux.die.net/man/2/stat
https://www.freebsd.org/cgi/man.cgi?query=stat&sektion=2
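
The stat-based check could look roughly like the sketch below. It is only a heuristic: it assumes `st_blocks` is counted in 512-byte units (as documented for Linux and FreeBSD), and the function name `looks_sparse` is made up for illustration. It can tell that a file probably has holes, not where they are, and filesystems with transparent compression may give misleading results.

```python
import os


def looks_sparse(path):
    """Heuristic: True if fewer bytes are allocated than the apparent size."""
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units, independent of st_blksize.
    allocated = st.st_blocks * 512
    return allocated < st.st_size
```

For example, `looks_sparse("/vm_disks/disk.raw")` (a hypothetical path) would return True for a mostly unallocated raw image.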

@horihel

horihel commented Apr 1, 2015

Well, for sparse extraction (and many other things) I always look at the "gold" standard of backups: tar.

It looks like tar marks sparse blocks explicitly, so upon recreation only the parts/files that were sparse in the original will be sparse on extraction.
This is sensible, because in the case of libvirt (for example) the VM images are mostly sparse, but some parts of the file will be preallocated. If attic took the dumb approach of just creating all zeroes sparsely, the preallocations would be lost.

@horihel

horihel commented Apr 1, 2015

Upon reading the doc a second time, I'm actually not so sure any more whether tar is that smart :)

@kamalmarhubi

@ThomasWaldmann
Contributor Author

Considering the importance of this for VM backups / restores, I'll work on this next.

@kamalmarhubi

Do you have thoughts on how you'll go about it? Near as I can tell, the best you can do for generic sparse file detection is looking at stat output. This will tell you that a file has holes, but not where. Anything better than that seems to require FS-specific code / tools, like dump / restore.

For frequent VM users: is it likely to matter if an allocated block of zeros gets replaced with an unallocated block on restore?

@ThomasWaldmann
Contributor Author

detect (or not, maybe not needed) via stat.*

lseek with SEEK_HOLE and SEEK_DATA to find holes and data - does require a recent (>=3.8) Linux kernel.

holes -> low space usage / space grows on demand

no holes, but zeros -> contiguous block allocation on disk, might have better perf.
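
A minimal sketch of walking a file's data and hole areas with `os.lseek` (Python 3.3+, on platforms that expose `os.SEEK_DATA` and `os.SEEK_HOLE`). The function name `iter_segments` is hypothetical, and the sketch assumes nobody modifies the file while it is scanned. On filesystems without hole support, the kernel typically reports the whole file as one data segment.

```python
import errno
import os


def iter_segments(path):
    """Yield (kind, start, end) tuples; kind is 'data' or 'hole'."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                data_start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno != errno.ENXIO:
                    raise
                # ENXIO: no data after pos, so the rest is a trailing hole.
                yield ('hole', pos, size)
                break
            if data_start > pos:
                yield ('hole', pos, data_start)
            # There is an implicit hole at end of file, so this finds one.
            data_end = os.lseek(fd, data_start, os.SEEK_HOLE)
            yield ('data', data_start, data_end)
            pos = data_end
    finally:
        os.close(fd)
```

The segments are contiguous and cover the file from offset 0 to its size, so a backup tool could read only the `'data'` ranges and record the `'hole'` ranges as lengths.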

@kamalmarhubi

Oh very nice and good to know about.

@ThomasWaldmann
Contributor Author

A little problem (esp. concerning compatibility) is that attic just stores raw file data into the chunks and the sum of chunks is the file's content. There is no chunk metadata.

For the sparse file support (including being able to restore sparse files to the exact same state as they were found), it would need some metadata, e.g. like this:

  • chunk := hole_length=0, data # for data, length implicitly given
  • chunk := hole_length=N (for a hole of N zero bytes)

Any better ideas?
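
To make the idea concrete, here is one way such a chunk layout could be serialized. This is purely an illustration: the 8-byte little-endian header and the names `HEADER`, `encode_chunk` and `decode_chunk` are assumptions, not attic's actual on-disk format.

```python
import struct

# Hypothetical header: hole length as an unsigned 64-bit little-endian integer.
HEADER = struct.Struct('<Q')


def encode_chunk(data=b'', hole_length=0):
    """Serialize either a data chunk (hole_length=0) or a hole of N zero bytes."""
    if hole_length and data:
        raise ValueError('a chunk is either a hole or data, not both')
    return HEADER.pack(hole_length) + data


def decode_chunk(blob):
    """Return (hole_length, data); data length is implied by the blob size."""
    (hole_length,) = HEADER.unpack_from(blob)
    return hole_length, blob[HEADER.size:]
```

A hole of any size then costs only the 8 header bytes, while every data chunk grows by 8 bytes, which is the compatibility problem mentioned above.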

@ThomasWaldmann
Contributor Author

I wrote some code that reads all (sparse) files given as arguments (read-only, avoiding spoiling the OS cache) and prints out the data and hole areas.

You could do me a favour and run it on your sparse files (e.g. VM disk images) - especially if you run some other OS than I do:

python3.3 sparsetest.py /vm_disks/*.raw

And then just check whether it raises any assertion errors or reports anything unexpected.

It works on Python 3.3+ (on 3.2, it will not find holes - that is expected) and Ubuntu Linux 14.04.

http://paste.thinkmo.de/jzMCGoCx#sparsetest.py

@JuergenBS

Tested against KVM qcow2 sparse file on Debian Jessie with python 3.4.2-2.
No assertion errors and nothing unexpected.

You are currently trying to reproduce an exact copy of the sparse file. Have you thought about an approximation of a sparse file? It would be possible to precalculate the hashes of all-zero-bytes-chunks.
If sparse file handling is enabled by the user, attic could restore those hashes as holes.
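
The precalculation could be sketched as follows. It assumes chunk IDs are plain SHA-256 digests of the chunk content and that the possible lengths of all-zero chunks are known in advance; both are assumptions, since attic's real chunk IDs may be computed differently (for example with a key). The name `zero_chunk_ids` is made up.

```python
import hashlib


def zero_chunk_ids(sizes):
    """Return the set of SHA-256 digests of all-zero chunks of the given sizes."""
    return {hashlib.sha256(bytes(n)).digest() for n in sizes}
```

On restore, a chunk whose ID is in this set could be turned into a hole without fetching or decompressing it.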

@ThomasWaldmann
Contributor Author

@JuergenBS yes, the reasons are already outlined above.

@ThomasWaldmann
Contributor Author

As the exact reproduction of the holes of sparse files would need the above-mentioned deeper and bigger changes in attic, I first implemented a simpler approach that just restores all-zero chunks as sparse - no matter how they were originally represented.

See PR #284.

Update: there is now a --sparse command-line option to choose between no sparse files (default) and sparse files (--sparse) when restoring all-zero chunks.
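
The general technique of restoring all-zero chunks as holes can be sketched like this. It is not the code from the PR, just a minimal illustration; `write_chunks` and its `sparse` parameter are hypothetical names.

```python
import os


def write_chunks(f, chunks, sparse=False):
    """Write chunks to the binary file object f.

    With sparse=True, all-zero chunks are skipped by seeking forward, which
    leaves a hole on filesystems that support sparse files.
    """
    for chunk in chunks:
        if sparse and chunk == bytes(len(chunk)):
            f.seek(len(chunk), os.SEEK_CUR)
        else:
            f.write(chunk)
    # Seeking alone does not extend the file, so set the final size
    # explicitly in case the last chunk was turned into a hole.
    f.truncate()
```

The restored content is byte-for-byte identical either way; only the on-disk allocation differs, which is why zeros that were really allocated in the original can come back as holes.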
