handle sparse files #256

ThomasWaldmann · 2015-03-31T22:36:28Z

Maybe sparse files could be dealt with intelligently (not backing up holes as lots of zeros, and extracting them as sparse files again rather than as lots of zeros).

http://librelist.com/browser//attic/2014/11/28/handling-of-sparse-files/#0aa400e0ada2cc4ec8656310cff938d1

ThomasWaldmann · 2015-04-01T00:47:14Z

http://git.liw.fi/cgi-bin/cgit/cgit.cgi/obnam/commit/?id=323a26378dbe04eee35eb5bfa856fb9d3d03a5c9

Extract: Like that, but rather avoid creating lots of zeros-string garbage. Looks like this will always create sparse files if it detects whole-zero chunks. Is this a problem?

Create: zeros will be deduplicated and compressed, so maybe no special handling needed.

horihel · 2015-04-01T13:39:47Z

scanning, deduplicating and compressing gigabytes of zeroes is slow. special handling would speed up handling of (for example) VM-images greatly.

kamalmarhubi · 2015-04-01T14:38:15Z

> Looks like this will always create sparse files if it detects whole-zero chunks. Is this a problem?

This could be a problem if it's important that the file be contiguous. Ideally the sparseness is independent of file contents.

ThomasWaldmann · 2015-04-01T14:44:23Z

@kamalmarhubi ah, right. If you intentionally create a non-sparse vm raw disk image, you don't want to have it made sparse by your backup/restore. So maybe the obnam way is not quite right in that respect.

@horihel it's not that slow, at least not with lz4. :)

BTW, I do not yet know how to detect the holes and that would be needed to differentiate hole zeros from real, on-disk zeros.

kamalmarhubi · 2015-04-01T14:49:29Z

> BTW, I do not yet know how to detect the holes and that would be needed to differentiate hole zeros from real, on-disk zeros.

I'm not sure either, and it would likely be at least OS-dependent, and perhaps FS-dependent. I did come across this discussion on StackOverflow which points out some approaches:
http://stackoverflow.com/questions/21499451/c-linux-sparse-file-how-to-check-if-file-is-sparse-and-print-0-filled-disk-bl

kamalmarhubi · 2015-04-01T14:54:09Z

Ah, the stat field names are the same on Linux and FreeBSD (st_size, st_blksize, and st_blocks), so that's promising:

http://linux.die.net/man/2/stat
https://www.freebsd.org/cgi/man.cgi?query=stat&sektion=2
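A minimal sketch of how those stat fields could be used to guess whether a file is sparse. The function names are made up for illustration, and the check is only a heuristic: it assumes st_blocks counts 512-byte units (as on Linux and FreeBSD), and filesystem compression or delayed allocation can give misleading results.

```python
import os


def allocated_bytes(path):
    """Bytes actually allocated on disk, as reported by stat."""
    # Assumption: st_blocks counts 512-byte units (Linux, FreeBSD),
    # independent of st_blksize.
    return os.stat(path).st_blocks * 512


def looks_sparse(path):
    """Heuristic: True if fewer bytes are allocated than the file is long."""
    return allocated_bytes(path) < os.stat(path).st_size
```

This only says that a file probably has holes, not where they are.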

horihel · 2015-04-01T15:01:50Z

well, for sparse extraction (and many other things) I always look at the "gold" standard of backups: tar.

it looks like tar marks sparse blocks explicitly - so upon recreation only the parts/files that were sparse in the original will be sparse on extraction.
this is sensible, because in the case of libvirt (for example) the VM images are mostly sparse, but some parts of the file will be preallocated. If attic took the dumb approach, just creating all zeroes sparsely, then the preallocations would be lost.

horihel · 2015-04-01T15:03:34Z

upon reading the doc a second time i'm actually not so sure any more if tar is that smart :)

kamalmarhubi · 2015-04-01T15:06:14Z

GNU tar manual section on sparse files

ThomasWaldmann · 2015-04-02T16:03:59Z

Considering the importance of this for VM backups / restores, I'll work on this next.

kamalmarhubi · 2015-04-03T15:35:55Z

Do you have thoughts on how you'll go about it? Near as I can tell, the best you can do for generic sparse file detection is looking at stat output. This will tell you that a file has holes, but not where. Anything better than that seems to require FS-specific code / tools, like dump / restore.

For frequent VM users: is it likely to matter if an allocated block of zeros gets replaced with an unallocated block on restore?

ThomasWaldmann · 2015-04-03T16:07:12Z

detect (or not, maybe not needed) via stat.*

lseek with SEEK_HOLE and SEEK_DATA to find holes and data - requires a recent (>=3.8) Linux kernel.

holes -> low space usage / space grows on demand

no holes, but zeros -> contiguous block allocation on disk, might have better perf.
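The hole/data lookup could be sketched roughly like this in Python 3.3+, using os.lseek with os.SEEK_DATA and os.SEEK_HOLE. The function names are hypothetical, and this is not the code attic ended up with. On platforms or filesystems without hole support, the whole file is simply reported as one data segment.

```python
import errno
import os


def iter_segments(fd, size):
    """Yield ('data' | 'hole', start, end) tuples covering [0, size)."""
    if not hasattr(os, "SEEK_DATA"):
        # No hole support on this platform: treat everything as data.
        if size:
            yield ("data", 0, size)
        return
    pos = 0
    while pos < size:
        try:
            data_start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                # No data after pos: the rest of the file is a hole.
                yield ("hole", pos, size)
                return
            raise
        data_start = min(data_start, size)
        if data_start > pos:
            yield ("hole", pos, data_start)
        if data_start == size:
            return
        # There is always an implicit hole at end of file.
        data_end = min(os.lseek(fd, data_start, os.SEEK_HOLE), size)
        yield ("data", data_start, data_end)
        pos = data_end


def file_segments(path):
    """Open path read-only and return its list of segments."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return list(iter_segments(fd, os.fstat(fd).st_size))
    finally:
        os.close(fd)
```

Usage would be something like `for kind, start, end in file_segments("disk.raw"): print(kind, start, end)`.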

kamalmarhubi · 2015-04-03T16:12:29Z

Oh very nice and good to know about.

ThomasWaldmann · 2015-04-11T17:51:18Z

A little problem (esp. concerning compatibility) is that attic just stores raw file data into the chunks and the sum of chunks is the file's content. There is no chunk metadata.

For the sparse file support (including being able to restore sparse files to the exact same state as they were found), it would need some metadata, e.g. like this:

chunk := hole_length=0, data # for data, length implicitly given
chunk := hole_length=N (for a hole of N zero bytes)
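One way to read that proposal, as a rough Python sketch: each chunk is a (hole_length, data) pair, and restoring seeks over holes instead of writing zeros. The helper names and the tuple layout are assumptions for illustration only, not an attic format.

```python
import os


def hole(length):
    """A chunk describing a hole of `length` zero bytes (no data stored)."""
    return (length, b"")


def data(payload):
    """A data chunk; hole_length is 0 and the length is implicit."""
    return (0, payload)


def restore(chunks, path):
    """Write chunks to path, seeking over holes instead of writing zeros."""
    with open(path, "wb") as f:
        for hole_length, payload in chunks:
            if hole_length:
                f.seek(hole_length, os.SEEK_CUR)
            else:
                f.write(payload)
        # Extend the file to the current position, so that a trailing
        # hole is not lost.
        f.truncate()
```

Whether the skipped ranges really end up unallocated on disk depends on the OS and filesystem.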

Any better ideas?

ThomasWaldmann · 2015-04-11T18:55:44Z

I wrote some code that reads all (sparse) files given as arguments (read-only, avoiding spoiling the OS cache) and prints out the data and hole areas.

You could do me a favour and run it on your sparse files (e.g. VM disk images) - especially if you run a different OS than I do:

python3.3 sparsetest.py /vm_disks/*.raw

And then just check whether it raises any assertion errors or prints anything unexpected.

It works on Python 3.3+ (on 3.2, it will not find holes - that is expected) and Ubuntu Linux 14.04.

http://paste.thinkmo.de/jzMCGoCx#sparsetest.py

JuergenBS · 2015-04-14T14:36:12Z

Tested against KVM qcow2 sparse file on Debian Jessie with python 3.4.2-2.
No assertion errors and nothing unexpected.

You are currently trying to reproduce an exact copy of the sparse file. Have you thought about an approximation of a sparse file? It would be possible to precalculate the hashes of all-zero-bytes-chunks.
If sparse file handling is enabled by the user, attic could restore those hashes as holes.

ThomasWaldmann · 2015-04-14T16:05:19Z

@JuergenBS yes, the reasons are already outlined above.

ThomasWaldmann · 2015-04-15T12:08:29Z

As the exact reproduction of the holes of sparse files would need the above mentioned deeper and bigger changes in attic, I first implemented a simpler approach that just restores all-zero chunks as sparse - no matter how they were originally represented.

See PR #284.

Update: there is now a --sparse cmdline option to say whether one wants no sparse files (default) or sparse files (--sparse) when restoring all-zero chunks.
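For illustration, the simpler approach could look roughly like the sketch below. This is not the code from the PR; the function names and the `sparse` parameter are assumptions. The all-zero check uses bytes.count so it does not build a second zeros string per chunk.

```python
import os


def is_all_zero(chunk):
    """True if every byte in chunk is zero (also True for an empty chunk)."""
    return chunk.count(b"\0") == len(chunk)


def write_chunks(f, chunks, sparse=False):
    """Write chunks to the open binary file f.

    With sparse=True, all-zero chunks are skipped via seek, so the
    filesystem may leave a hole there, no matter how the zeros were
    stored in the original file.
    """
    for chunk in chunks:
        if sparse and chunk and is_all_zero(chunk):
            f.seek(len(chunk), os.SEEK_CUR)
        else:
            f.write(chunk)
    # Make sure the file has its full length if it ends with a hole.
    f.truncate()
```

The file contents are the same either way; only the on-disk allocation may differ.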

ThomasWaldmann mentioned this issue May 9, 2015

st_mtime_ns precision or rounding related test failure #304

Open

maltefiala mentioned this issue May 14, 2015

Dealing with attic issues borgbackup/borg#5

Closed

ThomasWaldmann mentioned this issue May 15, 2015

advanced sparse file support borgbackup/borg#14

Open
