handle sparse files #256
Comments
http://git.liw.fi/cgi-bin/cgit/cgit.cgi/obnam/commit/?id=323a26378dbe04eee35eb5bfa856fb9d3d03a5c9
Extract: Like that, but rather avoid creating lots of zeros-string garbage. Looks like this will always create sparse files if it detects whole-zero chunks. Is this a problem?
Create: zeros will be deduplicated and compressed, so maybe no special handling needed. |
scanning, deduplicating and compressing gigabytes of zeroes is slow. special handling would speed up handling of (for example) VM-images greatly. |
This could be a problem if it's important that the file be contiguous. Ideally the sparseness is independent of file contents. |
@kamalmarhubi ah, right. If you intentionally create a non-sparse vm raw disk image, you don't want to have it made sparse by your backup/restore. So maybe the obnam way is not quite right in that respect. @horihel it's not that slow, at least not with lz4. :) BTW, I do not yet know how to detect the holes and that would be needed to differentiate hole zeros from real, on-disk zeros. |
I'm not sure either, and it would likely be at least OS-dependent, and perhaps FS-dependent. I did come across this discussion on StackOverflow which points out some approaches: |
Ah, the stat field names are the same on Linux and FreeBSD ( http://linux.die.net/man/2/stat |
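A rough sketch of what a stat-based check could look like, assuming the fields in question are `st_size` and `st_blocks` (the thread does not name them, so that is a guess). The function names are made up for illustration. It only says that a file is probably sparse, not where the holes are, and filesystems with compression or inline data can give misleading answers.

```python
import os

# Assumption: st_blocks is counted in 512-byte units, as documented for
# stat(2) on Linux and FreeBSD, independent of st_blksize.
BLOCK_UNIT = 512


def looks_sparse(size, blocks):
    """Heuristic: fewer allocated bytes than the apparent size hints at holes."""
    return blocks * BLOCK_UNIT < size


def path_looks_sparse(path):
    """Apply the heuristic to a file on disk (hypothetical helper)."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)
```

For example, `path_looks_sparse("disk.img")` would flag a 10 GiB image that only occupies a few MiB on disk, but it cannot tell an allocated run of zeros from a hole.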
well, for sparse extraction (and many other things) I always look at the "gold" standard of backups: tar. it looks like tar marks sparse blocks explicitly - so upon recreation only the parts/files that were sparse in the original will be sparse on extraction. |
upon reading the doc a second time i'm actually not so sure any more if tar is that smart :) |
Considering the importance of this for VM backups / restores, I'll work on this next. |
Do you have thoughts on how you'll go about it? Near as I can tell, the best you can do for generic sparse file detection is looking at For frequent VM users: is it likely to matter if an allocated block of zeros gets replaced with an unallocated block on restore? |
detect (or not, maybe not needed) via stat.*
lseek with SEEK_HOLE and SEEK_DATA to find holes and data - does require a recent (>=3.8) linux kernel.
holes -> low space usage / space grows on demand
no holes, but zeros -> contiguous block allocation on disk, might have better perf. |
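A minimal sketch of walking a file with `SEEK_DATA` / `SEEK_HOLE`, assuming Python 3.3+ on a platform where `os.SEEK_DATA` and `os.SEEK_HOLE` are defined. The name `data_extents` is made up for illustration and this is not attic's code. On filesystems without hole support the kernel is expected to report the whole file as one data extent.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) byte ranges that hold data; the gaps between them are holes."""
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            # Next offset >= pos that contains data.
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                # No data after pos: the rest of the file is a trailing hole.
                return
            raise
        # Next hole after start; end of file counts as an implicit hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        pos = end
```

Usage would be something like `with open(path, "rb") as f: print(list(data_extents(f.fileno())))`. Note that the calls move the file offset, so seek back before reading.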
Oh very nice and good to know about. |
A little problem (esp. concerning compatibility) is that attic just stores raw file data into the chunks and the sum of chunks is the file's content. There is no chunk metadata. For the sparse file support (including being able to restore sparse files to the exact same state as they were found), it would need some metadata, e.g. like this:
Any better ideas? |
I wrote some code that reads all (sparse) files given as arguments (read-only, avoiding spoiling the OS cache) and prints out the data and hole areas. You could do me a favour and run it on your sparse files (e.g. VM disk images) - especially if you run some other OS than I do:
And then just check whether it raises any assertion errors or reports anything unexpected. It works on Python 3.3+ (on 3.2, it will not find holes - that is expected) and Ubuntu Linux 14.04. |
Tested against KVM qcow2 sparse file on Debian Jessie with python 3.4.2-2. You are currently trying to reproduce an exact copy of the sparse file. Have you thought about an approximation of a sparse file? It would be possible to precalculate the hashes of all-zero-bytes-chunks. |
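To illustrate the precalculation idea, here is a small sketch that caches the id of an all-zero chunk per chunk length. It uses plain SHA-256 as a stand-in; attic's real chunk id function may be keyed or otherwise different, and the names `zero_chunk_id` and `chunk_id` are hypothetical.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def zero_chunk_id(length):
    """Id of a chunk of `length` zero bytes, computed once per length.

    Assumption: plain SHA-256 stands in for the real id hash.
    """
    return hashlib.sha256(bytes(length)).digest()


def chunk_id(chunk):
    """Return the chunk id, taking the cached value for all-zero chunks."""
    if chunk.count(0) == len(chunk):
        return zero_chunk_id(len(chunk))
    return hashlib.sha256(chunk).digest()
```

With such a table, an extractor could recognise all-zero chunks by id alone and skip fetching or decompressing them.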
@JuergenBS yes, the reasons are already outlined above. |
As the exact reproduction of the holes of sparse files would need the above mentioned deeper and bigger changes in attic, I first implemented a simpler approach that just restores all-zero chunks as sparse - no matter how they were originally represented. See PR #284. Update: there is now a --sparse cmdline option to say whether one wants no sparse files (default) or sparse files (--sparse) when restoring all-zero chunks. |
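The simpler approach can be sketched roughly like this. It is not the code from the PR; `write_chunks` and its `sparse` parameter are illustrative names. The idea is to seek over all-zero chunks instead of writing them, and to set the final length explicitly, because a trailing seek alone does not extend the file.

```python
import os


def write_chunks(f, chunks, sparse=False):
    """Write chunks to the binary file object f in order.

    With sparse=True, all-zero chunks are skipped by seeking, which leaves
    a hole on filesystems that support sparse files.
    """
    for chunk in chunks:
        if sparse and chunk.count(0) == len(chunk):
            f.seek(len(chunk), os.SEEK_CUR)
        else:
            f.write(chunk)
    # Fix the file length in case the last chunk was skipped.
    f.truncate(f.tell())
```

Either way the restored content is byte-identical. Only the on-disk allocation differs, which is why allocated zeros in the original can come back as holes.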
Maybe sparse files could be dealt with intelligently (not just backup holes as lots of zeros, not extract holes as lots of zeros [but as sparse files]).
http://librelist.com/browser//attic/2014/11/28/handling-of-sparse-files/#0aa400e0ada2cc4ec8656310cff938d1