Speeding up deduplication #5721
Comments
Sounds interesting. Update: some more links:
One way to speed up chunking could be to skip it entirely for certain file types. For instance, I think there is no significant advantage in chunking some kinds of compressed files, such as gzipped files, JPEGs, PNGs and the like. It is extremely improbable that two versions of such a file differ by only a small binary delta, because even a small change in the source data can cause large and widespread changes in the compressed file. Other compressed formats, where compression is applied to items smaller than the whole file (e.g. zip, PDF, some audio files), may still benefit from chunking, though. I wonder if it would make sense to teach borg to follow per-file user hints for chunking, similar to how git lets you attach attributes to files. This mechanism could also be used to avoid recompressing data that is already compressed. Incidentally, not chunking files that do not benefit from it would leave headroom for more aggressive chunking params on the files that do benefit.
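To make the idea a bit more concrete, here is a minimal sketch of a per-file chunking hint, loosely modelled on how `.gitattributes` maps patterns to attributes. Nothing here is existing borg functionality: the rule table, the mode names `"none"` and `"cdc"`, and the function `chunking_mode` are all hypothetical.

```python
import fnmatch

# Hypothetical rule table (not a borg feature): first matching pattern wins.
# "none" = store the file without content-defined chunking,
# "cdc"  = run the normal content-defined chunker.
CHUNKING_RULES = [
    ("*.gz", "none"),
    ("*.jpg", "none"),
    ("*.jpeg", "none"),
    ("*.png", "none"),
    ("*.zip", "cdc"),   # compressed per member, may still dedup
    ("*.pdf", "cdc"),   # compressed per object, may still dedup
]
DEFAULT_MODE = "cdc"


def chunking_mode(path, rules=CHUNKING_RULES, default=DEFAULT_MODE):
    """Return the chunking mode for a path based on its file name."""
    # Match on the lower-cased base name so "IMG.JPG" hits "*.jpg".
    name = path.rsplit("/", 1)[-1].lower()
    for pattern, mode in rules:
        if fnmatch.fnmatchcase(name, pattern):
            return mode
    return default
```

With these example rules, `chunking_mode("photos/IMG_0001.JPG")` gives `"none"`, while `chunking_mode("src/main.c")` falls through to the default `"cdc"`. A real implementation would still need some upper bound on object size for very large unchunked files.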
We already had a similar discussion for compression and even had code for file-type-based compression. In the end, we removed it again in favor of a simple "auto" compression decider that does not need a lot of configuration. For the chunker, IIRC:
That said, speedups for CDC would be nice, let's just not make it too complicated.
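For readers who have not seen the "auto" approach mentioned above, the general idea of a configuration-free decider can be sketched as follows: probe each chunk with a cheap compressor and only spend time on the expensive one if the probe shows a gain. This is an illustration only, not borg's actual code; `zlib` at two levels stands in for the fast and slow compressors, and `compress_auto`, `probe_level`, `real_level` and `min_gain` are made-up names.

```python
import zlib


def compress_auto(chunk, probe_level=1, real_level=9, min_gain=0.03):
    """Decide per chunk whether compressing is worth it.

    Returns a (method, payload) tuple where method is "none" or "zlib".
    All parameters are assumptions chosen for illustration.
    """
    # Cheap probe: if even a fast pass saves less than min_gain,
    # treat the chunk as incompressible and store it as-is.
    probe = zlib.compress(chunk, probe_level)
    if len(probe) >= len(chunk) * (1 - min_gain):
        return ("none", chunk)
    # The probe showed a gain, so pay for the stronger setting.
    return ("zlib", zlib.compress(chunk, real_level))
```

The point of this design is that already-compressed data (JPEG, gzip, ...) gets detected from its content, so no per-file-type configuration is needed.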
New kid in town: http://www.cloud-conf.net/ispa2021/proc/pdfs/ISPA-BDCloud-SocialCom-SustainCom2021-3mkuIWCJVSdKJpBYM7KEKW/264600a288/264600a288.pdf I don't know of any implementation though.
See also UltraCDC I described in #3026 (comment)
Hi,
This approach to deduplication seems interesting:
https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
It has led to a 2x speed-up in zstd:
facebook/zstd#2483
How about using it in borg-backup?
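For context, the linked paper describes FastCDC, which builds on a "gear" rolling hash: one table lookup, one shift and one add per byte. Below is a heavily simplified sketch of gear-based content-defined chunking. It is not the paper's full algorithm (no normalized chunking with two masks) and not borg's chunker; the table seed, the parameter names and the default sizes are arbitrary assumptions.

```python
import random

# Assumed gear table: 256 pseudo-random 64-bit values, fixed seed so
# that cut points are reproducible between runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def gear_chunks(data, min_size=2048, avg_bits=13, max_size=65536):
    """Split bytes into content-defined chunks using a gear hash."""
    # Test the top avg_bits bits of the hash, so a cut happens with
    # probability 2**-avg_bits per hashed byte.
    mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)
        cut = end  # forced cut if no hash match is found
        h = 0
        # Skip the first min_size bytes entirely: no cut is allowed
        # there, so hashing them would be wasted work.
        i = start + min_size
        while i < end:
            h = ((h << 1) + GEAR[data[i]]) & MASK64
            if h & mask == 0:
                cut = i + 1
                break
            i += 1
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Because each byte's contribution is shifted out of the 64-bit hash after 64 steps, cut points depend only on nearby content. After an insertion near the start of a file, the chunker will typically resynchronize and produce the same later chunks as before, which is what makes deduplication work.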