
Speeding up deduplication #5721

LaurentBonnaud opened this issue Mar 3, 2021 · 5 comments

@LaurentBonnaud

Hi,

this approach to deduplication seems interesting:

https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf

It has led to a 2x speed-up in zstd:

facebook/zstd#2483

How about using it in borg-backup?
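
For anyone skimming the paper: the core of FastCDC is a Gear rolling hash (one shift, one add, one table lookup per byte), plus tricks like skipping boundary checks below the minimum chunk size and normalized chunking with two masks. Here is a minimal Python sketch of just the Gear-hash part; the seed, mask and size bounds are made up for illustration and are not borg's chunker parameters.

```python
import random

# Illustrative parameters -- not borg's real chunker settings.
MIN_SIZE = 2 * 1024         # minimum chunk size
MAX_SIZE = 64 * 1024        # hard upper bound per chunk
MASK = 0x0000D93003530000   # 13 set bits -> roughly 8 KiB average chunks

# Fixed 256-entry table of random 64-bit values (the "gear" table).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def gear_chunks(data: bytes):
    """Yield content-defined chunks of `data` using a Gear rolling hash."""
    start, n = 0, len(data)
    while start < n:
        h = 0
        end = min(start + MAX_SIZE, n)
        cut = end  # fall back to MAX_SIZE (or EOF) if no boundary is found
        for i in range(start, end):
            # Gear update: one shift and one table add per input byte.
            h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
            if i - start + 1 >= MIN_SIZE and (h & MASK) == 0:
                cut = i + 1
                break
        yield data[start:cut]
        start = cut
```

According to the paper, the speedup comes mostly from this cheap per-byte update (compared to Rabin-style rolling hashes) and from not evaluating the boundary condition at all inside the minimum-size region.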

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 4, 2021

@callegar

callegar commented Mar 5, 2021

One way to speed up chunking could be to not chunk certain file types at all. For instance, I think there is no significant advantage in chunking certain types of compressed files, such as gzipped files, JPEGs, PNGs and the like: it is extremely improbable to find companions with only a small binary delta, because even a small change in the underlying data can cause large and widespread changes in the compressed file. Other types of compressed files, where compression is applied to items smaller than the whole file (e.g. zips, PDFs, some audio files), may still benefit from chunking, though.

I wonder if it could make sense to teach borg to follow per-file user hints for chunking, similarly to how git can attach attributes to files. The same mechanism could also be used to avoid recompressing data that is already compressed.

Incidentally, not chunking files that do not benefit from it would leave a margin for using more aggressive chunking params for files that do benefit from it.
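
To make that concrete, here is a hypothetical sketch of such a per-file-type decision; the extension list and the function are invented for illustration, and nothing like this exists in borg today:

```python
import os

# Hypothetical policy table, invented for this example -- not a borg feature.
# Whole-file compressed formats rarely yield small binary deltas between
# near-identical versions, so content-defined chunking buys little there.
ALREADY_COMPRESSED = {'.gz', '.xz', '.bz2', '.zst', '.jpg', '.jpeg', '.png'}


def wants_cdc(path: str) -> bool:
    """Return True if content-defined chunking is likely worthwhile for path."""
    ext = os.path.splitext(path)[1].lower()
    # Container formats (zip, pdf, ...) compress per item, so they are not
    # excluded here and may still benefit from chunking.
    return ext not in ALREADY_COMPRESSED
```

A user-editable attributes file mapping patterns to chunking/compression hints, in the spirit of .gitattributes, would be the configuration side of the same idea.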

@ThomasWaldmann
Member

We already had a similar discussion for compression and even had code for file-type-based compression. In the end, we removed that again for a simple "auto" compression decider that does not need a lot of configuration.

For the chunker, IIRC:

  • if a file is below the minimum chunk size, it immediately returns 1 chunk with the full file
  • for big files, we always need multiple chunks because the borg repo code cannot store more than ~8 MiB in one piece (this also makes sure we have manageable piece sizes for in-memory processing)
  • in 1.2, we'll also have a fixed-chunksize chunker that is very light on the CPU (e.g. for block devices, fixed-recordsize DBs, etc.); it even has sparse file / sparse map support (see the sketch below)
  • KISS (keep it simple & "stupid")

That said, speedups for CDC would be nice; let's just not make it too complicated.
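
For illustration only, a toy Python version of the fixed-chunksize idea (borg's real chunkers are implemented in C/Cython and the fixed one additionally handles sparse regions; the size below is just an example):

```python
def fixed_chunks(fileobj, chunk_size=4 * 1024 * 1024):
    """Toy fixed-size chunker: no rolling hash, so it is very light on the CPU.

    chunk_size is illustrative and must stay below the ~8 MiB per-object limit
    mentioned above. A file smaller than chunk_size simply comes back as one
    chunk, matching the small-file behaviour described for the CDC chunker.
    """
    while True:
        piece = fileobj.read(chunk_size)
        if not piece:
            break
        yield piece
```

Deduplication then only kicks in when identical blocks sit at the same offsets, which is why this mode fits block devices and fixed-recordsize databases rather than arbitrary files.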

@Glandos

Glandos commented Dec 22, 2021

New kid in town: http://www.cloud-conf.net/ispa2021/proc/pdfs/ISPA-BDCloud-SocialCom-SustainCom2021-3mkuIWCJVSdKJpBYM7KEKW/264600a288/264600a288.pdf

I don't know of any implementation though.

@Glandos

Glandos commented Oct 8, 2023

See also UltraCDC, which I described in #3026 (comment).

@ThomasWaldmann ThomasWaldmann added this to the 2.0.0rc1 milestone Sep 24, 2024