the new (c)size ticket #2357
Comments
Comment by @enkore, moved from #2313: The main problem with csize is not so much compatibility problems or something like that, but this issue:
But that might be ok, since it's a contrived example. A worst case in the very literal sense. Adding a repository API that returns only the length of a chunk (with the client falling back to len(GET()) if that API is not available) would avoid the network cost and leave mostly the I/O. Compatibility issues are minor, e.g. an older Borg will show too small compressed/dedup sizes in
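A minimal sketch of the length-only call with the len(GET()) fallback mentioned above. `get_length`, `LengthNotSupported`, and the two repository classes are hypothetical names made up for illustration; they are not Borg's actual repository API.

```python
class LengthNotSupported(Exception):
    """Raised by a (hypothetical) repository that lacks a length-only call."""


class OldRepository:
    """Stand-in for a repository that only supports GET."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def get(self, chunk_id):
        return self._objects[chunk_id]

    def get_length(self, chunk_id):
        raise LengthNotSupported(chunk_id)


class NewRepository(OldRepository):
    """Stand-in for a repository that reports a length without sending data."""

    def get_length(self, chunk_id):
        return len(self._objects[chunk_id])


def stored_length(repository, chunk_id):
    """Return the stored length (csize) of a chunk."""
    try:
        # Cheap path: only the length crosses the network.
        return repository.get_length(chunk_id)
    except LengthNotSupported:
        # Fallback: transfer the whole object and measure it client-side.
        return len(repository.get(chunk_id))
```

Both paths return the same number; only the amount of data transferred differs.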
solving the special problems for unchanged-skipped files
progress indication / correct size/csize in archived items:
currently, we only have the chunk ids there, not their sizes/csizes. we do not really have chunk content at hand here, so no information about
Proposal
Problem
(1) Requiring csize means that things like AdhocChunksCache don't work, but we want that.
(2) csize in all references to a chunk also means that changing the chunk's compression means that stats can be slightly off depending on which version of the reference is seen first during cache sync.
(3) csize creates a dependency on the chunk processing before a chunk reference can be stored.
Solution
(1) Introduce csize=0 (compact encoding, one byte) for these. After a cache sync, the client iterates over the chunks cache and will retrieve all objects with csize=0 from the repository and set the csize according to the length of the retrieved object. A dedicated API to avoid transferring the object data itself may be added for this. This fixes the scenario:
(2) Disregard slightly off stats due to recompression, and incomplete stats (at archive creation time) with AdhocChunksCache.
(3) The dependency of chunk references on csize and therefore chunk processing for a new chunk is actually an advantage, since it ensures that the system is always in a forward-consistent state: With the dependency on csize, the cache can only emit references to chunks that are already stored. Therefore, receiving a chunk reference from the cache implies that it will be contained within a repository commit initiated after receiving the reference. This simplifies reasoning about the system considerably, especially in a concurrent setting.
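Solution (1) could look roughly like the sketch below: after a cache sync, walk the chunks cache and fill in every csize=0 placeholder from the length of the stored object. `ChunkEntry`, `fill_missing_csizes`, and the `fetch` callable are hypothetical names for illustration and do not reflect Borg's real cache structures.

```python
from dataclasses import dataclass


@dataclass
class ChunkEntry:
    refcount: int
    size: int
    csize: int  # 0 is the placeholder for "not known yet"


def fill_missing_csizes(chunks, fetch):
    """Replace csize=0 placeholders with the length of the stored object.

    chunks: dict mapping chunk id -> ChunkEntry
    fetch:  callable returning the stored object bytes for a chunk id
            (a length-only repository call could be substituted here)
    Returns the number of entries that were updated.
    """
    updated = 0
    for chunk_id, entry in chunks.items():
        if entry.csize == 0:
            entry.csize = len(fetch(chunk_id))
            updated += 1
    return updated
```

Entries that already carry a csize are left alone, so only the placeholders cost a repository round trip.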
@enkore If you want to be able to implement the AdhocChunkscache later, I guess your proposal needs to address size also, not just csize, see my comments above. Even with my idea, there is a problem if the files cache is lost. |
A first solution could be to just not use a files cache in this case. This may be a bit annoying with 1.1, though a quick back-of-the-envelope calculation suggests that in many cases it would still be worth it. E.g. I know that with --no-files-cache a system backup needs about 15 minutes, but a cache sync to one of my larger repos takes longer than that. So --no-cache-sync (implying --no-files-cache) would still be faster. And with 1.2 this becomes even less of an issue.
#2654 implements my proposal above to the letter.
This is now in borg2 branch, so the Update: merged into master now. |
Note: size, csize, ctype, clevel is now available as separate encrypted metadata via |
IIRC / AFAIK, there is nothing left we need to do right now, but we could do some improvements later:
Closing this, see comment above. |
scope of this ticket
let's concentrate here on the issue of csize (and also size) information in the items' chunks lists, in the chunks and files cache. no crypto or other discussion in here, let's stay focussed.
csize
the main issue is that csize is not a direct function of the data, it also depends on compression and encryption (and other overhead) that is applied to the data. as both might change (and thus csize might change) while the chunk still contains the same (plaintext) content and has the same id, it is an annoyance to have csize in the chunks lists of archived items.
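To illustrate the point: the same plaintext keeps the same id and size, while csize changes with the compression setting. This is only a sketch; plain SHA-256 and zlib stand in for whatever id hash and compressor are actually configured, and `describe` is a hypothetical helper.

```python
import hashlib
import zlib


def describe(data, level):
    """Return id, size and csize of a chunk for a given zlib level."""
    return {
        "id": hashlib.sha256(data).hexdigest(),   # depends on plaintext only
        "size": len(data),                        # depends on plaintext only
        "csize": len(zlib.compress(data, level)), # depends on compression too
    }


data = b"borg " * 2000
stored = describe(data, 0)      # level 0: stored without compression
compressed = describe(data, 9)  # level 9: best compression

# id and size match, csize does not.
print(stored["id"] == compressed["id"], stored["size"] == compressed["size"])
print(stored["csize"], compressed["csize"])
```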
size
we must have chunk size information in the chunks lists of archived items for the case we lose multiple chunks in the repo - so we can replace them with all-zero chunks of same length. size is a direct function of the data, so no problem here if we change compression/encryption/overhead.
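A minimal sketch of why size has to be in the chunks list: with it, a lost chunk can be replaced by zeros of the right length, so the following chunks keep their offsets. `read_item` and `fetch` are hypothetical names for illustration, not Borg's actual extraction code.

```python
def read_item(chunk_list, fetch):
    """Reassemble an item's content, zero-filling chunks that are missing.

    chunk_list: list of (chunk_id, size) pairs as stored in the archived item
    fetch:      callable returning the chunk's plaintext bytes, or None if the
                chunk is no longer in the repository
    """
    parts = []
    for chunk_id, size in chunk_list:
        data = fetch(chunk_id)
        if data is None:
            # Chunk lost: substitute an all-zero chunk of the same length.
            data = bytes(size)
        parts.append(data)
    return b"".join(parts)
```

Without the stored size, there would be no way to know how many zero bytes to substitute.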
timing of size / csize computation
where is chunk size/csize (not) stored?
where is size/csize used?
Archive.info
-> limits -> max_archive_size and Archive.__str__
Cache.__str__
.chunks_stored_size