the new (c)size ticket #2357

Closed
ThomasWaldmann opened this issue Mar 30, 2017 · 11 comments

Comments

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 30, 2017

scope of this ticket

let's concentrate here on the issue of csize (and also size) information in the items' chunks lists, in the chunks and files cache. no crypto or other discussion in here, let's stay focussed.

csize

the main issue is that csize is not a direct function of the data, it also depends on compression and encryption (and other overhead) that is applied to the data. as both might change (and thus csize might change) while the chunk still contains the same (plaintext) content and has the same id, it is an annoyance to have csize in the chunks lists of archived items.

size

we must have chunk size information in the chunks lists of archived items for the case we lose multiple chunks in the repo - so we can replace them with all-zero chunks of same length. size is a direct function of the data, so no problem here if we change compression/encryption/overhead.
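A minimal sketch of that repair idea, assuming a hypothetical dict-like `repo` (chunk id -> plaintext) and chunk list entries of the form `(id, size, csize)` as described in this ticket. It is not borg's actual check/repair code, only an illustration of why `size` has to be in the item's chunks list:

```python
def read_item_with_repair(chunks, repo):
    """Reassemble an item's data; a lost chunk becomes an all-zero run of its recorded size."""
    parts = []
    for chunk_id, size, _csize in chunks:
        data = repo.get(chunk_id)
        if data is None:
            # The chunk is gone from the repo: only the size stored in the
            # item's chunks list tells us how many zero bytes to substitute.
            data = bytes(size)
        parts.append(data)
    return b"".join(parts)


# Hypothetical example data: chunk b"id-b" is missing from the repo.
repo = {b"id-a": b"hello ", b"id-c": b"!"}
chunks = [(b"id-a", 6, 20), (b"id-b", 5, 19), (b"id-c", 1, 15)]
restored = read_item_with_repair(chunks, repo)
```

Note that `csize` is never consulted here; the replacement only depends on the plaintext length.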

timing of size / csize computation

  • size is computed early, after the chunker has cut the chunks: len(chunk)
  • csize is computed late, after compression, after encryption/authentication. note: this can lead to a race (wait) condition in multithreaded processing.
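To illustrate the difference in timing, here is a small sketch in which `zlib.compress` stands in for the whole compress + encrypt + authenticate pipeline (an assumption for illustration only; the real pipeline adds further overhead). The same chunk yields the same `size` but a different `csize` depending on the settings:

```python
import zlib


def store_sizes(chunk: bytes, level: int):
    size = len(chunk)                     # known right after the chunker cut the chunk
    stored = zlib.compress(chunk, level)  # stand-in for compress + encrypt + authenticate
    csize = len(stored)                   # only known once the whole pipeline has run
    return size, csize


chunk = b"borg " * 1000
size_none, csize_none = store_sizes(chunk, 0)  # level 0: stored without compression
size_best, csize_best = store_sizes(chunk, 9)  # level 9: best compression
```

With level 0 the stored object is slightly larger than the plaintext, with level 9 it is much smaller, while `size` stays the same in both cases.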

where is chunk size/csize (not) stored?

  • repo: the current PUT entry in the segment file contains csize in the length information. no size available here! also, neither size nor csize is in the repo index.
  • archive: item.chunks = [(id, size, csize), ...]
  • chunks cache: id -> (refcount, size, csize)
  • files cache: no size/csize here: path_hash -> (file_size, ino, mtime, chunks=[id, id, ...])
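The three client-side layouts listed above could be written down roughly like this. The class and field names are made up for this sketch (borg itself uses plain tuples and its own index structures), only the field order follows the list above:

```python
from typing import List, NamedTuple


class ChunkListEntry(NamedTuple):   # one element of item.chunks in an archive
    id: bytes
    size: int
    csize: int


class ChunkIndexEntry(NamedTuple):  # chunks cache value, keyed by chunk id
    refcount: int
    size: int
    csize: int


class FileCacheEntry(NamedTuple):   # files cache value, keyed by path hash
    file_size: int
    ino: int
    mtime: int
    chunks: List[bytes]             # chunk ids only, no size/csize


entry = ChunkListEntry(id=b"\x00" * 32, size=4096, csize=1200)
chunks_cache = {entry.id: ChunkIndexEntry(refcount=1, size=entry.size, csize=entry.csize)}
files_cache = {b"path-hash": FileCacheEntry(file_size=4096, ino=42, mtime=0, chunks=[entry.id])}
```

Going from a files cache entry to sizes therefore always needs a lookup in the chunks cache.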

where is size/csize used?

  • size dsize csize dcsize placeholders
  • Statistics class + show_progress
  • chunk_incref (gets size/csize from chunks cache - important for archiving unchanged files)
  • csize: Archive.info -> limits -> max_archive_size and Archive.__str__
  • csize: Cache.__str__ .chunks_stored_size
  • size: do_diff sum_chunk_sizes (to show sum of lengths of added/removed chunks of a file)
  • size: borg check size consistency check item.size == sum(chunks size)
  • tests
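Two of the uses above, sketched with a plain dict as a stand-in for the chunks cache. The function names mirror the ones mentioned in the list, but the bodies are simplified assumptions, not borg's implementation:

```python
def chunk_incref(chunks_cache, chunk_id):
    """Bump the refcount and return a chunks-list entry with size/csize from the cache."""
    refcount, size, csize = chunks_cache[chunk_id]
    chunks_cache[chunk_id] = (refcount + 1, size, csize)
    return chunk_id, size, csize


def sizes_consistent(item_size, item_chunks):
    """The check-style consistency test: item.size == sum of chunk sizes."""
    return item_size == sum(size for _id, size, _csize in item_chunks)


# Archiving an unchanged file: only ids are known, sizes come from the chunks cache.
chunks_cache = {b"a": (1, 4096, 1200), b"b": (3, 1000, 400)}
item_chunks = [chunk_incref(chunks_cache, cid) for cid in (b"a", b"b")]
```

This is why a chunks cache without correct size/csize values directly affects what ends up in the archived item.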
@ThomasWaldmann
Member Author

Comment by @enkore, moved from #2313:

The main problem with csize is not so much compatibility problems or something like that, but this issue:

  • Create an archive with some new chunks, they will have csize set
  • Create another archive, but without the cache, so with csize=0 (which is unambiguous) for these chunks
  • Delete the first archive
  • Sync the cache
  • Then the cache cannot know the csize of these chunks
  • So we need to ask the repository, which means it has to do per-chunk I/O, because the csize isn't in the repo index, either (which makes a lot of sense)

But that might be ok, since it's a contrived example. A worst case in the very literal sense. Adding a repository API that returns only the length of a chunk (with the client falling back to len(GET()) if that API is not available) would avoid the network cost and leave mostly the I/O.
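A sketch of that fallback on the client side. The method name `get_length` and both repository classes are hypothetical, made up for this example; no such API is confirmed in this ticket:

```python
class OldRepo:
    """Stand-in for a repository that only offers GET."""

    def __init__(self, objects):
        self._objects = objects

    def get(self, chunk_id):
        return self._objects[chunk_id]


class NewRepo(OldRepo):
    """Stand-in for a repository that additionally offers a length-only call."""

    def get_length(self, chunk_id):  # hypothetical API name
        return len(self._objects[chunk_id])


def stored_length(repo, chunk_id):
    get_length = getattr(repo, "get_length", None)
    if get_length is not None:
        return get_length(chunk_id)  # no object data transferred
    return len(repo.get(chunk_id))   # fallback: transfer the object, keep only its length


objects = {b"a": b"x" * 300}
length_old = stored_length(OldRepo(objects), b"a")
length_new = stored_length(NewRepo(objects), b"a")
```

Both paths give the same answer; the dedicated call would only save the transfer of the object data.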

Compatibility issues are minor, e.g. an older Borg will show too small compressed/dedup sizes in borg info.

@ThomasWaldmann
Member Author

solving the special problems for unchanged-skipped files

progress indication / correct size/csize in archived items:

  • we get their chunk IDs list from the files cache.

    currently, we only have the chunk ids there, not their sizes/csizes.

    we do not really have chunk content at hand here, so no information about
    size or csize - we get them via chunk_incref from chunks index for progress
    indication and also to generate the chunks list for the item (with correct
    size and csize).

  • in the adhoc chunks cache (see borg create without a cache – Prototype, Do Not Merge #2350), we do not have correct size or csize information!

  • we could modify files cache to include size (csize), then progress indication
    would work normally for unchanged files and also the archived items would
    have correct size/csize infos in their chunks list.

    memory usage: 32-byte str -> tuple(32-byte str, int, int)
    it is msgpacked, so maybe not that bad.

  • this would need to update chunks cache entries with unknown (c)size to the
    known (c)size from the files cache.
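A rough sketch of that last step, assuming a files cache that already stores `(id, size, csize)` per chunk and assuming that 0 marks an unknown value in the chunks cache. Both are assumptions for illustration; plain dicts stand in for the real caches:

```python
UNKNOWN = 0  # assumed marker for "size/csize not known"


def backfill_from_files_cache(chunks_cache, files_cache):
    """Copy known (c)sizes from the files cache into chunks cache entries that lack them."""
    updated = 0
    for _file_size, _ino, _mtime, chunks in files_cache.values():
        for chunk_id, size, csize in chunks:
            entry = chunks_cache.get(chunk_id)
            if entry is None:
                continue  # chunk not referenced by the chunks cache at all
            refcount, old_size, old_csize = entry
            if old_size == UNKNOWN or old_csize == UNKNOWN:
                chunks_cache[chunk_id] = (refcount, old_size or size, old_csize or csize)
                updated += 1
    return updated


chunks_cache = {b"a": (1, 0, 0), b"b": (2, 500, 200)}
files_cache = {b"p1": (1500, 7, 0, [(b"a", 1000, 300), (b"b", 500, 200)])}
updated = backfill_from_files_cache(chunks_cache, files_cache)
```

Entries that already have both values are left untouched.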

@enkore
Contributor

enkore commented Apr 6, 2017

Proposal

Problem

(1) Requiring csize means that things like AdhocChunksCache don't work, but we want that.

(2) Storing csize in every reference to a chunk also means that, after the chunk's compression is changed, stats can be slightly off depending on which version of the reference is seen first during cache sync.

(3) csize creates a dependency on chunk processing: a chunk must be processed before a reference to it can be stored.

Solution

(1) Introduce csize=0 (compact encoding, one byte) for these.

After a cache sync, the client iterates over the chunks cache and will retrieve all objects with csize=0 from the repository and set the csize according to the length of the retrieved object. A dedicated API to avoid transferring the object data itself may be added for this. This fixes the scenario:

  • Create an archive with some new chunks, they will have csize set
  • Create another archive, but without the cache, so with csize=0 (which is unambiguous) for these chunks
  • Delete the first archive
  • Sync the cache
  • Then the cache cannot know the csize of these chunks
  • So we need to ask the repository, which means it has to do per-chunk I/O, because the csize isn't in the repo index, either (which makes a lot of sense)

(2) Disregard slightly off stats due to recompression, and incomplete stats (at archive creation time) with AdhocChunksCache.

(3) The dependency of chunk references on csize and therefore chunk processing for a new chunk is actually an advantage, since it ensures that the system is always in a forward-consistent state: With the dependency on csize, the cache can only emit references to chunks that are already stored. Therefore, receiving a chunk reference from the cache implies that it will be contained within a repository commit initiated after receiving the reference. This simplifies reasoning about the system considerably, especially in a concurrent setting.
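Solution (1) above could look roughly like this after a cache sync. Plain dicts stand in for the chunks cache and the repository, and the function name is made up for this sketch; it is not taken from any actual implementation:

```python
def resolve_unknown_csizes(chunks_cache, repo):
    """For every entry with csize == 0, fetch the stored object and record its length."""
    fetched = 0
    for chunk_id, (refcount, size, csize) in list(chunks_cache.items()):
        if csize == 0:
            stored = repo[chunk_id]  # per-chunk I/O; a length-only API would avoid the transfer
            chunks_cache[chunk_id] = (refcount, size, len(stored))
            fetched += 1
    return fetched


repo = {b"a": b"x" * 300, b"b": b"y" * 200}
chunks_cache = {b"a": (1, 1000, 0), b"b": (2, 500, 200)}
fetched = resolve_unknown_csizes(chunks_cache, repo)
```

Only the entries written without a cache cost a repository round trip; everything with a known csize is skipped.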

@ThomasWaldmann
Member Author

@enkore If you want to be able to implement the AdhocChunksCache later, I guess your proposal needs to address size also, not just csize, see my comments above. Even with my idea, there is a problem if the files cache is lost.

@enkore
Contributor

enkore commented Apr 6, 2017

A first solution could be to just not use a files cache in this case.

This may be a bit annoying with 1.1, though a quick back-of-the-envelope calculation suggests that in many cases it would still be worth it. E.g. I know that with --no-files-cache a system backup needs about 15 minutes, but a cache sync to one of my larger repos takes longer than that. So --no-cache-sync (implying --no-files-cache) would still be faster. And with 1.2 this becomes even less of an issue.

@enkore
Copy link
Contributor

enkore commented Jun 10, 2017

#2654 implements my proposal above to the letter.

@ThomasWaldmann
Member Author

ThomasWaldmann commented Jun 12, 2022

#6763 removes csize everywhere.

something related is still in the entry length of a segment file entry and might come to the repo index via #6705.

@ThomasWaldmann ThomasWaldmann self-assigned this Jun 12, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: lithium, 2.0.0a2 Jun 26, 2022
@ThomasWaldmann
Member Author

ThomasWaldmann commented Jul 4, 2022

This is now in the borg2 branch, so the csize-related issues are solved by removing it everywhere.

Update: merged into master now.

@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0a2, 2.0.0b1 Jul 4, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b1, 2.0.0b2 Aug 4, 2022
@ThomasWaldmann
Member Author

Note: size, csize, ctype, clevel are now available as separate encrypted metadata via repo.get(id, read_data=False) and RepoObj.parse_meta(chunk).

@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b2, 2.0.0b3 Sep 9, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b3, 2.0.0b4 Sep 29, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b4, 2.0.0b5 Nov 26, 2022
@ThomasWaldmann
Member Author

IIRC / AFAIK, there is nothing left we need to do right now, but we could do some improvements later:

  • improve the adhoc chunks cache
  • improve stats / info

@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b5, 2.0.0rc1 Jan 22, 2023
@ThomasWaldmann ThomasWaldmann removed their assignment Jun 8, 2023
@ThomasWaldmann ThomasWaldmann moved this to hashindex / cache in breaking Jul 14, 2024
@ThomasWaldmann
Member Author

Closing this, see comment above.
