the new (c)size ticket #2357

ThomasWaldmann · 2017-03-30T13:45:48Z

scope of this ticket

let's concentrate here on the issue of csize (and also size) information in the items' chunks lists, in the chunks and files cache. no crypto or other discussion in here, let's stay focussed.

csize

the main issue is that csize is not a direct function of the data, it also depends on compression and encryption (and other overhead) that is applied to the data. as both might change (and thus csize might change) while the chunk still contains the same (plaintext) content and has the same id, it is an annoyance to have csize in the chunks lists of archived items.
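a minimal sketch of the point above, under loud assumptions: sha256 stands in for the chunk id function and zlib for "compression + encryption + overhead". neither is meant to match borg's real pipeline; the sketch only shows that id and size stay fixed while csize moves with the compression setting.

```python
import hashlib
import zlib


def chunk_id(data: bytes) -> bytes:
    # stand-in id function (assumption): plain sha256 of the plaintext
    return hashlib.sha256(data).digest()


def stored_size(data: bytes, level: int) -> int:
    # stand-in for "compress, then encrypt/authenticate": compression only
    return len(zlib.compress(data, level))


data = b"the same plaintext chunk content " * 1000

size = len(data)                   # direct function of the data
csize_none = stored_size(data, 0)  # level 0: stored, plus zlib overhead
csize_best = stored_size(data, 9)  # level 9: strongly compressed

print(chunk_id(data).hex()[:16], size, csize_none, csize_best)
```

the id and size printed here never change for this data, but any csize recorded in an archive's chunks list becomes stale as soon as the chunk is stored again with a different setting.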

size

we must have chunk size information in the chunks lists of archived items for the case we lose multiple chunks in the repo - so we can replace them with all-zero chunks of same length. size is a direct function of the data, so no problem here if we change compression/encryption/overhead.
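a rough sketch of the zero-replacement idea. the repo here is a plain dict (id -> bytes) and read_file_chunks is a hypothetical helper, not borg's actual extraction code. only size is needed from the chunks list, csize plays no role.

```python
def read_file_chunks(chunks, repo):
    """Yield the data for each (chunk_id, size) entry of an item.

    repo is a hypothetical dict-like mapping id -> bytes. A chunk that is
    missing from the repo is replaced by an all-zero chunk of same length.
    """
    for chunk_id, size in chunks:
        data = repo.get(chunk_id)
        if data is None:
            data = bytes(size)  # size comes from the chunks list
        yield data


repo = {b"a": b"hello", b"c": b"world!"}          # chunk b"b" was lost
chunks = [(b"a", 5), (b"b", 3), (b"c", 6)]
restored = b"".join(read_file_chunks(chunks, repo))
print(restored)
```

the file keeps its original length and the intact chunks stay at their original offsets, which would not be possible without size in the chunks list.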

timing of size / csize computation

size is computed early, after the chunker has cut the chunks: len(chunk)
csize is computed late, after compression, after encryption/authentication. note: this can lead to a race (wait) condition in multithreaded processing.

where is chunk size/csize (not) stored?

repo: the current PUT entry in the segment file contains csize in the length information. no size available here! also, neither size nor csize is in the repo index.
archive: item.chunks = [(id, size, csize), ...]
chunks cache: id -> (refcount, size, csize)
files cache: no size/csize here: path_hash -> (file_size, ino, mtime, chunks=[id, id, ...])
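for illustration, the three client-side layouts above written as typed tuples. the class and field names are made up for this sketch (borg does not necessarily use namedtuples here); only the tuple shapes are taken from the list above.

```python
from typing import List, NamedTuple


class ChunkListEntry(NamedTuple):
    # archive: item.chunks = [(id, size, csize), ...]
    id: bytes
    size: int
    csize: int


class ChunksCacheEntry(NamedTuple):
    # chunks cache: id -> (refcount, size, csize)
    refcount: int
    size: int
    csize: int


class FilesCacheEntry(NamedTuple):
    # files cache: path_hash -> (file_size, ino, mtime, chunks=[id, id, ...])
    file_size: int
    ino: int
    mtime: int
    chunks: List[bytes]


chunk_id = b"\x00" * 32  # dummy 32-byte id
item_chunks = [ChunkListEntry(chunk_id, size=4096, csize=1234)]
chunks_cache = {chunk_id: ChunksCacheEntry(refcount=1, size=4096, csize=1234)}
files_cache = {b"path-hash": FilesCacheEntry(4096, 42, 0, chunks=[chunk_id])}
```

note that the files cache entry carries only ids, so per-chunk size/csize always has to come from somewhere else.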

where is size/csize used?

size dsize csize dcsize placeholders
Statistics class + show_progress
chunk_incref (gets size/csize from chunks cache - important for archiving unchanged files)
csize: Archive.info -> limits -> max_archive_size and Archive.__str__
csize: Cache.__str__ .chunks_stored_size
size: do_diff sum_chunk_sizes (to show sum of lengths of added/removed chunks of a file)
size: borg check size consistency check item.size == sum(chunks size)
tests
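a simplified sketch of three of the users listed above. the function names chunk_incref and sum_chunk_sizes come from the list, but the bodies are assumptions written for illustration, not borg's real code. the cache is a plain dict id -> (refcount, size, csize).

```python
def chunk_incref(chunks_cache, chunk_id):
    # bump the refcount and hand back a chunks-list entry; this is where
    # unchanged files get their size/csize from
    refcount, size, csize = chunks_cache[chunk_id]
    chunks_cache[chunk_id] = (refcount + 1, size, csize)
    return chunk_id, size, csize


def sum_chunk_sizes(chunks):
    # as used by diff: total plaintext length of a list of chunks
    return sum(size for _id, size, _csize in chunks)


def size_is_consistent(item_size, chunks):
    # as used by check: item.size == sum(chunks size)
    return item_size == sum_chunk_sizes(chunks)


chunks_cache = {b"x": (1, 10, 4), b"y": (3, 20, 7)}
item_chunks = [chunk_incref(chunks_cache, cid) for cid in (b"x", b"y")]
print(item_chunks, size_is_consistent(30, item_chunks))
```

only chunk_incref and the stats need csize at all; the diff and check users work on size alone.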


ThomasWaldmann · 2017-03-30T15:19:38Z

Comment by @enkore, moved from #2313:

The main problem with csize is not so much compatibility problems or something like that, but this issue:

Create an archive with some new chunks, they will have csize set
Create another archive, but without the cache, so with csize=0 (which is unambiguous) for these chunks
Delete the first archive
Sync the cache
Then the cache cannot know the csize of these chunks
So we need to ask the repository, which means that it has to do per-chunk I/O, because the csize isn't in the repo index either (which makes a lot of sense)

But that might be ok, since it's a contrived example. A worst case in the very literal sense. Adding a repository API that returns only the length of a chunk (with the client falling back to len(GET()) if that API is not available) would avoid the network cost and leave mostly the I/O.
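A minimal sketch of such a length-only call with client-side fallback. The classes and the get_length method name are hypothetical (no such method is claimed to exist in Borg's repository API); bytes_transferred is only there to make the saved network cost visible.

```python
class OldRepo:
    """Hypothetical repository that can only return whole objects."""

    def __init__(self, objects):
        self.objects = objects
        self.bytes_transferred = 0

    def get(self, chunk_id):
        data = self.objects[chunk_id]
        self.bytes_transferred += len(data)  # whole object goes over the wire
        return data


class NewRepo(OldRepo):
    """Hypothetical repository that can also report just the stored length."""

    def get_length(self, chunk_id):
        return len(self.objects[chunk_id])  # still per-chunk I/O on the server


def stored_length(repo, chunk_id):
    # prefer the length-only call, fall back to len(GET()) otherwise
    get_length = getattr(repo, "get_length", None)
    if get_length is not None:
        return get_length(chunk_id)
    return len(repo.get(chunk_id))


old = OldRepo({b"k": b"x" * 100})
new = NewRepo({b"k": b"x" * 100})
print(stored_length(old, b"k"), old.bytes_transferred)
print(stored_length(new, b"k"), new.bytes_transferred)
```

Both calls return the same length; only the old repository has to ship the object data to the client to get there.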

Compatibility issues are minor, e.g. an older Borg will show too small compressed/dedup sizes in borg info.

ThomasWaldmann · 2017-03-30T16:13:20Z

solving the special problems for unchanged-skipped files

progress indication / correct size/csize in archived items:

we get their chunk IDs list from the files cache.

currently, we only have the chunk ids there, not their sizes/csizes.

we do not really have chunk content at hand here, so no information about
size or csize - we get them via chunk_incref from chunks index for progress
indication and also to generate the chunks list for the item (with correct
size and csize).

in the adhoc chunks cache (see borg create without a cache – Prototype, Do Not Merge #2350), we do not have correct size or csize information!
we could modify files cache to include size (csize), then progress indication
would work normally for unchanged files and also the archived items would
have correct size/csize infos in their chunks list.

memory usage: 32-byte str -> tuple(32-byte str, int, int)
it is msgpacked, so maybe not that bad.
this would need to update chunks cache entries with unknown (c)size to the
known (c)size from the files cache.
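a sketch of that update step, assuming the files cache were extended to store (id, size, csize) per chunk and that 0 marks an unknown value in the chunks cache. both the extended layout and the helper name are assumptions for illustration.

```python
UNKNOWN = 0  # assumed marker for "size/csize not known"


def merge_known_sizes(chunks_cache, files_cache_chunks):
    """Fill unknown size/csize in the chunks cache from files cache data.

    chunks_cache:       dict id -> (refcount, size, csize)
    files_cache_chunks: list of (id, size, csize) from the extended files cache
    """
    for chunk_id, size, csize in files_cache_chunks:
        if chunk_id not in chunks_cache:
            continue  # nothing to update, refcounting is done elsewhere
        refcount, known_size, known_csize = chunks_cache[chunk_id]
        # "or" keeps an already known value, otherwise takes the files cache one
        chunks_cache[chunk_id] = (refcount, known_size or size, known_csize or csize)


chunks_cache = {b"a": (1, UNKNOWN, UNKNOWN), b"b": (2, 50, 20)}
merge_known_sizes(chunks_cache, [(b"a", 100, 40), (b"b", 999, 999), (b"c", 1, 1)])
print(chunks_cache)
```

entries that already have sizes are left alone, so a stale files cache value cannot overwrite what the chunks cache knows.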

enkore · 2017-04-06T10:03:02Z

Proposal

Problem

(1) Requiring csize means that things like AdhocChunksCache don't work, but we want that.

(2) Having csize in all references to a chunk also means that, once the chunk's compression changes, stats can be slightly off depending on which version of the reference is seen first during cache sync.

(3) csize creates a dependency on the chunk processing before a chunk reference can be stored.

Solution

(1) Introduce csize=0 (compact encoding, one byte) for these.

After a cache sync, the client iterates over the chunks cache and will retrieve all objects with csize=0 from the repository and set the csize according to the length of the retrieved object. A dedicated API to avoid transferring the object data itself may be added for this. This fixes the scenario:

Create an archive with some new chunks, they will have csize set

Create another archive, but without the cache, so with csize=0 (which is unambiguous) for these chunks

Delete the first archive

Sync the cache

Then the cache cannot know the csize of these chunks

So we need to ask the repository, which means that it has to do per-chunk I/O, because the csize isn't in the repo index either (which makes a lot of sense)

(2) Disregard slightly off stats due to recompression, and incomplete stats (at archive creation time) with AdhocChunksCache.

(3) The dependency of chunk references on csize and therefore chunk processing for a new chunk is actually an advantage, since it ensures that the system is always in a forward-consistent state: With the dependency on csize, the cache can only emit references to chunks that are already stored. Therefore, receiving a chunk reference from the cache implies that it will be contained within a repository commit initiated after receiving the reference. This simplifies reasoning about the system considerably, especially in a concurrent setting.
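Step (1) of the solution, the fix-up pass after a cache sync, could look roughly like this. It is a sketch on a plain dict cache (id -> (refcount, size, csize)); stored_length is a hypothetical callable that returns the length of the stored object, however the repository provides it.

```python
def fix_unknown_csizes(chunks_cache, stored_length):
    """Replace csize=0 placeholders with the length of the stored object.

    Returns the number of entries that needed a repository lookup.
    """
    fixed = 0
    # list() so the dict can be updated while walking over it
    for chunk_id, (refcount, size, csize) in list(chunks_cache.items()):
        if csize == 0:  # unambiguous: a stored object is never empty
            chunks_cache[chunk_id] = (refcount, size, stored_length(chunk_id))
            fixed += 1
    return fixed


repo_objects = {b"a": b"x" * 7, b"b": b"y" * 3}
chunks_cache = {b"a": (1, 20, 0), b"b": (2, 9, 3)}
fixed = fix_unknown_csizes(chunks_cache, lambda cid: len(repo_objects[cid]))
print(fixed, chunks_cache)
```

Only chunks whose every remaining reference carries csize=0 hit the repository, which is why the scenario above is a worst case rather than the normal one.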

ThomasWaldmann · 2017-04-06T12:34:43Z

@enkore If you want to be able to implement the AdhocChunksCache later, I guess your proposal needs to address size as well, not just csize; see my comments above. Even with my idea, there is a problem if the files cache is lost.

enkore · 2017-04-06T12:43:05Z

A first solution could be to just not use a files cache in this case.

This may be a bit annoying with 1.1, though a quick back-of-the-envelope calculation suggests that in many cases it would still be worth it. E.g. I know that with --no-files-cache a system backup needs about 15 minutes, but a cache sync to one of my larger repos takes longer than that. So --no-cache-sync (implying --no-files-cache) would still be faster. And with 1.2 this becomes even less of an issue.

enkore · 2017-06-10T16:56:06Z

#2654 implements my proposal above to the letter.

ThomasWaldmann · 2022-06-12T15:34:40Z

#6763 removes csize everywhere.

something related is still in the entry length of a segment file entry and might come to the repo index via #6705.

ThomasWaldmann · 2022-07-04T16:23:30Z

This is now in borg2 branch, so the csize related issues are solved by removing it everywhere.

Update: merged into master now.

ThomasWaldmann · 2022-09-09T19:18:35Z

Note: size, csize, ctype and clevel are now available as separate encrypted metadata via repo.get(id, read_data=False) and RepoObj.parse_meta(chunk).

ThomasWaldmann · 2023-01-22T13:30:15Z

IIRC / AFAIK, there is nothing left we need to do right now, but we could do some improvements later:

improve the adhoc chunks cache
improve stats / info

ThomasWaldmann · 2024-08-30T22:20:28Z

Closing this, see comment above.

This was referenced Mar 30, 2017

get rid of csize #776

Closed

borg create without syncing cache #2313

Closed

enkore self-assigned this Jun 10, 2017

ThomasWaldmann mentioned this issue Jun 10, 2017

create: --no-cache-sync #2654

Merged

enkore added the c: cache label Jul 23, 2017

enkore removed their assignment Oct 14, 2017

ThomasWaldmann modified the milestones: beryllium, lithium Mar 29, 2022

ThomasWaldmann mentioned this issue Apr 16, 2022

borg2: it's coming! #6602

Open

ThomasWaldmann mentioned this issue Jun 11, 2022

borg2: there is no csize #6763

Merged

ThomasWaldmann self-assigned this Jun 12, 2022

ThomasWaldmann added the breaking label Jun 12, 2022

ThomasWaldmann modified the milestones: lithium, 2.0.0a2 Jun 26, 2022

ThomasWaldmann modified the milestones: 2.0.0a2, 2.0.0b1 Jul 4, 2022

ThomasWaldmann modified the milestones: 2.0.0b1, 2.0.0b2 Aug 4, 2022

ThomasWaldmann modified the milestones: 2.0.0b2, 2.0.0b3 Sep 9, 2022

ThomasWaldmann modified the milestones: 2.0.0b3, 2.0.0b4 Sep 29, 2022

ThomasWaldmann modified the milestones: 2.0.0b4, 2.0.0b5 Nov 26, 2022

ThomasWaldmann modified the milestones: 2.0.0b5, 2.0.0rc1 Jan 22, 2023

ThomasWaldmann removed their assignment Jun 8, 2023

ThomasWaldmann added this to breaking Jul 14, 2024

ThomasWaldmann moved this to hashindex / cache in breaking Jul 14, 2024

ThomasWaldmann closed this as completed Aug 30, 2024

ThomasWaldmann removed this from breaking Sep 7, 2024
