
Allow sealing pack files in a way that turns them into valid ZIP archives #133

Closed
wants to merge 22 commits

Conversation

zhubonan
Collaborator

@zhubonan zhubonan commented Jan 11, 2022

This PR addresses #124 and potentially also #123.

The PR is still a work in progress. My current aim is to address both issues here.

Brief summary

When writing to pack files, a ZIP-style local header is written before each record. To seal a pack, its integrity is checked, the local headers are updated with checksums, and a central directory is appended at the end. This operation does not block any concurrent read access. A new table is introduced to store the status of each pack file. Technically, writing to a sealed pack works just fine, and the pack can still be resealed, but this would waste some space. Hence the table is consulted and sealed packs are not selected for writing.

After sealing, the pack file becomes a ZIP archive. Each record is a file named after the first 16 characters of its hash value (duplicate names are possible, but the chance is low given the finite size of each sealed pack). The raw data may be accessed by simply extracting the archive with any ZIP-compatible tool.
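The ZIP-style local header described above can be sketched with the stdlib `struct` module. This is a minimal illustration of the standard PKZIP local-file-header layout, not the PR's actual code; the helper name is hypothetical:

```python
import struct

def make_local_header(filename: bytes, crc32: int = 0,
                      compressed_size: int = 0,
                      uncompressed_size: int = 0) -> bytes:
    """Build a minimal ZIP local file header for one record.

    CRC and sizes can be left as zero placeholders and patched in
    at sealing time, once the record's content is known.
    """
    signature = 0x04034B50   # local file header magic, "PK\x03\x04"
    version_needed = 20      # ZIP spec 2.0 (stored/deflated entries)
    flags = 0
    method = 0               # 0 = stored (no compression)
    mod_time = 0
    mod_date = 0
    fixed = struct.pack(
        "<IHHHHHIIIHH",
        signature, version_needed, flags, method,
        mod_time, mod_date,
        crc32, compressed_size, uncompressed_size,
        len(filename), 0,    # filename length, extra-field length
    )
    return fixed + filename

header = make_local_header(b"0" * 64)
assert len(header) == 30 + 64          # fixed part is always 30 bytes
assert header[:4] == b"PK\x03\x04"
```

The fixed part of the header is 30 bytes, so each record carries 30 bytes plus the filename length as overhead before the data.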

Todo

- The full hashkey is 64 bytes long, which is a bit too much. The need here is to ensure that each file has a different name. If two files share the same name the ZIP file is still valid, but it requires careful handling when extracting. Optionally, we can verify filename uniqueness when sealing the pack.
- The data descriptor is designed for non-seekable streams. Previously, it was used to store the file/compressed lengths. However, since the pack file only has to become a valid ZIP file at the time of sealing, we can skip the data descriptor and instead update the CRC, file_size and compressed_size at sealing time, together with the filename. This saves the 20-byte data descriptor for each record.
- Note that the CRC still needs to be stored. I make it go into the SQLite database for now, which means that old SQLite databases may need to be migrated. Alternatively, I can put it in a separate table; this can be decided later.
- The return values have been updated to contain the actual offset of the data, not including the header.
- The Pack table contains a list of sealed/archived packs.
- The previous pack format has to be validated differently from the ZIP-compatible format, due to the existence of local headers.
- The method to get the pack id to write to is also updated so that it only returns unsealed packs.
- Fix and suppress various pre-commit messages.
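The in-place update of the CRC and size fields at sealing time could look roughly like this. It is a sketch assuming uncompressed (stored) records and the standard 30-byte fixed header, where the CRC-32 field begins 14 bytes into the header; `patch_local_header` is a hypothetical helper, not code from this PR:

```python
import io
import struct
import zlib

def patch_local_header(fh, header_offset: int, data: bytes) -> None:
    """Patch the CRC-32, compressed-size and uncompressed-size fields
    of an already-written local file header in place.  The CRC-32
    field starts 14 bytes into the 30-byte fixed header."""
    crc = zlib.crc32(data) & 0xFFFFFFFF
    fh.seek(header_offset + 14)
    # stored (uncompressed) record: both sizes equal len(data)
    fh.write(struct.pack("<III", crc, len(data), len(data)))

# Demo on an in-memory "pack": placeholder zeros get patched at seal time.
pack = io.BytesIO()
name = b"a" * 64
pack.write(b"PK\x03\x04"
           + struct.pack("<HHHHHIIIHH", 20, 0, 0, 0, 0, 0, 0, 0, len(name), 0))
pack.write(name)
payload = b"hello world"
pack.write(payload)
patch_local_header(pack, 0, payload)

pack.seek(14)
crc, csize, usize = struct.unpack("<III", pack.read(12))
assert crc == zlib.crc32(payload) & 0xFFFFFFFF
assert csize == usize == len(payload)
```

Since only fixed-width numeric fields are rewritten, the patch never changes the record's length, which is what makes skipping the data descriptor safe.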
@giovannipizzi
Member

Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?

@zhubonan
Collaborator Author

zhubonan commented Jan 11, 2022

Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?

It is just for space saving, since the file name has to be included in both the central directory and the local header. Including the full hash (taking 64 bytes as ASCII) is not really necessary since it can be recomputed from the content.
We could perhaps check for clashes when sealing the data.

@giovannipizzi
Member

For 10 million objects the cost would be ~600MB (times two = ~1.2GB) with the full hash. So with 16 bytes we spare ~900MB. I'm tempted to say it's OK not to spare this space in exchange for simpler code? (Also, 10 million objects would most probably fill many packs, so the 900MB will most probably be a small % of the total size in most cases.)
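The estimate above can be checked with a quick back-of-the-envelope calculation, assuming each filename is stored twice (once in the local header, once in the central directory):

```python
n_objects = 10_000_000
copies = 2                       # local header + central directory entry
full_name, short_name = 64, 16   # bytes: full hex SHA-256 vs truncated

cost_full = n_objects * full_name * copies    # 1_280_000_000 bytes ~ 1.2 GB
cost_short = n_objects * short_name * copies  #   320_000_000 bytes
saved = cost_full - cost_short                #   960_000_000 bytes ~ 900 MB

assert cost_full == 1_280_000_000
assert saved == 960_000_000
```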

@zhubonan
Collaborator Author

Hmm I see, in that case I will just go with the full hash. The part that gets simplified is that we don't have to check for duplicates when assigning file names during the sealing process. When updating, the filename length still has to be verified, otherwise there is a risk of overwriting the actual data.

Another added benefit of full-hash filenames is that the sealed pack can be loosened by simply decompressing it and copying the files to the loose object folder. This might be useful when one wants to move data from one repository to another.

@giovannipizzi
Member

Another added benefit of full-hash filenames is that the sealed pack can be loosened by simply decompressing it and copying the files to the loose object folder. This might be useful when one wants to move data from one repository to another.

But in this case you need to store the hashkey sharded (e.g. 01/839ab302...), depending on the settings of the container. (The default is one level of length 2, but it's configurable.)

I don't know whether in total this would increase or decrease the size of the ZIP? Filenames are 2 bytes shorter (I don't know if only the actual filename or the full path has to be put in front of every file?), but now we also have to store the folder entries, so probably this makes things even more expensive?

@zhubonan
Collaborator Author

Oh I see, my bad. I thought that the loose files were stored in a flat directory 😓
With a sharded directory, one should in principle be able to write a simple bash script to put each file in the correct folder with the correct name. Maybe that is a bit of a stretch as intended usage, though.
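The re-sharding step described here could be sketched as follows. This is a hypothetical Python helper rather than the bash script mentioned, and it assumes the default one-level, length-2 sharding:

```python
import os
import shutil
import tempfile

def reshard(src_dir: str, loose_dir: str, shard_len: int = 2) -> None:
    """Move files named by their full hash into a sharded loose layout,
    e.g. '01839ab3...' -> '01/839ab3...'.  Hypothetical helper; the
    one-level, length-2 sharding is the default mentioned above, but
    it is configurable per container."""
    for name in os.listdir(src_dir):
        subdir = os.path.join(loose_dir, name[:shard_len])
        os.makedirs(subdir, exist_ok=True)
        shutil.move(os.path.join(src_dir, name),
                    os.path.join(subdir, name[shard_len:]))

# Demo on temporary directories: one extracted file gets resharded.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
hashname = "ab" + "0" * 62          # a (fake) 64-char hex digest
open(os.path.join(src, hashname), "w").close()
reshard(src, dst)
assert os.path.exists(os.path.join(dst, "ab", "0" * 62))
```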

@giovannipizzi giovannipizzi changed the title allow sealling pack files which turn them into ZIP archives allow sealing pack files in a way that turns them into valid ZIP archives Jan 14, 2022
@giovannipizzi giovannipizzi changed the title allow sealing pack files in a way that turns them into valid ZIP archives Allow sealing pack files in a way that turns them into valid ZIP archives Jan 14, 2022
Now the file name will be 64 characters (bytes) long, consisting of
the full hexdigest of the SHA256 hash.
@codecov

codecov bot commented May 11, 2022

Codecov Report

Merging #133 (3ccd60a) into develop (16e6ff9) will decrease coverage by 3.13%.
The diff coverage is 77.20%.

@@             Coverage Diff             @@
##           develop     #133      +/-   ##
===========================================
- Coverage    99.52%   96.38%   -3.14%     
===========================================
  Files            8        9       +1     
  Lines         1676     1936     +260     
===========================================
+ Hits          1668     1866     +198     
- Misses           8       70      +62     
Impacted Files Coverage Δ
disk_objectstore/zipsupport.py 61.06% <61.06%> (ø)
disk_objectstore/container.py 97.66% <85.59%> (-1.74%) ⬇️
disk_objectstore/utils.py 99.43% <96.66%> (-0.17%) ⬇️
disk_objectstore/database.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16e6ff9...3ccd60a. Read the comment docs.

@zhubonan
Collaborator Author

Just want to recap what we discussed previously. The sealing would be treated as a slow operation, similar to repack. I will try to see if we can use the zipfile stdlib to directly construct a ZIP file and have the offsets/sizes of each entry recorded in the SQLite database. This way the code can be simplified a lot and is easier to maintain.
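The zipfile-based approach can be sketched like this: the stdlib already exposes each entry's local-header offset via `ZipInfo.header_offset`, which is the kind of value that would be recorded in SQLite. This is an illustration only, not the PR's implementation:

```python
import io
import zipfile

# Build a small "pack" directly with the zipfile stdlib, using stored
# (uncompressed) entries named by a 64-char hash, as discussed above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("a" * 64, b"first object")
    zf.writestr("b" * 64, b"second object")

# Re-open for reading and collect the local-header offset of each
# entry; these offsets could be stored in the SQLite index.
with zipfile.ZipFile(buf) as zf:
    offsets = {info.filename: info.header_offset for info in zf.infolist()}

# Every recorded offset points at a local file header signature.
raw = buf.getvalue()
for off in offsets.values():
    assert raw[off:off + 4] == b"PK\x03\x04"
assert offsets["a" * 64] == 0   # first entry starts at the file head
```

Letting `zipfile` write the headers and central directory removes the hand-rolled header code entirely; only the offset bookkeeping remains on the container side.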

I also did some investigation on my repository. It is true that most of the objects are small files: the median size is only 2 kbytes! But the small files also take a relatively small portion of the total size, despite their large number. So an overhead of about 100-200 bytes per object is a lot for the small objects, but the overall impact on the total size is much smaller. My repo is 80GB on disk, and the estimated overhead based on the total number of objects is about 100-200MB. So I think that having archived pack files would still have minimal overall impact on the total size on disk.

@zhubonan
Collaborator Author

close in favour of #138

@zhubonan zhubonan closed this May 30, 2022