
Allow sealing pack files in a way that turns them into valid ZIP archives #133

Closed
wants to merge 22 commits

Conversation

zhubonan
Collaborator

@zhubonan zhubonan commented Jan 11, 2022

This PR addresses #124 and potentially also #123.

The PR is still a work in progress. My current aim is to address both issues here.

Brief summary

When writing to pack files, a ZIP-style local header is written before each record. To seal a pack, its integrity is checked, the local headers are updated with checksums, and a central directory is appended at the end. This operation does not block any concurrent read access. A new table is introduced to store the status of each pack file. Technically, writing to a sealed pack works just fine, and the pack can still be resealed, but this would waste some space. Hence the table is consulted and sealed packs are not selected for writing.

After sealing, the pack file becomes a ZIP archive. Each record is a file named after the first 16 characters of its hash value (duplicate names are possible, but the chance is low given the finite size of each sealed pack). The raw data may be accessed by simply extracting the archive with any ZIP-compatible tool.
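The ZIP-style local header described above can be sketched with the stdlib `struct` module. This is a minimal illustration of the standard PKZIP local-file-header layout, not the PR's actual code; the helper name is hypothetical:

```python
import struct

def make_local_header(filename: bytes, crc32: int = 0,
                      compressed_size: int = 0,
                      uncompressed_size: int = 0) -> bytes:
    """Build a minimal ZIP local file header for one record.

    CRC and sizes can be left as zero placeholders and patched in
    at sealing time, once the record's content is known.
    """
    signature = 0x04034B50   # local file header magic, "PK\x03\x04"
    version_needed = 20      # ZIP spec 2.0 (stored/deflated entries)
    flags = 0
    method = 0               # 0 = stored (no compression)
    mod_time = 0
    mod_date = 0
    fixed = struct.pack(
        "<IHHHHHIIIHH",
        signature, version_needed, flags, method,
        mod_time, mod_date,
        crc32, compressed_size, uncompressed_size,
        len(filename), 0,    # filename length, extra-field length
    )
    return fixed + filename

header = make_local_header(b"0" * 64)
assert len(header) == 30 + 64          # fixed part is always 30 bytes
assert header[:4] == b"PK\x03\x04"
```

The fixed part of the header is 30 bytes, so each record carries 30 bytes plus the filename length as overhead before the data.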

Todo

- The full hashkey is 64 bytes long, which is a bit too much. The need here is to ensure that each file has a different name. If two files share the same name the ZIP file is still valid, but it requires careful handling when extracting. Optionally, we can verify filename uniqueness when sealing the pack.
- The data descriptor is designed for non-seekable streams. Previously, it was used to store the file/compressed lengths. However, since the pack file only has to become a valid ZIP file at the time of sealing, we can skip the data descriptor and instead update the CRC, file_size and compressed_size at sealing time, together with the filename. This saves the 20-byte data descriptor for each record.
- Note that the CRC still needs to be stored. I make it go into the SQLite database for now, which means that old SQLite databases may need to be migrated. Alternatively, I can put it in a separate table; this can be decided later.
- The return values have been updated to contain the actual offset of the data, not including the header.
- The Pack table contains a list of sealed/archived packs.
- The previous pack format has to be validated differently from the ZIP-compatible format, due to the existence of local headers.
- The method to get the pack id to write to is also updated so that it only returns unsealed packs.
- Fix and suppress various pre-commit messages.
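The in-place update of the CRC and size fields at sealing time could look roughly like this. It is a sketch assuming uncompressed (stored) records and the standard 30-byte fixed header, where the CRC-32 field begins 14 bytes into the header; `patch_local_header` is a hypothetical helper, not code from this PR:

```python
import io
import struct
import zlib

def patch_local_header(fh, header_offset: int, data: bytes) -> None:
    """Patch the CRC-32, compressed-size and uncompressed-size fields
    of an already-written local file header in place.  The CRC-32
    field starts 14 bytes into the 30-byte fixed header."""
    crc = zlib.crc32(data) & 0xFFFFFFFF
    fh.seek(header_offset + 14)
    # stored (uncompressed) record: both sizes equal len(data)
    fh.write(struct.pack("<III", crc, len(data), len(data)))

# Demo on an in-memory "pack": placeholder zeros get patched at seal time.
pack = io.BytesIO()
name = b"a" * 64
pack.write(b"PK\x03\x04"
           + struct.pack("<HHHHHIIIHH", 20, 0, 0, 0, 0, 0, 0, 0, len(name), 0))
pack.write(name)
payload = b"hello world"
pack.write(payload)
patch_local_header(pack, 0, payload)

pack.seek(14)
crc, csize, usize = struct.unpack("<III", pack.read(12))
assert crc == zlib.crc32(payload) & 0xFFFFFFFF
assert csize == usize == len(payload)
```

Since only fixed-width numeric fields are rewritten, the patch never changes the record's length, which is what makes skipping the data descriptor safe.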
@giovannipizzi
Member

Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?

@zhubonan
Collaborator Author

zhubonan commented Jan 11, 2022

Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?

It is just for space saving, since the file name has to be included in both the central directory and the local header. Including the full hash (taking 64 bytes as ASCII) is not really necessary since it can be recomputed from the content.
We could perhaps check for clashes when sealing the data.

@giovannipizzi
Member

For 10 million objects the cost would be ~600MB (times two = ~1.2GB) with the full hash. So with 16 bytes we spare ~900MB. I'm tempted to say it's OK not to spare this space in exchange for simpler code? (Also, 10 million objects would most probably fill many packs, so the 900MB will most probably be a small % of the total size in most cases.)
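The estimate above can be checked with a quick back-of-the-envelope calculation, assuming each filename is stored twice (once in the local header, once in the central directory):

```python
n_objects = 10_000_000
copies = 2                       # local header + central directory entry
full_name, short_name = 64, 16   # bytes: full hex SHA-256 vs truncated

cost_full = n_objects * full_name * copies    # 1_280_000_000 bytes ~ 1.2 GB
cost_short = n_objects * short_name * copies  #   320_000_000 bytes
saved = cost_full - cost_short                #   960_000_000 bytes ~ 900 MB

assert cost_full == 1_280_000_000
assert saved == 960_000_000
```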

@zhubonan
Collaborator Author

Hmm I see, in that case I will just go with the full hash. The part that gets simplified is that we don't have to check for duplicates when assigning file names during the sealing process. When updating, the filename length still has to be verified, otherwise there is a risk of overwriting the actual data.

Another added benefit of full-hash filenames is that the sealed pack can be loosened by simply decompressing it and copying the files to the loose object folder. This might be useful when one wants to move data from one repository to another.

@giovannipizzi
Member

Another added benefit of full-hash filenames is that the sealed pack can be loosened by simply decompressing it and copying the files to the loose object folder. This might be useful when one wants to move data from one repository to another.

But in this case you need to store the hashkey sharded (e.g. 01/839ab302...), depending on the settings of the container. (The default is one level of length 2, but it's configurable.)

I don't know whether in total this would increase or decrease the size of the ZIP? Filenames are 2 bytes shorter (I don't know if only the actual filename or the full path has to be put in front of every file?), but now we also have to store the folder entries, so probably this makes things even more expensive?

@zhubonan
Collaborator Author

Oh I see, my bad. I thought that the loose files were stored in a flat directory 😓
With a sharded directory, one should in principle be able to write a simple bash script to put each file in the correct folder with the correct name. Maybe that is a bit of a stretch as intended usage, though.
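The re-sharding step described here could be sketched as follows. This is a hypothetical Python helper rather than the bash script mentioned, and it assumes the default one-level, length-2 sharding:

```python
import os
import shutil
import tempfile

def reshard(src_dir: str, loose_dir: str, shard_len: int = 2) -> None:
    """Move files named by their full hash into a sharded loose layout,
    e.g. '01839ab3...' -> '01/839ab3...'.  Hypothetical helper; the
    one-level, length-2 sharding is the default mentioned above, but
    it is configurable per container."""
    for name in os.listdir(src_dir):
        subdir = os.path.join(loose_dir, name[:shard_len])
        os.makedirs(subdir, exist_ok=True)
        shutil.move(os.path.join(src_dir, name),
                    os.path.join(subdir, name[shard_len:]))

# Demo on temporary directories: one extracted file gets resharded.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
hashname = "ab" + "0" * 62          # a (fake) 64-char hex digest
open(os.path.join(src, hashname), "w").close()
reshard(src, dst)
assert os.path.exists(os.path.join(dst, "ab", "0" * 62))
```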

@giovannipizzi giovannipizzi changed the title allow sealling pack files which turn them into ZIP archives allow sealing pack files in a way that turns them into valid ZIP archives Jan 14, 2022
@giovannipizzi giovannipizzi changed the title allow sealing pack files in a way that turns them into valid ZIP archives Allow sealing pack files in a way that turns them into valid ZIP archives Jan 14, 2022
Now the file name will be 64 characters (bytes) long, consisting of
the full hexdigest of the SHA256 hash.
@codecov

codecov bot commented May 11, 2022

Codecov Report

Merging #133 (3ccd60a) into develop (16e6ff9) will decrease coverage by 3.13%.
The diff coverage is 77.20%.

@@             Coverage Diff             @@
##           develop     #133      +/-   ##
===========================================
- Coverage    99.52%   96.38%   -3.14%     
===========================================
  Files            8        9       +1     
  Lines         1676     1936     +260     
===========================================
+ Hits          1668     1866     +198     
- Misses           8       70      +62     
Impacted Files Coverage Δ
disk_objectstore/zipsupport.py 61.06% <61.06%> (ø)
disk_objectstore/container.py 97.66% <85.59%> (-1.74%) ⬇️
disk_objectstore/utils.py 99.43% <96.66%> (-0.17%) ⬇️
disk_objectstore/database.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16e6ff9...3ccd60a. Read the comment docs.

@zhubonan
Collaborator Author

Just want to recap what we discussed previously. The sealing would be treated as a slow operation, similar to repack. I will try to see if we can use the zipfile stdlib to directly construct a ZIP file and have the offsets/sizes of each entry recorded in the SQLite database. This way the code can be simplified a lot and is easier to maintain.
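The zipfile-based approach can be sketched like this: the stdlib already exposes each entry's local-header offset via `ZipInfo.header_offset`, which is the kind of value that would be recorded in SQLite. This is an illustration only, not the PR's implementation:

```python
import io
import zipfile

# Build a small "pack" directly with the zipfile stdlib, using stored
# (uncompressed) entries named by a 64-char hash, as discussed above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("a" * 64, b"first object")
    zf.writestr("b" * 64, b"second object")

# Re-open for reading and collect the local-header offset of each
# entry; these offsets could be stored in the SQLite index.
with zipfile.ZipFile(buf) as zf:
    offsets = {info.filename: info.header_offset for info in zf.infolist()}

# Every recorded offset points at a local file header signature.
raw = buf.getvalue()
for off in offsets.values():
    assert raw[off:off + 4] == b"PK\x03\x04"
assert offsets["a" * 64] == 0   # first entry starts at the file head
```

Letting `zipfile` write the headers and central directory removes the hand-rolled header code entirely; only the offset bookkeeping remains on the container side.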

I also did some investigation on my repository. It is true that most of the objects are small files: the median size is only 2 kbytes! But the small files also take a relatively small portion of the total size, despite their large number. So an overhead of about 100-200 bytes per object is a lot for the small objects, but the overall impact on the total size is much smaller. My repo is 80GB on disk, and the estimated overhead based on the total number of objects is about 100-200MB. So I think that having archived pack files would still have minimal overall impact on the total size on disk.

@zhubonan
Collaborator Author

close in favour of #138

@zhubonan zhubonan closed this May 30, 2022