Allow sealing pack files in a way that turns them into valid ZIP archives #133
Conversation
The full hash key is 64 bytes long, which is a bit too much. The need here is to ensure each file has a different name. If two files share the same name the ZIP file is still valid, but it requires careful handling when extracting the files. Optionally, we can verify filename uniqueness when sealing the pack.
The data descriptor is designed for non-seekable streams. Previously, it was used to store the file/compressed lengths. However, since the pack file only needs to become a valid ZIP file at the time of sealing, we can skip the data descriptor and instead update the CRC, file_size and compressed_size at sealing time, together with the filename. This saves the 20-byte data descriptor for each record. Note that the CRC still needs to be stored; for now I store it in the SQLite database. This means that old SQLite databases may need to be migrated. Alternatively, I can put it in a different table. This can be decided later.
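To illustrate the patching idea (a minimal sketch, not the PR's code: in a ZIP local file header, the CRC-32, compressed size and uncompressed size sit at fixed offsets 14, 18 and 22 from the `PK\x03\x04` signature, so they can be filled in place at sealing time):

```python
import struct

def patch_local_header(fobj, header_offset, crc32, comp_size, uncomp_size):
    """Sketch only: fill in the fields left blank at write time.

    `fobj` is the pack file opened in 'r+b' mode and `header_offset`
    is the offset of the local header signature within the pack.
    """
    # CRC-32, compressed size and uncompressed size occupy bytes
    # 14-25 of the 30-byte local file header.
    fobj.seek(header_offset + 14)
    fobj.write(struct.pack('<LLL', crc32, comp_size, uncomp_size))
```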
The return values have been updated to contain the actual offset of the data, not including the header.
The Pack table contains a list of sealed/archived packs.
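Something along these lines (a sketch only; the actual table and column names in the PR may differ):

```python
import sqlite3

conn = sqlite3.connect("packs.idx")  # illustrative path
conn.execute(
    """CREATE TABLE IF NOT EXISTS pack (
           pack_id INTEGER PRIMARY KEY,
           state   TEXT NOT NULL DEFAULT 'open'  -- e.g. 'open' / 'sealed'
       )"""
)
# Pick a pack to write to, skipping sealed ones:
row = conn.execute(
    "SELECT pack_id FROM pack WHERE state != 'sealed' ORDER BY pack_id LIMIT 1"
).fetchone()
```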
The previous pack format has to be validated differently from the ZIP-compatible format, due to the presence of local headers.
The method to get the pack id to write to is also updated so that it only returns unsealed packs.
Fix and suppress various pre-commit messages.
Thanks @zhubonan! Is there a reason to limit the filename to only 16 characters?
It is just for space saving, since the file name has to be included in both the central directory and the local header. Including the full hash (taking 64 bytes as ASCII) is not really necessary, since it can be recomputed from the content.
For 10 million objects the cost would be ~600MB (times two = ~1.2GB) with the full hash. So with 16 bytes we spare ~900MB. I'm tempted to say it's OK not to spare this space if we get simpler code? (Also, 10 million objects would most probably fill many packs, so the 900MB will most probably be a small % of the total size in most cases.)
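The arithmetic behind these numbers, for the record (each filename is stored twice per entry: once in the local header, once in the central directory):

```python
n = 10_000_000
full = 2 * 64 * n    # ~1.28 GB with the full 64-char hex digest
short = 2 * 16 * n   # ~0.32 GB with 16-char names
print(f"saved: {(full - short) / 1e9:.2f} GB")  # ~0.96 GB, the ~900MB above
```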
Hmm I see, in that case I will just go with the full hash. The part that gets simplified is that we don't have to check for duplications when assigning file names during the sealing process. When updating the filename, its length still has to be verified, otherwise there is a risk of overwriting the actual data. Another added benefit of full-hash filenames is that a sealed pack can be loosened by simply decompressing it and copying the files to the loose object folder. This might be useful when one wants to move data from one repository to another.
But in this case you need to store the hash key sharded (e.g. 01/839ab302...), depending on the settings of the container. (The default is 1 level of length 2, but it's configurable.) I don't know whether in total this would increase or decrease the size of the ZIP: filenames are 2 bytes shorter (I don't know if only the actual filename or the full path has to be put in front of every file?), but now we also have to store which folder each file is in, so probably this makes things even more expensive?
Oh I see, my bad. I thought that the loose files were stored in a flat directory 😓
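For reference, "loosening" a sealed pack could then look roughly like this (a sketch assuming the default sharding of one level of length 2; function and path names are illustrative):

```python
import pathlib
import zipfile

def loosen(pack_path, loose_dir, shard_len=2):
    """Sketch: extract a sealed pack into a sharded loose-object folder."""
    loose_dir = pathlib.Path(loose_dir)
    with zipfile.ZipFile(pack_path) as zf:
        for name in zf.namelist():  # full 64-char hex digests
            # e.g. '0183ab...' -> loose_dir / '01' / '83ab...'
            target = loose_dir / name[:shard_len] / name[shard_len:]
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(name))
```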
Now the file name will be 64 characters (bytes) long, consisting of the full hex digest of the SHA-256 hash.
Codecov Report
@@            Coverage Diff             @@
##           develop     #133      +/-  ##
===========================================
- Coverage    99.52%   96.38%    -3.14%
===========================================
  Files            8        9        +1
  Lines         1676     1936      +260
===========================================
+ Hits          1668     1866      +198
- Misses           8       70       +62
Continue to review full report at Codecov.
Just want to recap what we discussed previously. Sealing will be treated as a slow operation, similar to repacking. I will try to see if we can use the zipfile stdlib to directly construct a ZIP file and have the offset/size/length of each entry recorded in the SQLite database. This way the code can be simplified a lot and is easier to maintain. I also did some investigation on my repository. It is true that most of the objects are small files: the median size is only about 2 kbytes! But the small files also take a relatively small portion of the total size, despite their large number. So an overhead of about 100-200 bytes per object is a lot for the small objects, but the overall impact on the total size is much smaller. My repo is 80GB on disk, and the estimated overhead based on the total number of objects is about 100-200MB. So I think that having archived pack files would still have minimal overall impact on the total size on disk.
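Roughly what the stdlib-based approach could look like (a sketch under assumed bookkeeping; the constant 30 is the fixed size of a ZIP local file header, and the paths are illustrative):

```python
import hashlib
import zipfile

payload = b"some object content"
name = hashlib.sha256(payload).hexdigest()  # 64-char hex digest as filename

with zipfile.ZipFile("pack-0.zip", "w") as zf:
    zf.writestr(name, payload)
    info = zf.getinfo(name)
    # Offset of the raw data = local header offset + fixed 30-byte header
    # + filename + extra field. This, together with info.compress_size,
    # info.file_size and info.CRC, is what would be recorded in SQLite.
    data_offset = (info.header_offset + 30
                   + len(info.filename.encode("utf-8")) + len(info.extra))
```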
Closed in favour of #138.
This PR addresses #124 and potentially also #123.
The PR is still a work in progress. My current aim is to have both issues addressed here.
Brief summary
When writing to pack files, a ZIP-style local header is written before each record. To seal a pack, its integrity is checked, the local headers are updated with checksums, and a central directory is appended at the end. This operation does not block any concurrent read access. A new table is introduced to store the status of each pack file. Technically, writing to a sealed pack works just fine, and the pack can be resealed, but this would waste some space. Hence the table is consulted, and sealed packs are not selected for writing.
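A minimal sketch of the final write in the seal step (the struct layouts follow the ZIP spec; the entry bookkeeping and names are illustrative, compression method is shown as stored, and ZIP64 is ignored):

```python
import struct

def append_central_directory(fobj, entries):
    """entries: list of (name_bytes, crc, comp_size, uncomp_size, header_offset)."""
    fobj.seek(0, 2)              # append at the end of the pack file
    cd_start = fobj.tell()
    for name, crc, csize, usize, hdr_off in entries:
        fobj.write(struct.pack(
            '<4s6H3L5H2L',
            b'PK\x01\x02',       # central directory file header signature
            20, 20, 0, 0, 0, 0,  # versions, flags, method (0=stored), time, date
            crc, csize, usize,
            len(name), 0, 0,     # filename/extra/comment lengths
            0, 0,                # disk number start, internal attributes
            0, hdr_off))         # external attributes, local header offset
        fobj.write(name)
    cd_size = fobj.tell() - cd_start
    n = len(entries)
    fobj.write(struct.pack(      # end-of-central-directory record
        '<4s4H2LH', b'PK\x05\x06', 0, 0, n, n, cd_size, cd_start, 0))
```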
After sealing, the pack file becomes a ZIP archive. Each record is a file named by the first 16 characters of its hash value (there can be duplicate names, but the chance is low given the limited size of each sealed pack). The raw data can be accessed by simply extracting the archive using any ZIP-compatible tool.
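For example, with Python's stdlib (the pack path is illustrative):

```python
import zipfile

# A sealed pack is a plain ZIP archive and can be extracted as-is.
with zipfile.ZipFile("packs/0") as zf:
    zf.extractall("extracted/")
```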
Todo