Data descriptor for no compressed file #17
Comments
Ah - it might indeed be a valid ZIP, but I think this would make it impossible to read the ZIP file with stream-unzippers like https://github.com/uktrade/stream-unzip, because to stream-unzip a file, the file either has to have its compressed size up front, or use a compression format that indicates its own end (like deflate). So the comment in the README is a bit misleading... it should probably say "in order to stream unZIP" or something like that. To check though, do you have a particular use case for stream-zipping large files without compression?
Thanks. Then I see the trouble here. Sadly ZIP isn't really designed for streaming. In my scenario I'm working on a file packaging feature for a web service. It fetches a group of images from cloud storage and makes a ZIP file on the fly. Since the original images are already PNGs, it doesn't make much sense to compress them again. Some images can be up to 300MB, so streaming should be an ideal way to avoid OOM.
Yep!
Ah I see... yes a reasonable reason to want to not compress. One thing I notice, usually cloud storage does give the size of a file? (e.g. S3 gives size in the content-length header in GET responses)... So I think you have roughly 4 courses of action you can take
I'm in favor of this one, but there's one problem: it's easy to know the file size, but having the CRC32 in advance is non-trivial. To fix this we could keep the data descriptor. Hopefully the unzip implementation reads the file sizes from the local file header and the CRC32 from the data descriptor. This looks quite tricky though, so I'm not sure about its compatibility.
Oh you're right! I forgot that without a data descriptor, CRC32 is needed up front... yeah, a bit tricky...
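To illustrate why the CRC32 is the awkward part: the size can come from metadata, but the checksum needs a full pass over the bytes before the local file header can be written. A minimal sketch using only the standard library, where `fake_download` is a hypothetical stand-in for a cloud storage download and not part of any real API:

```python
import zlib


def crc32_and_size(chunks):
    """Return (crc32, total_size) for an iterable of bytes, one chunk at a time."""
    crc = 0
    size = 0
    for chunk in chunks:
        # zlib.crc32 accepts the running checksum as its second argument,
        # so the whole file never has to be in memory at once.
        crc = zlib.crc32(chunk, crc)
        size += len(chunk)
    return crc & 0xFFFFFFFF, size


def fake_download(total=1_000_000, chunk_size=65_536):
    """Hypothetical stand-in for streaming an object from cloud storage."""
    block = bytes(range(256)) * (chunk_size // 256)
    sent = 0
    while sent < total:
        piece = block[: min(chunk_size, total - sent)]
        sent += len(piece)
        yield piece


crc, size = crc32_and_size(fake_download())
```

The memory use stays flat, but the object has to be read once for the checksum and again for the ZIP body (or the checksum has to be stored somewhere alongside the object), which is the cost being weighed here.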
Hmmm... my instinct is that this won't have good compatibility... now that you've reminded me what's in the data descriptor, it feels odd for something to use the size from the local file header, say, but ignore its CRC32 value. Although I have just thought of another choice. The deflate algorithm supports a non-compression mode (block type "0"). So it's still deflate, and it still indicates its own end, and so works fine with data descriptors, but nothing would be compressed. There might be a few wasted extra bytes in the stream, but the overhead would be fairly low. Off the top of my head, something like 3 bytes per 64k... something like that... If you wanted to use stream-zip for that, there would have to be a change, but it would just be passing in some options to zlib, I think. Nothing at the "zip" level would be different.
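The "deflate without compression" idea can be tried with Python's standard `zlib` module alone. This is only a sketch of the technique, not stream-zip's own code: level 0 makes zlib emit stored blocks, and a negative `wbits` requests a raw deflate stream of the kind used inside ZIP members. In the deflate format each stored block carries roughly 5 bytes of framing for up to 65,535 bytes of data, so the overhead stays small.

```python
import zlib


def deflate_stored(chunks):
    """Yield a raw deflate stream for `chunks` at level 0 (stored blocks only)."""
    compressor = zlib.compressobj(level=0, wbits=-zlib.MAX_WBITS)
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:
            yield out
    # flush() writes the final block, which marks the end of the stream.
    yield compressor.flush()


payload = bytes(range(256)) * 4096  # 1 MiB of sample data
chunks = [payload[i:i + 65536] for i in range(0, len(payload), 65536)]
stream = b"".join(deflate_stored(chunks))
overhead = len(stream) - len(payload)

# A reader can find the end of the member without knowing its size:
# bytes after the final block are left in unused_data.
reader = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
restored = reader.decompress(stream + b"NEXT")
```

Because the stream marks its own end, a streaming reader can stop at the right byte and then pick up the CRC32 and sizes from the data descriptor that follows.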
Have now made it possible to set the zlib options for the zip file, by adding a So to force no compression, and avoid buffering the file into memory, can use:
For the limitation:
To avoid buffering all contents, I'm wondering if we can add a data descriptor after uncompressed data as well, so we can simply set the CRC32 and sizes to 0 in the local file header, just like what we do for compressed files. From the original CPython zipfile implementation, it seems to decide this only from whether the output is seekable:
https://github.com/python/cpython/blob/96b344c2f15cb09251018f57f19643fe20637392/Lib/zipfile.py#L1610-L1611
So I guess this structure should also be valid?
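The structure in question can be reproduced with CPython's own `zipfile` by writing a stored member to a sink that cannot seek. A minimal sketch, assuming a recent Python 3; `WriteOnly` and the helper names are hypothetical and exist only for this illustration:

```python
import io
import struct
import zipfile


class WriteOnly:
    """Hypothetical non-seekable sink: exposes only write() and flush()."""

    def __init__(self):
        self._buf = io.BytesIO()

    def write(self, data):
        return self._buf.write(data)

    def flush(self):
        pass

    def getvalue(self):
        return self._buf.getvalue()


def zip_stored_unseekable(name, chunks):
    """Write `chunks` as one uncompressed member to a non-seekable sink."""
    sink = WriteOnly()
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_STORED) as zf:
        with zf.open(name, mode="w") as member:
            for chunk in chunks:
                member.write(chunk)
    return sink.getvalue()


def first_local_header(data):
    """Parse the fixed 30-byte local file header at the start of the archive."""
    fields = struct.unpack("<4sHHHHHIIIHH", data[:30])
    keys = ("sig", "version", "flags", "method", "time", "date",
            "crc", "csize", "usize", "name_len", "extra_len")
    return dict(zip(keys, fields))


archive = zip_stored_unseekable("image.png", [b"x" * 1000, b"y" * 500])
header = first_local_header(archive)
uses_data_descriptor = bool(header["flags"] & 0x08)
```

In this sketch the local header carries zeros for the CRC32 and both sizes, with the real values written after the data. A reader that uses the central directory, such as `zipfile` itself, can still open the result; a streaming reader that sees only the local header has no way to tell where the stored bytes end, which is the compatibility concern raised in the comments.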