Data descriptor for no compressed file #17

Closed

ArchangelSDY opened this issue Feb 6, 2022 · 6 comments

@ArchangelSDY
For the limitation:

No compression is supported via the NO_COMPRESSION_* constants as in the above examples. However in these cases the entire contents of each uncompressed file is buffered in memory, and so should not be used for large files. This is because for uncompressed data, its size and CRC32 must be before it in the ZIP file.

To avoid buffering all contents, I'm wondering if we can add a data descriptor after the uncompressed data as well, so we can simply set 0 in the local file header, just like we do for compressed files. The original cpython zipfile implementation seems to decide based only on whether the output is seekable:

https://github.com/python/cpython/blob/96b344c2f15cb09251018f57f19643fe20637392/Lib/zipfile.py#L1610-L1611

So I guess this structure should also be valid?
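To make the proposed structure concrete, here's a minimal sketch of a stored (method 0) local file header with general purpose bit 3 set, followed by a data descriptor; the file name `f.txt` and the zeroed timestamp fields are just placeholders:

```python
import struct
import zlib

data = b"example file contents"

# Local file header: sizes and CRC-32 zeroed, bit 3 of the general
# purpose flag set to say "a data descriptor follows the file data"
local_file_header = struct.pack(
    "<IHHHHHIIIHH",
    0x04034B50,     # local file header signature
    20,             # version needed to extract
    0b1000,         # general purpose bit flag: bit 3 = data descriptor
    0,              # compression method: 0 = stored (no compression)
    0, 0,           # modification time/date (zeroed for brevity)
    0,              # CRC-32: deferred to the data descriptor
    0,              # compressed size: deferred
    0,              # uncompressed size: deferred
    len(b"f.txt"),  # file name length
    0,              # extra field length
) + b"f.txt"

# Data descriptor written after the file data, once the values are known
data_descriptor = struct.pack(
    "<IIII",
    0x08074B50,        # data descriptor signature (optional but common)
    zlib.crc32(data),  # CRC-32, now known
    len(data),         # compressed size (equals uncompressed for stored)
    len(data),         # uncompressed size
)
```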

@ArchangelSDY changed the title from "Data descriptor for no compression" to "Data descriptor for no compressed file" on Feb 6, 2022

michalc commented Feb 6, 2022

No compression is supported via the NO_COMPRESSION_* constants as in the above examples. However in these cases the entire contents of each uncompressed file is buffered in memory, and so should not be used for large files. This is because for uncompressed data, its size and CRC32 must be before it in the ZIP file.

To avoid buffering all contents, I'm wondering if we can add a data descriptor after uncompressed data as well so we can simply set 0 in the local file header [...] So I guess this structure should also be valid?

Ah - it might indeed be a valid ZIP, but I think that this would make it impossible to read the ZIP file with stream-unzippers like https://github.com/uktrade/stream-unzip , because to stream unZIP a file, it either has to have the compressed size up-front, or use some sort of compression format that indicates its own end (like deflate).

So the comment in the README is a bit misleading... it should probably say "in order to stream unZIP" or something like that.

To check though, do you have a particular use case for stream zipping large files without compression?

@ArchangelSDY (Author)

Thanks, now I see the trouble here. Sadly, ZIP isn't really designed for streaming.

In my scenario I'm working on a file packaging feature for a web service. It fetches a group of images from cloud storage and makes a ZIP file on the fly. Since the original images are already PNGs, it doesn't make much sense to compress them again. Some images can be up to 300MB, so streaming is the ideal way to avoid OOM.


michalc commented Feb 6, 2022

Sadly ZIP isn't really designed for streaming.

Yep!

In my scenario I'm working on a file packaging feature for a web service. It fetches a group of images from cloud storage and makes a ZIP file on the fly. Since original images are already in PNG, it doesn't make much sense to compress them again. Some images can be up to 300MB so streaming should be an ideal way to avoid OOM.

Ah I see... yes, a reasonable reason to not compress. One thing I notice: cloud storage usually does give the size of a file? (e.g. S3 gives the size in the Content-Length header of GET responses)...

So I think you have roughly 4 courses of action you can take:

  1. The easy one - compress them anyway using stream-zip... while unnecessary, it is likely fine?

  2. Fork this repo to allow non-compressed files without a data descriptor, changing its API to use the size of the file from cloud storage. Potentially we could merge changes back in, but I can't guarantee that.

    (or... wait for us/me to do this, but can't give an ETA on this)

  3. Fork this repo to allow non-compressed files with a data descriptor... this would mean the file won't be stream-unzip friendly (I'm not a fan of this one)

  4. Use a different Python stream zip library? There are others, and I'm not sure on exact feature set of all of them

@ArchangelSDY (Author)

Fork this repo to allow non-compressed files without a data descriptor, and changing its API to use the size of the file from cloud storage.

I'm in favor of this one, but there's one problem: it's easy to know the file size, but having the CRC32 in advance is non-trivial. To fix this we could keep the data descriptor, and hope the unzip implementation reads file sizes from the header and the CRC32 from the data descriptor. This looks quite tricky though, so I'm not sure about its compatibility.
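For what it's worth, the CRC-32 itself streams fine with the standard library; the difficulty is only that the final value isn't known until the last chunk has been read, so it can't go in a header that was already written. A small sketch with made-up chunks:

```python
import zlib

# CRC-32 can be accumulated chunk by chunk as data streams through,
# by passing the running value back in as the second argument
chunks = [b"part one ", b"part two ", b"part three"]

crc = 0
for chunk in chunks:
    crc = zlib.crc32(chunk, crc)

# The running CRC matches a one-shot CRC over the whole payload
assert crc == zlib.crc32(b"".join(chunks))
```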


michalc commented Feb 6, 2022

It's easy to know file size but having CRC32 in advance is non-trivial.

Oh you're right! I forgot that without a data descriptor, CRC32 is needed up front... yeah, a bit tricky...

Hopefully the unzip implementation reads file sizes from header and CRC32 from data descriptor. This looks quite tricky though so I'm not sure about its compatibility.

Hmmm... my instinct is that this won't have good compatibility... now you've reminded me what's in the data descriptor, it feels odd for something to use the size from the local file header say, but ignore its CRC32 value.

Although I have just thought of another choice. The deflate algorithm supports a non-compression mode (block type 0). So it's still deflate, and it still indicates its own end, and so works fine with data descriptors, but nothing would be compressed. There might be a few wasted extra bytes in the stream, but it would be fairly low. Off the top of my head, something like 3 bytes per 64k... something like that...

... if you wanted to use stream-zip for that, there would have to be a change, but it would just be passing in some options to zlib I think. Nothing at the "zip" level would be different.
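That suggestion can be sketched with the standard library alone (this shows only the zlib behaviour, not any change at the stream-zip level):

```python
import zlib

# Deflate at level=0 emits only "stored" (type 0) deflate blocks: the
# payload is not compressed, but the stream still marks its own end,
# which is what keeps data descriptors and stream-unzipping workable
data = b"\xff" * 200_000  # a payload that would not compress anyway

compress = zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)  # raw deflate
deflated = compress.compress(data) + compress.flush()

# The cost is a few header bytes per stored block (each at most ~64 KiB)
overhead = len(deflated) - len(data)

# Round-trips back to the original bytes
assert zlib.decompress(deflated, wbits=-zlib.MAX_WBITS) == data
```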


michalc commented Aug 21, 2022

Although I have just thought of another choice. The deflate algorithm supports a non-compression mode (block type 0). So it's still deflate, and it still indicates its own end, and so works fine with data descriptors, but nothing would be compressed. There might be a few wasted extra bytes in the stream, but it would be fairly low. Off the top of my head, something like 3 bytes per 64k... something like that...

... if you wanted to use stream-zip for that, there would have to be a change, but it would just be passing in some options to zlib I think. Nothing at the "zip" level would be different.

Have now made it possible to set the zlib options for the zip file, by adding a get_compressobj argument to the stream_zip function. Released in v0.0.48.

So to force no compression, and avoid buffering the file into memory, you can use:

```python
get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)
```
