Data descriptor for no compressed file #17

Closed

ArchangelSDY opened this issue Feb 6, 2022 · 6 comments

@ArchangelSDY
For the limitation:

No compression is supported via the NO_COMPRESSION_* constants as in the above examples. However in these cases the entire contents of each uncompressed file is buffered in memory, and so should not be used for large files. This is because for uncompressed data, its size and CRC32 must be before it in the ZIP file.

To avoid buffering all contents, I'm wondering if we can add a data descriptor after the uncompressed data as well, so we can simply set 0 in the local file header, just like we do for compressed files. The original cpython zipfile implementation seems to decide based only on whether the output is seekable:

https://github.com/python/cpython/blob/96b344c2f15cb09251018f57f19643fe20637392/Lib/zipfile.py#L1610-L1611

So I guess this structure should also be valid?
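To make the proposed structure concrete, here's a minimal sketch of a stored (method 0) local file header with general purpose bit 3 set, followed by a data descriptor; the file name `f.txt` and the zeroed timestamp fields are just placeholders:

```python
import struct
import zlib

data = b"example file contents"

# Local file header: sizes and CRC-32 zeroed, bit 3 of the general
# purpose flag set to say "a data descriptor follows the file data"
local_file_header = struct.pack(
    "<IHHHHHIIIHH",
    0x04034B50,     # local file header signature
    20,             # version needed to extract
    0b1000,         # general purpose bit flag: bit 3 = data descriptor
    0,              # compression method: 0 = stored (no compression)
    0, 0,           # modification time/date (zeroed for brevity)
    0,              # CRC-32: deferred to the data descriptor
    0,              # compressed size: deferred
    0,              # uncompressed size: deferred
    len(b"f.txt"),  # file name length
    0,              # extra field length
) + b"f.txt"

# Data descriptor written after the file data, once the values are known
data_descriptor = struct.pack(
    "<IIII",
    0x08074B50,        # data descriptor signature (optional but common)
    zlib.crc32(data),  # CRC-32, now known
    len(data),         # compressed size (equals uncompressed for stored)
    len(data),         # uncompressed size
)
```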

@ArchangelSDY changed the title from "Data descriptor for no compression" to "Data descriptor for no compressed file" on Feb 6, 2022

michalc commented Feb 6, 2022

No compression is supported via the NO_COMPRESSION_* constants as in the above examples. However in these cases the entire contents of each uncompressed file is buffered in memory, and so should not be used for large files. This is because for uncompressed data, its size and CRC32 must be before it in the ZIP file.

To avoid buffering all contents, I'm wondering if we can add a data descriptor after uncompressed data as well so we can simply set 0 in the local file header [...] So I guess this structure should also be valid?

Ah - it might indeed be a valid ZIP, but I think that this would make it impossible to read the ZIP file with stream-unzippers like https://github.com/uktrade/stream-unzip , because to stream unZIP a file, it either has to have the compressed size up-front, or use some sort of compression format that indicates its own end (like deflate).

So the comment in the README is a bit misleading... it should probably say "in order to stream unZIP" or something like that.

To check though, do you have a particular use case for stream zipping large files without compression?

@ArchangelSDY (Author)

Thanks, now I see the trouble here. Sadly, ZIP isn't really designed for streaming.

In my scenario I'm working on a file packaging feature for a web service. It fetches a group of images from cloud storage and makes a ZIP file on the fly. Since the original images are already PNGs, it doesn't make much sense to compress them again. Some images can be up to 300MB, so streaming is the ideal way to avoid OOM.


michalc commented Feb 6, 2022

Sadly ZIP isn't really designed for streaming.

Yep!

In my scenario I'm working on a file packaging feature for a web service. It fetches a group of images from cloud storage and makes a ZIP file on the fly. Since original images are already in PNG, it doesn't make much sense to compress them again. Some images can be up to 300MB so streaming should be an ideal way to avoid OOM.

Ah I see... yes, a reasonable reason to not compress. One thing I notice: cloud storage usually does give the size of a file? (e.g. S3 gives the size in the Content-Length header of GET responses)...

So I think you have roughly 4 courses of action you can take:

  1. The easy one - compress them anyway using stream-zip... while unnecessary, it is likely fine?

  2. Fork this repo to allow non-compressed files without a data descriptor, changing its API to use the size of the file from cloud storage. Potentially we could merge changes back in, but I can't guarantee that.

    (or... wait for us/me to do this, but can't give an ETA on this)

  3. Fork this repo to allow non-compressed files with a data descriptor... this would mean the file won't be stream-unzip friendly (I'm not a fan of this one)

  4. Use a different Python stream zip library? There are others, and I'm not sure on exact feature set of all of them

@ArchangelSDY (Author)

Fork this repo to allow non-compressed files without a data descriptor, and changing its API to use the size of the file from cloud storage.

I'm in favor of this one, but there's one problem: it's easy to know the file size, but having the CRC32 in advance is non-trivial. To fix this we could keep the data descriptor, and hope the unzip implementation reads file sizes from the header and the CRC32 from the data descriptor. This looks quite tricky though, so I'm not sure about its compatibility.
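For what it's worth, the CRC-32 itself streams fine with the standard library; the difficulty is only that the final value isn't known until the last chunk has been read, so it can't go in a header that was already written. A small sketch with made-up chunks:

```python
import zlib

# CRC-32 can be accumulated chunk by chunk as data streams through,
# by passing the running value back in as the second argument
chunks = [b"part one ", b"part two ", b"part three"]

crc = 0
for chunk in chunks:
    crc = zlib.crc32(chunk, crc)

# The running CRC matches a one-shot CRC over the whole payload
assert crc == zlib.crc32(b"".join(chunks))
```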


michalc commented Feb 6, 2022

It's easy to know file size but having CRC32 in advance is non-trivial.

Oh you're right! I forgot that without a data descriptor, CRC32 is needed up front... yeah, a bit tricky...

Hopefully the unzip implementation reads file sizes from header and CRC32 from data descriptor. This looks quite tricky though so I'm not sure about its compatibility.

Hmmm... my instinct is that this won't have good compatibility... now you've reminded me what's in the data descriptor, it feels odd for something to use the size from the local file header say, but ignore its CRC32 value.

Although I have just thought of another choice. The deflate algorithm supports a non-compression mode (block type 0). So it's still deflate, and it still indicates its own end, and so works fine with data descriptors, but nothing would be compressed. There might be a few wasted extra bytes in the stream, but it would be fairly low. Off the top of my head, something like 3 bytes per 64k... something like that...

... if you wanted to use stream-zip for that, there would have to be a change, but it would just be passing in some options to zlib I think. Nothing at the "zip" level would be different.
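That suggestion can be sketched with the standard library alone (this shows only the zlib behaviour, not any change at the stream-zip level):

```python
import zlib

# Deflate at level=0 emits only "stored" (type 0) deflate blocks: the
# payload is not compressed, but the stream still marks its own end,
# which is what keeps data descriptors and stream-unzipping workable
data = b"\xff" * 200_000  # a payload that would not compress anyway

compress = zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)  # raw deflate
deflated = compress.compress(data) + compress.flush()

# The cost is a few header bytes per stored block (each at most ~64 KiB)
overhead = len(deflated) - len(data)

# Round-trips back to the original bytes
assert zlib.decompress(deflated, wbits=-zlib.MAX_WBITS) == data
```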


michalc commented Aug 21, 2022

Although I have just thought of another choice. The deflate algorithm supports a non-compression mode (block type 0). So it's still deflate, and it still indicates its own end, and so works fine with data descriptors, but nothing would be compressed. There might be a few wasted extra bytes in the stream, but it would be fairly low. Off the top of my head, something like 3 bytes per 64k... something like that...

... if you wanted to use stream-zip for that, there would have to be a change, but it would just be passing in some options to zlib I think. Nothing at the "zip" level would be different.

Have now made it possible to set the zlib options for the zip file, by adding a get_compressobj argument to the stream_zip function. Released in v0.0.48.

So to force no compression, and avoid buffering the file into memory, you can use:

```python
get_compressobj=lambda: zlib.compressobj(wbits=-zlib.MAX_WBITS, level=0)
```
