Add Zstd as a ZIP compression method #25

ghost · 2020-05-30T06:48:19Z

First of all, thank you for your work on enabling p7zip users to use Zstd with the 7z archive file format.

I would like to use Zstd as a ZIP compression method because of the random access feature present in ZIP but not in 7z:

For ZIP files, you can extract a file without having to process and decompress the entire archive.
For 7z files, I believe random access is currently not implemented. However, it looks like the Apache Commons Compress library does have a "slow random access" implementation now. (More info: COMPRESS-342)

In addition, the ZIP File Format Specification by PKWARE Inc. has added the Zstandard compression method ID, and WinZip has added the Zstd method to the ZIP (ZIPX) format (please see this PDF and this webpage).

I found that the compression ratio of a ZIP file using Zstd Level 3 is similar to Deflate64 Level 5, while Zstd Level 3 performs (much) faster than Deflate64 Level 5.

I tried adding Zstd as a ZIP compression method (please see my code here) and it seems I can create, list and update a ZIP file using Zstd.

ghost · 2020-05-30T08:05:05Z

Update:
It seems that my code can also extract the ZIP/ZIPX using Zstd created by WinZip, but WinZip uses Method ID 93 instead of 20 - maybe WinZip does not follow the PKWARE specification?

ghost · 2020-05-30T10:43:49Z

I added both Method IDs in c66c832, which means the -mm=WzZstd switch should create Zstd ZIP files compatible with WinZip (to be tested).

The latest version of PKWARE's PKZIP for Windows does not offer Zstd compression in ZIP files, and I don't know what other ZIP programs (besides WinZip) offer Zstd support.
The -mm=Zstd is still available so that the created file follows PKWARE's specification and uses Method ID 20.

ghost · 2020-05-31T08:09:36Z

@ayende

I read your blog article "Random access compression and zstd" and learned that you would like to have both the space savings and the random access option.

Would the ZIP format with the Zstd compression help? Any feedback about this feature would be greatly appreciated.

ayende · 2020-05-31T09:55:24Z

A key problem with zip files is that each file is compressed independently. That is crucial to allow random access, after all.
If I do tar.gz, on the other hand, I get better compression rate, but no random access.

With zstd, you can have dictionary that is trained over the files and get the best of both worlds.
I'm not sure how well that would fit into the ZIP specification, and it will likely not be portable for a while.
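
To make the trade-off concrete, here is a minimal sketch of compressing small files independently with and without a shared dictionary. It is not zstd: it uses the preset-dictionary feature of zlib (Deflate) from the Python standard library as a stand-in for a trained zstd dictionary, so the sizes are only indicative. The sample records and the hand-written dictionary are made up for illustration.

```python
import zlib

# Hypothetical small "files": each one is compressed on its own, as in a ZIP archive.
records = [
    f'{{"id": {i}, "type": "user", "status": "active", "region": "eu-west"}}'.encode()
    for i in range(50)
]

# Hand-written stand-in for a trained dictionary: the text that the records share.
shared_dict = b'{"id": , "type": "user", "status": "active", "region": "eu-west"}'


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress one record independently, optionally primed with a dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress one record; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


# Total size when every record is compressed alone, without and with the dictionary.
plain_total = sum(len(compress(r)) for r in records)
dict_total = sum(len(compress(r, shared_dict)) for r in records)
print(f"independent: {plain_total} bytes, with shared dictionary: {dict_total} bytes")
```

Every record can still be decompressed on its own, so random access is preserved, but only if the reader has the exact dictionary that was used for compression.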

That said, we already implemented this in our software, which deals with compression of data inside a database. Having that in a zip file is nice, but not required.

jinfeihan57 · 2020-06-01T02:28:55Z

@ipaucek4680
Actually, a 7z archive does allow random access. For example: ./7z e t.7z CPP/7zip/Guid.txt
Just make sure that t.7z contains CPP/7zip/Guid.txt. You can list all files in the archive t.7z with: ./7z l t.7z

jinfeihan57 · 2020-06-01T03:06:26Z

@ayende @ipaucek4680
I read your blog post "Random access compression and zstd".
A 7z archive does allow random access. It may not be as efficient as random access with ZIP; you can compare them.
At the end of your article, you said "It is also possible for the dictionary to make the compression rate worse, so that is fun." A dictionary is meant for compressing many small files, and it is only effective when those small files share a lot of duplicate information.
For example, if you use a dictionary trained on text files to compress binary files, you will get worse results. The same applies when there are many small files of different types. A dictionary does not help in these situations.
There is one more thing to watch out for when using a dictionary. Once it is trained, you can't change it casually, because the same dictionary is needed for both compression and decompression. In some cases the data changes over time, and the previous dictionary may no longer suit the current data.

ghost · 2020-06-01T03:38:54Z

@jinfeihan57
Suppose that:

We have abigfile.7z which is 5 GB in size (and uses a slow compression algorithm such as LZMA/LZMA2), which contains p7zip source code, as well as lots of miscellaneous files (videos, music, photos, etc.).
We want to extract CPP/7zip/Guid.txt, but unfortunately it is in the middle of the archive:

|---------| Guid.txt |------|
|< 3 GB  >|    ^     |<2 GB>|

We use ./7z e abigfile.7z CPP/7zip/Guid.txt to extract the file we need.

In this case, is 7-zip (or p7zip) able to skip reading and decompressing the first 3 GB?

For ZIP, it can access Guid.txt without reading and decompressing the first 3 GB, which will be much faster.
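
The ZIP side of this scenario can be sketched with Python's standard `zipfile` module. The sketch builds a small in-memory archive with the same layout (filler, target file, filler) and counts how many bytes are read while extracting only the target. The `CountingBytesIO` helper, the filler sizes and the file contents are made up for illustration, and counting `read()` calls on an in-memory buffer is only a rough proxy for disk I/O.

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that counts how many bytes are returned by read()."""

    def __init__(self, initial: bytes = b"") -> None:
        super().__init__(initial)
        self.bytes_read = 0

    def read(self, size=-1):
        data = super().read(size)
        self.bytes_read += len(data)
        return data


def build_archive() -> bytes:
    """Build a ZIP with a small target file between two large filler entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("filler-before.bin", b"\0" * 3_000_000)
        zf.writestr("CPP/7zip/Guid.txt", b"example contents\n")
        zf.writestr("filler-after.bin", b"\0" * 2_000_000)
    return buf.getvalue()


def extract_one(archive: bytes, name: str):
    """Return (member contents, number of bytes read from the archive)."""
    src = CountingBytesIO(archive)
    with zipfile.ZipFile(src) as zf:
        # The central directory at the end of the archive gives the member's
        # offset, so the reader seeks straight to it instead of scanning.
        content = zf.read(name)
    return content, src.bytes_read


archive = build_archive()
content, bytes_read = extract_one(archive, "CPP/7zip/Guid.txt")
print(f"archive: {len(archive)} bytes, read to extract one file: {bytes_read} bytes")
```

The reader only touches the end-of-archive records, the central directory, and the target entry itself; the filler entries before and after it are skipped with a seek.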

ayende · 2020-06-01T07:03:15Z

@jinfeihan57 I actually test the compression ratio continuously, and if it drops below a certain value, we generate a new dictionary based on recent data. So it is self-adjusting.
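
A rough sketch of such a self-adjusting scheme follows. It is not the implementation described above: the class name, the threshold, the window size and the "training" step (simply concatenating recent records) are all assumptions, and zlib preset dictionaries stand in for zstd dictionaries. It does show the two points raised in this thread: the ratio is monitored continuously, and every old dictionary is kept because earlier data still needs it for decompression.

```python
import zlib
from collections import deque


class AdaptiveDictCompressor:
    """Hypothetical sketch: build a new dictionary when the ratio degrades."""

    def __init__(self, min_ratio: float = 1.5, window: int = 20) -> None:
        self.min_ratio = min_ratio
        self.recent = deque(maxlen=window)  # recent raw records
        self.ratios = deque(maxlen=window)  # recent compression ratios
        self.dicts = [b""]                  # every dictionary ever used, by id

    def compress(self, data: bytes):
        """Compress with the newest dictionary; return (dict_id, blob)."""
        dict_id = len(self.dicts) - 1
        zdict = self.dicts[dict_id]
        comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
        blob = comp.compress(data) + comp.flush()
        self.recent.append(data)
        self.ratios.append(len(data) / len(blob))
        window_full = len(self.ratios) == self.ratios.maxlen
        if window_full and sum(self.ratios) / len(self.ratios) < self.min_ratio:
            # Naive "training": concatenate recent records (Deflate can only
            # look back 32 KiB, so keep at most that much).
            self.dicts.append(b"".join(self.recent)[-32768:])
            self.ratios.clear()
        return dict_id, blob

    def decompress(self, dict_id: int, blob: bytes) -> bytes:
        """Decompress with the dictionary that was active at compression time."""
        zdict = self.dicts[dict_id]
        decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
        return decomp.decompress(blob) + decomp.flush()


# Usage with made-up records: the first window compresses poorly, which
# triggers a new dictionary that later records benefit from.
store = AdaptiveDictCompressor()
samples = [
    f'{{"id": {i}, "type": "user", "status": "active", "region": "eu-west"}}'.encode()
    for i in range(60)
]
stored = [store.compress(s) for s in samples]
```

Each stored blob carries the id of its dictionary, which is why the number of dictionaries grows over time instead of the old ones being replaced.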

jinfeihan57 · 2020-06-01T08:56:38Z

@ipaucek4680
As I said, it "may not be as efficient as random access with ZIP."
To obtain a higher compression ratio, 7z sorts the files by type and compresses files of the same type together. So 7z cannot do random access as efficiently as ZIP. But when there are few files, this has little effect on efficiency. Even though 7z does not offer true random access, the efficiency is acceptable (I suggest you compare them). That is the necessary sacrifice for the higher compression ratio.
I read this: Unsupported Zip Format. Adding the zstd algorithm to ZIP is a very good suggestion. For now I will focus on updating p7zip, but I will keep your proposal as a goal (I will add it to the wiki). The ZIP support in 7z already includes the LZMA compression algorithm. If you want a higher compression ratio, you can use LZMA for now. If you want both a higher compression ratio and faster decompression, you need zstd.

jinfeihan57 · 2020-06-01T11:11:41Z

@ayende
So you store many dictionaries, and each one corresponds to the files compressed or decompressed during a certain period of time. Am I getting it right?

ayende · 2020-06-01T11:49:56Z

I may end up with a lot of dictionaries, yes.
Note that I tested this on multi GB data sets (with millions of files).
I end up with fewer than 150 dictionaries of up to 8 KB each.

jinfeihan57 · 2020-06-15T07:54:09Z

@ipaucek4680
./7z a t.7z ./dirname -ms=off
Use -ms=off to turn solid mode off, so that the files are compressed independently.
That may improve the efficiency of random access.

tansy · 2020-08-04T19:44:51Z

There is nothing easier than creating a non-solid archive.

Every file separately:

$ 7z a -ms=off arc-ms-off.7z *

In N-file blocks (N is the number of files per block, here 100):

$ 7z a -ms=100f arc-ms-100f.7z *

In M-byte blocks (M is the size of a block, here 10 MB):

$ 7z a -ms=10m arc-ms-10m.7z *

All files in one solid block:

$ 7z a -ms=on arc-ms-on.7z *

As @jinfeihan57 said, turning solid mode off will allow random access but will reduce compression. You may want to test the block-solid options, which give faster access than full solid mode and better compression than non-solid mode, though worse than full solid compression. You can check how exactly it works with $ 7z l archive.7z
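
The trade-off between the two extremes can be sketched with Python's standard `lzma` module. This is not the 7z container: `lzma.compress` writes standalone .xz streams, whose per-stream overhead exaggerates the difference for tiny files, and the file names and contents below are made up for illustration.

```python
import lzma

# Hypothetical set of small, similar files (name -> contents).
files = {
    f"src/module{i}.txt": (f"// module {i}\n" + "int main(void) { return 0; }\n" * 5).encode()
    for i in range(40)
}

# Non-solid (like -ms=off): every file is its own compressed stream.
non_solid = {name: lzma.compress(data) for name, data in files.items()}
non_solid_size = sum(len(blob) for blob in non_solid.values())

# Solid (like -ms=on): all files are concatenated and compressed as one stream.
solid = lzma.compress(b"".join(files.values()))
solid_size = len(solid)

# Non-solid: one file can be recovered from its own stream alone.
one_file = lzma.decompress(non_solid["src/module7.txt"])

# Solid: reaching a file means decompressing everything stored before it.
everything = lzma.decompress(solid)

print(f"non-solid: {non_solid_size} bytes, solid: {solid_size} bytes")
```

Block-solid modes such as -ms=100f or -ms=10m sit between these two cases: only the block that contains the requested file has to be decompressed.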

More about methods.

tsjnachos117 · 2022-04-30T21:11:48Z

Regarding which method ID to use, ZIP format specification version 6.3.7 uses ID 20, but 6.3.8 moved it to 93 for some reason. (Source: Wikipedia)

So, which method we want to use depends on which version of the zip standard we want to follow. I'm guessing that's why WinZip uses id 93, as that pertains to a newer version of zip. Maybe we should just do the same?
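
To check which method ID a given archive actually uses, the central directory can be inspected with Python's standard `zipfile` module, as in the sketch below. The `METHOD_NAMES` table and the `list_methods` helper are illustrative: the two Zstandard entries reflect the IDs discussed in this thread, and the other IDs are the common ones from the PKWARE method list as far as I know. Listing entries does not require Python to be able to decompress them.

```python
import io
import zipfile

# Illustrative table of compression method IDs.
METHOD_NAMES = {
    0: "Stored",
    8: "Deflate",
    9: "Deflate64",
    12: "BZIP2",
    14: "LZMA",
    20: "Zstandard (ID from specification 6.3.7)",
    93: "Zstandard (ID from specification 6.3.8, used by WinZip)",
}


def list_methods(archive: bytes):
    """Return (file name, method ID, method name) for every entry."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return [
            (info.filename, info.compress_type,
             METHOD_NAMES.get(info.compress_type, "unknown"))
            for info in zf.infolist()
        ]


# Usage with a small in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello", compress_type=zipfile.ZIP_STORED)
    zf.writestr("b.txt", b"hello" * 100, compress_type=zipfile.ZIP_DEFLATED)
for name, method_id, method_name in list_methods(buf.getvalue()):
    print(f"{name}: method {method_id} ({method_name})")
```

A reader that wants to accept files from both WinZip and tools following the older ID could treat 20 and 93 as the same codec when extracting, which is what supporting both IDs amounts to.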

ghost mentioned this issue May 30, 2020

Add Zstd as a ZIP compression method mcmilk/7-Zip-zstd#132

Closed

jinfeihan57 added the enhancement New feature or request label Jun 3, 2020

This was referenced Jun 14, 2020

Support decompression compression algorithm using zstd/lzma/bzip2 zip file. mholt/archiver#223

Merged

Add zstd algorithm support in zip. zlib-ng/minizip-ng#498

Closed

jinfeihan57 added the help wanted Extra attention is needed label Jul 10, 2020

cielavenir mentioned this issue Apr 4, 2022

Codecs dir #172

Closed

antermin mentioned this issue Jul 2, 2022

Zstandard is also part of zip format since 2020 #183

Closed

jinfeihan57 closed this as completed Dec 28, 2022
