Add Zstd as a ZIP compression method #25

Closed
ghost opened this issue May 30, 2020 · 14 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@ghost

ghost commented May 30, 2020

First of all, thank you for your work on enabling p7zip users to use Zstd with the 7z archive file format.

I would like to use Zstd as a ZIP compression method because of the random access feature present in ZIP but not in 7z:

  • For ZIP files, you can extract a file without having to process and decompress the entire archive.
  • For 7z files, I believe random access is currently not implemented. However, it looks like the Apache Commons Compress library does have a "slow random access" implementation now. (More info: COMPRESS-342)
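The ZIP side of this comparison can be sketched with Python's standard `zipfile` module, used here only as a stand-in for p7zip. The archive contents and the helper names (`build_demo_zip`, `read_member`) are made up for illustration. The central directory stores each member's offset, so a reader can jump straight to one entry:

```python
import io
import zipfile


def build_demo_zip() -> bytes:
    """Create a small in-memory ZIP with three members (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("docs/readme.txt", "first member\n" * 100)
        zf.writestr("CPP/7zip/Guid.txt", "the member we want\n")
        zf.writestr("docs/notes.txt", "last member\n" * 100)
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Look the member up in the central directory and decompress only it."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)  # central directory entry for this member
        print(f"{name}: local header at byte {info.header_offset}, "
              f"{info.compress_size} compressed bytes")
        return zf.read(info)


data = read_member(build_demo_zip(), "CPP/7zip/Guid.txt")
```

Each member is compressed on its own, which is what makes this lookup cheap and also what limits the compression ratio compared with a solid archive.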

In addition, the ZIP File Format Specification by PKWARE Inc. has added the Zstandard compression method ID, and WinZip has added the Zstd method to the ZIP (ZIPX) format (please see this PDF and this webpage).

I found that the compression ratio of a ZIP file using Zstd Level 3 is similar to Deflate64 Level 5, while Zstd Level 3 performs (much) faster than Deflate64 Level 5.

I tried adding Zstd as a ZIP compression method (please see my code here) and it seems I can create, list and update a ZIP file using Zstd.

@ghost
Author

ghost commented May 30, 2020

Update:
It seems that my code can also extract the ZIP/ZIPX using Zstd created by WinZip, but WinZip uses Method ID 93 instead of 20 - maybe WinZip does not follow the PKWARE specification?

@ghost
Author

ghost commented May 30, 2020

I added both Method IDs in c66c832, which means the -mm=WzZstd switch should create Zstd ZIP files compatible with WinZip (to be tested).

The latest version of PKWARE's PKZIP for Windows does not offer Zstd compression in ZIP files, and I don't know what other ZIP programs (besides WinZip) offer Zstd support.
The -mm=Zstd is still available so that the created file follows PKWARE's specification and uses Method ID 20.

@ghost
Author

ghost commented May 31, 2020

@ayende

I read your blog article "Random access compression and zstd" and learned that you would like to have both the space savings and the random access option.

Would the ZIP format with the Zstd compression help? Any feedback about this feature would be greatly appreciated.

@ayende

ayende commented May 31, 2020

A key problem with zip files is that each file is compressed independently. That is crucial to allow random access, after all.
If I do tar.gz, on the other hand, I get a better compression rate, but no random access.

With zstd, you can have a dictionary that is trained over the files and get the best of both worlds.
I'm not sure how well that would fit into the ZIP specification, and it will likely not be portable for a while.

That said, we already implemented this in our software, which deals with compression of data inside a database. Having that in a zip file is nice, but not required.
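A rough illustration of the shared-dictionary idea, with loud caveats: this uses zlib's preset dictionary (`zdict`) from the Python standard library as a stand-in for a trained zstd dictionary, and the "dictionary" is just a few sample records concatenated, not the output of real zstd training. The records and helper names are invented for the example. Each record is still compressed independently, so random access is preserved:

```python
import zlib


def make_record(i: int) -> bytes:
    """Made-up small record; all records share most of their structure."""
    return f'{{"user": "user{i:03d}", "action": "login", "status": "ok"}}'.encode()


def compress_with_dict(record: bytes, dictionary: bytes) -> bytes:
    comp = zlib.compressobj(zdict=dictionary)
    return comp.compress(record) + comp.flush()


def decompress_with_dict(packed: bytes, dictionary: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(packed) + decomp.flush()


# Crude stand-in for training: concatenate a few earlier samples.
shared_dict = b"".join(make_record(i) for i in range(8))
records = [make_record(i) for i in range(100, 120)]

# Every record is compressed on its own in both cases.
plain_total = sum(len(zlib.compress(r)) for r in records)
dict_total = sum(len(compress_with_dict(r, shared_dict)) for r in records)
print(f"without dictionary: {plain_total} bytes, with dictionary: {dict_total} bytes")
```

The same dictionary bytes must be available at decompression time, which is the portability concern raised above for ZIP.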

@jinfeihan57
Contributor

@ipaucek4680
Actually, a 7z archive can be accessed randomly. Use: ./7z e t.7z CPP/7zip/Guid.txt
Just make sure t.7z contains CPP/7zip/Guid.txt. You can use ./7z l t.7z to list all files in the archive t.7z.

@jinfeihan57
Contributor

@ayende @ipaucek4680
I read your blog post about random access compression and zstd.
A 7z archive can be accessed randomly, though it may not be as efficient as random access with ZIP. You can compare them.
At the end of your article, you said "It is also possible for the dictionary to make the compression rate worse, so that is fun." A dictionary is meant for compressing many small files, and it is only efficient when those small files share a lot of duplicate information.
For example, if you use a dictionary trained on text files to compress binary files, you will get worse results. The same applies when there are many small files of different types. A dictionary does not help in these situations.
There is one more thing to watch when using a dictionary. Once it is trained, you can't change it casually, because the same dictionary is needed for both compression and decompression. In some cases the data changes over time, and the previous dictionary may no longer suit the current data.

@ghost
Author

ghost commented Jun 1, 2020

@jinfeihan57
Suppose that:

  1. We have abigfile.7z which is 5 GB in size (and uses a slow compression algorithm such as LZMA/LZMA2), which contains p7zip source code, as well as lots of miscellaneous files (videos, music, photos, etc.).
  2. We want to extract CPP/7zip/Guid.txt, but unfortunately it is in the middle of the archive:
|---------| Guid.txt |------|
|< 3 GB  >|    ^     |<2 GB>|
  3. We use ./7z e abigfile.7z CPP/7zip/Guid.txt to extract the file we need.

In this case, is 7-zip (or p7zip) able to skip reading and decompressing the first 3 GB?

For ZIP, it can access Guid.txt without reading and decompressing the first 3 GB, which will be much faster.
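The ZIP claim can be checked with a small experiment. This is a sketch using Python's `zipfile` on an in-memory archive, not a measurement of 7-Zip or p7zip, and it uses 1 MiB of random filler on each side instead of gigabytes. `CountingBytesIO` and the other names are hypothetical helpers that count how many bytes the reader actually pulls from the archive:

```python
import io
import os
import zipfile


class CountingBytesIO(io.BytesIO):
    """BytesIO that records how many bytes callers actually read."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive(filler_size: int = 1 << 20) -> bytes:
    """Put a small file between two large members, as in the diagram above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("before.bin", os.urandom(filler_size))
        zf.writestr("CPP/7zip/Guid.txt", b"small file in the middle\n")
        zf.writestr("after.bin", os.urandom(filler_size))
    return buf.getvalue()


def extract_and_count(archive: bytes, name: str):
    """Extract one member and report how many archive bytes were read."""
    source = CountingBytesIO(archive)
    with zipfile.ZipFile(source) as zf:
        payload = zf.read(name)
    return payload, source.bytes_read


archive = build_archive()
payload, bytes_read = extract_and_count(archive, "CPP/7zip/Guid.txt")
print(f"archive: {len(archive)} bytes, read to extract one file: {bytes_read} bytes")
```

Only the end-of-archive records, the central directory, and the requested member are read; the large members on either side are skipped by seeking.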

@ayende

ayende commented Jun 1, 2020

@jinfeihan57 I actually continuously test the compression ratio, and if it drops below a certain value, we'll generate a new dictionary based on recent information. So it is self-adjusting.
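A minimal sketch of what such a self-adjusting scheme might look like. This is not the implementation described in the comment: it uses zlib preset dictionaries as a stand-in for zstd, "training" is just concatenating recent records, and the class name, threshold, and window sizes are all assumptions chosen for the demo.

```python
import zlib
from collections import deque


def deflate(record, zdict=None):
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(record) + comp.flush()


def inflate(packed, zdict=None):
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(packed) + decomp.flush()


class AdaptiveDictCompressor:
    """Hypothetical: rebuild the dictionary when the recent ratio degrades."""

    def __init__(self, min_ratio=1.5, window=4, dict_samples=8):
        self.min_ratio = min_ratio
        self.sizes = deque(maxlen=window)          # (raw, packed) sizes of recent records
        self.samples = deque(maxlen=dict_samples)  # recent raw records
        self.dictionaries = [None]                 # old ones are kept for decompression

    def compress(self, record):
        dict_id = len(self.dictionaries) - 1
        packed = deflate(record, self.dictionaries[dict_id])
        self.samples.append(record)
        self.sizes.append((len(record), len(packed)))
        if len(self.sizes) == self.sizes.maxlen and self._ratio() < self.min_ratio:
            # Ratio dropped: build a new dictionary from recent data.
            self.dictionaries.append(b"".join(self.samples))
            self.sizes.clear()
        return dict_id, packed

    def decompress(self, dict_id, packed):
        return inflate(packed, self.dictionaries[dict_id])

    def _ratio(self):
        raw = sum(r for r, _ in self.sizes)
        packed = sum(p for _, p in self.sizes)
        return raw / packed


# Made-up data whose shape changes halfway through.
json_like = [b'{"sensor": "temp-%03d", "unit": "celsius", "ok": true}' % i for i in range(12)]
xml_like = [b'<event kind="click" x="%03d" y="042" target="button"/>' % i for i in range(12)]
store = AdaptiveDictCompressor()
stored = [store.compress(r) for r in json_like + xml_like]
print(f"dictionaries created: {len(store.dictionaries) - 1}")
```

Each compressed record carries the ID of the dictionary it was written with, so older records stay readable after a new dictionary is generated.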

@jinfeihan57
Contributor

@ipaucek4680
Like I said, it "may not be as efficient as random access with zip."
To obtain a higher compression ratio, 7z groups files by type and compresses files of the same type together, so 7z cannot provide random access as efficiently as ZIP. When there are few files, this has little effect on efficiency. Even though 7z does not offer true random access, the efficiency is okay (I suggest you compare them). It is a necessary sacrifice for the compression ratio.
I read this Unsupported Zip Format. Adding the zstd algorithm to zip is a very good suggestion. Currently I am focusing on updating p7zip, but I will of course keep your proposal as a target (I will add it to the wiki). The zip support in 7z includes the lzma compression algorithm. If you want a higher compression ratio, you can use lzma for now. If you want both a higher compression ratio and faster decompression speed, you need zstd.

@jinfeihan57
Contributor

@ayende
So you store a lot of dictionary files, and each dictionary corresponds to the files compressed or decompressed during a certain period of time. Am I getting it right?

@ayende

ayende commented Jun 1, 2020

I may end up with a lot of dictionaries, yes.
Note that I tested this on multi GB data sets (with millions of files).
I end up with fewer than 150 dictionaries of up to 8 KB each.

@jinfeihan57
Contributor

@ipaucek4680
./7z a t.7z ./dirname -ms=off
Using -ms=off turns solid mode off, so that files are compressed independently.
This may improve the efficiency of random access.

@jinfeihan57 jinfeihan57 added the help wanted Extra attention is needed label Jul 10, 2020
@tansy
Contributor

tansy commented Aug 4, 2020

There is nothing easier than creating a non-solid archive.

Every file separately:

$ 7z a -ms=off arc-ms-off.7z *

In N-file blocks (N is the number of files per block, here: 100):

$ 7z a -ms=100f arc-ms-100f.7z *

In M-byte blocks (M is the block size, here: 10 MB):

$ 7z a -ms=10m arc-ms-10m.7z *

All files in one solid block:

$ 7z a -ms=on arc-ms-on.7z *

As @jinfeihan57 said, it will allow you random access but will decrease compression. You may want to test the block-solid options, which give faster access than full solid mode and better compression than non-solid mode, though worse compression than full solid. You can check how exactly it works with $ 7z l archive.7z

More about methods.

@cielavenir cielavenir mentioned this issue Apr 4, 2022
@tsjnachos117

Regarding which method ID to use: ZIP format version 6.3.7 uses ID 20, but 6.3.8 moved it to 93 for some reason. (Source: Wikipedia)

So which method ID we use depends on which version of the ZIP standard we want to follow. I'm guessing that's why WinZip uses ID 93, as that corresponds to a newer version of the ZIP specification. Maybe we should just do the same?
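Since archives written with either ID may exist, a reader would probably want to accept both. A small sketch of how the method ID of each entry could be inspected with Python's `zipfile`: the ID set {20, 93} comes from this thread, while the helper names are made up, and this only reads metadata from the central directory (it does not decompress Zstd entries, and it has not been tested against WinZip-created files):

```python
import io
import zipfile

# Method IDs discussed in this thread: 20 (APPNOTE 6.3.7) and 93 (6.3.8, WinZip).
ZSTD_METHOD_IDS = {20, 93}
# A few other well-known ZIP method IDs, for readable output.
KNOWN_METHODS = {0: "Stored", 8: "Deflate", 9: "Deflate64", 12: "BZIP2", 14: "LZMA"}


def method_name(method_id: int) -> str:
    """Map a ZIP compression method ID to a human-readable name."""
    if method_id in ZSTD_METHOD_IDS:
        return f"Zstandard (ID {method_id})"
    return KNOWN_METHODS.get(method_id, f"unknown (ID {method_id})")


def list_methods(archive):
    """Return (filename, method) pairs read from the central directory only."""
    with zipfile.ZipFile(archive) as zf:
        return [(info.filename, method_name(info.compress_type))
                for info in zf.infolist()]


# Demo archive with two entries using methods Python can write itself.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("stored.txt", "abc", compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", "abc" * 100, compress_type=zipfile.ZIP_DEFLATED)
demo.seek(0)

methods = list_methods(demo)
for filename, method in methods:
    print(f"{filename}: {method}")
```

Writing with ID 93 while reading both 20 and 93 would match the suggestion above and stay compatible with files created under the older ID.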
