Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regarding LZMA1 (LZMA Alone, Method ID 14) in libzip (Follow-up of #195) #198

Closed
ghost opened this issue Jul 24, 2020 · 25 comments
Closed
Labels
bug libzip doesn't behave as expected.

Comments

@ghost
Copy link

ghost commented Jul 24, 2020

To use the LZMA library (XZ Utils) for Method ID 14, I guess we need to refer to libarchive's implementation: archive_read_support_format_zip.c

So, in order to use the "lzma alone" decoder for the zipx lzma
stream, we simply need to shuffle around some fields, prepare a new
lzma alone header, feed it into lzma alone decoder so it will
initialize itself properly, and then we can start feeding normal
zipx lzma stream into the decoder.

That means the following code from zip_algorithm_xz.c is probably not enough (incomplete) and extra work/magic is needed to deal with LZMA1 ZIP files.

        if (ctx->method == ZIP_CM_LZMA)
            ret = lzma_alone_encoder(&ctx->zstr, filters[0].options);
        if (ctx->method == ZIP_CM_LZMA)
            ret = lzma_alone_decoder(&ctx->zstr, UINT64_MAX);

CC: @jinfeihan57


EDIT: See also libarchive/libarchive#1417
EDIT 2: See also https://sourceforge.net/p/lzmautils/discussion/708858/thread/663db6473d/#2919

@ghost ghost added the bug libzip doesn't behave as expected. label Jul 24, 2020
@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 24, 2020

Thanks for the libarchive link, that is helpful.
Can you please provide a zip archive with the same contents as regress/testfile-stored-dos.zip compressed with lzma (id 14)? I don't even know a tool that supports this.

@ghost
Copy link
Author

ghost commented Jul 24, 2020

Test file, using LZMA, created by WinZip 24:
lzma-winzip-24.zip

Test file, using LZMA, created by vanilla p7zip 16.02:
lzma-vanilla-p7zip-1602.zip

@ghost
Copy link
Author

ghost commented Jul 24, 2020

CC: @mholt
(mholt/archiver#226)

@jinfeihan57
Copy link

jinfeihan57 commented Jul 25, 2020

@jloqfjgk
I furged it out yesterday. LOL,about lzma header. it different between lzma format and ZIP lzma format.
in lzma1 format it has 13-byte header. but in ZIP lzma format,It's only have 9 byte header.and
the preceding 4 bytes are irrelevant to the algorithm. it's like that:
截图
puls 1 byte Properties, plus 4-byte Dictionary Size. Total 9 bytes .then comes the LZMA Compressed Data.
NO Uncompressed Size(8 bytes).

so we have to rebuild the lzma header in ZIP.

But it not a easy job

  1. the lzma_code() is a stream compress API. It only finished when lzma_action action = LZMA_FINISH.
    before is finished we can't change the head.
  2. I try to move the head_rebuild_code to function compress_read(***).but It's abstraction layer . it should not contain a specific algorithm.
    Now I try to Separating the LZMA compressed header and compressed data .(Working on)

Anyone have any suggestion?Please help.

One more thing :
I just finished update p7zip17.02, Please try it. Thanks! : )

@jinfeihan57
Copy link

@jloqfjgk This(archive_read_support_format_zip.c) is helped. I will check it out.
@0-wiz-0 Are you working on it? if not I want to help.

0-wiz-0 added a commit that referenced this issue Jul 25, 2020
@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 25, 2020

Please try 750a78e!

@ghost
Copy link
Author

ghost commented Jul 25, 2020

@0-wiz-0

If the compressed LZMA stream contains the optional end-of-stream (EOS) marker (APPNOTE Section 5.8.9 and 4.4.4), will libzip handle it properly?

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 25, 2020

Right now, libzip does not have any particular handling for an lzma EOS marker but just passes the compressed stream as provided by liblzma (after handling the header).

What is the specification for the EOS marker? Does liblzma write it itself or does libzip have to add it manually? In which cases should libzip do that? How can libzip recognize if liblzma wrote it or not? What does it look like? The format seems woefully underspecified.

Neither of your two examples has the bit set, so I assume neither comes with such an EOS marker. Do you know a tool that writes one?

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 25, 2020

If we're talking about the PK\07\010 marker, that one was already handled correctly before.

@ghost
Copy link
Author

ghost commented Jul 26, 2020

I found this comment by the author of XZ Utils, Lasse Collin:

The lzma_alone_encoder() API in liblzma always adds the end marker. Currently there is no public API in liblzma to encode without the end marker. If you don't want the end marker, you need to use LZMA SDK or XZ for Java.

In 2008, he also said:

Note that ZIP allows LZMA with and without "end of stream (EOS)" marker (I usually call it "end of payload marker (EOPM)"). lzma_alone_decoder() expects to find this marker if and only if the uncompressed size is known. So you cannot unconditionally insert the uncompressed size from ZIP headers to the stream going to LZMA_Alone format decoder.

And in the same thread, Igor Pavlov (7-Zip author) said:

There are two types of LZMA streams:

  • Stream with end mark. That end mark adds about 6 bytes to compressed size.
  • Stream without end mark. You must know exact uncompressed size to decompress such stream.

LZMA EOS marker is additional data sequence in compressed stream. But only lzma decoder can recognize that EOS marker. It must unpack whole stream before marker.


Do you know a tool that writes one?

I figured out that we could enable the marker in 7-Zip by specifying the eos switch, which means:
7z a with-eos.zip abac-repeat.txt -mx=1 -meos+ -mm=LZMA
7z a without-eos.zip abac-repeat.txt -mx=1 -mm=LZMA

Test files:

Producer File Bit 1 set? (APPNOTE Section 4.4.4) Extracts in WinZip? Extracts in 7-Zip? Extracts in p7zip?
7-Zip 19.00 lzma-with-eos-7z-1900.zip Yes Yes Yes Yes
Vanilla p7zip 16.02 lzma-with-eos-vanilla-p7zip-1602.zip No Yes No (Headers Error: abac-repeat.txt) Yes
Explanation

From the above table:

  • p7zip is buggy and does not set the required bit, although it accepts both files.
  • WinZip is forgiving and does not warn / throw error when extracting the buggy file by p7zip.
  • 7-Zip seems to follow the specification and produces a correct file, and it refuses to open the buggy file.

It will be a good idea to contact Lasse Collin regarding the use of EOS marker with the API.

To my understanding, liblzma should be responsible for writing the EOS marker, and libzip should set the bit.

@ghost
Copy link
Author

ghost commented Jul 26, 2020

Test file produced by @jinfeihan57's p7zip (used version: commit b07be46):
lzma-with-eos-szcnick-p7zip-b07be46.zip

His version seems better than the vanilla p7zip - the output file has Bit 1 set and can be extracted by 7-Zip.

0-wiz-0 added a commit that referenced this issue Jul 26, 2020
Also fix a bug in gpbf merging while here.

Addresses #198.
@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 26, 2020

Thanks for the pointers! libzip now sets Bit 1 correctly.
In my testing it seems that lzma decoding works with and without the EOS/EOPM bytes, even if we don't explicitly pass the uncompressed size to the decoder (despite the notes to the contrary). Perhaps support for this was added upstream?
Anyway, please test git head and let me know if there are further issues!

@ghost
Copy link
Author

ghost commented Jul 27, 2020

Thank you very much. I have got mixed results.

Good news

Compression

I tested git head and this is the file produced by Ark using libzip backend:
lzma-libzip-ca24692.zip

The file size is smaller than the test files above (161 B vs 197 B). Nonetheless, it can be successfully extracted by the following programs:

  • WinZip
  • 7-Zip
  • p7zip (Vanilla version)
  • The Unarchiver (unar)
  • libarchive (bsdtar)

Note: In 7-Zip and p7zip, the method is displayed as "LZMA:eos".

Decompression

libzip extracts all the above test files fine.

Bad(?) news

When using Ark (with only libzip backend enabled) to open this file from KDE Bug 387996, Ark throws the following error:
failed to read data for entry: United-Arch/United-Arch-Dark/metacity-1/titlebutton-close-backdrop.png

So it turns out that libzip finds a CRC error for that file:

ziptool United-Arch.zip cat 408 > 408.png
can't read file at index '408': CRC error

But other programs such as unar, bsdtar and 7z (p7zip) can successfully extract that file.

@jinfeihan57
Copy link

According to APPNOTE section 5.8.8. the first two byte should be the version of lzma.

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 27, 2020

According to APPNOTE section 5.8.8. the first two byte should be the version of lzma.

I read this differently - it should be the version of the LZMA SDK used when creating the data. But this version is not provided by the lzma library, so I hardcoded it. Do you have a suggestion for a better version number?

@ghost
Copy link
Author

ghost commented Jul 27, 2020

It should be the version of the LZMA SDK used when creating the data

Correct.

Do you have a suggestion for a better version number?

Will lzma_version_number or lzma_version_string (Reference: src/liblzma/common/common.c from XZ Utils) help?

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 27, 2020

No, as I said, liblzma does not provide this information.
The lzma_version_number is an int32 while the data in the lzma stream is just two bytes, i.e. int16.
Also when looking at numbers in the archives provided in this issue, they are 9.14 and 10.2 and the lzmalib version numbers are in the 5.x.y range.

@jinfeihan57
Copy link

jinfeihan57 commented Jul 27, 2020

@0-wiz-0
The lzma version macros are available in the lzma library.
You can find in here: xz-5.2.5\src\liblzma\api\lzma\version.h
Line: 23
#define LZMA_VERSION_MAJOR 5
#define LZMA_VERSION_MINOR 2
If you installed lzma library it‘s supposed to be :/usr/include/lzma/version.h

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 27, 2020

As I said before, this is the version of the library, not the version of the LZMA SDK used.

 * \note        The version number of liblzma has nothing to with
 *              the version number of Igor Pavlov's LZMA SDK.
 */
#define LZMA_VERSION (LZMA_VERSION_MAJOR * UINT32_C(10000000) \
                + LZMA_VERSION_MINOR * UINT32_C(10000) \
                + LZMA_VERSION_PATCH * UINT32_C(10) \
                + LZMA_VERSION_STABILITY)

but the file format specifies to use the LZMA SDK version number.

dillof added a commit that referenced this issue Jul 27, 2020
@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 27, 2020

Please try c025b25, this version has no CRC errors with United-Arch.zip any longer.

@ghost
Copy link
Author

ghost commented Jul 27, 2020

Also when looking at numbers in the archives provided in this issue, they are 9.14 and 10.2 and the lzmalib version numbers are in the 5.x.y range.

I see what you mean now.

Hardcoding the LZMA SDK version is not a problem because the version number is purely informational (cosmetic) and is rather useless.

For example, Zip-Ada hardcodes 16.02 as version number. (Reference)
Zip-Ada has their own LZMA implementation (Reference) so an LZMA SDK version number is not available for their case.

@ghost
Copy link
Author

ghost commented Jul 28, 2020

The latest commit solves the CRC error, thank you.

Out-of-topic content (See KDE Bug 424740)

I found a minor, mostly cosmetic bug related to the United-Arch.zip.

Please refer to the following $ stat United-Arch output and you will notice that:

  1. libzip's access date's year is totally wrong
  2. Difference in "modify" timestamp by p7zip and libzip
  • Vanilla p7zip
Access: 2020-07-28 07:54:35.809764107 +0700
Modify: 2017-09-01 01:17:22.000000000 +0700
Change: 2020-07-28 07:54:32.336149874 +0700
  • libzip (1st attempt)
Access: 4436086-12-21 13:05:36.000000000 +0700
Modify: 2017-08-31 14:17:24.000000000 +0700
Change: 2020-07-28 07:53:29.294370405 +0700
  • libzip (2nd attempt)
Access: 4426948-12-29 00:24:16.000000000 +0700
Modify: 2017-08-31 14:17:24.000000000 +0700
Change: 2020-07-28 07:55:35.171233540 +0700

The tests were done on tmpfs.

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 28, 2020

I'd like this ticket to stay on topic of the lzma support, so if we consider that the time stamp issues are a problem, there should be a separate ticket for that, please.

Are there any open issues in the lzma support?

That being said:
For interpreting the time stamp in the archive, libzip follows InfoZIP, which uses localtime. I guess p7zip uses UTC. This is not defined clearly in the appnote, AFAICT.

I don't know how the Access and Change timestamps in your example came to be; libzip does not set those at all since it does not write extracted files to the disk itself.

If you write an unzipper, you should probably support either of the EF time stamps like the NTFS EF (0x000a) or the UNIX EF (0x000d).

@ghost
Copy link
Author

ghost commented Jul 28, 2020

Thanks for the explanation. I guess this may be Ark's problem then.

I am not aware of any other issue in the LZMA support - thank you very much for adding it!

Tip: Don't forget to edit TODO.md.

@0-wiz-0
Copy link
Member

0-wiz-0 commented Jul 28, 2020

Thanks, I've updated TODO.
I've also added you to THANKS. If you want to be mentioned differently than with your github login, just let me know!

Thank you for fetching together all the information and the test files, that was very helpful!

@0-wiz-0 0-wiz-0 closed this as completed Jul 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug libzip doesn't behave as expected.
Projects
None yet
Development

No branches or pull requests

2 participants