
Error decoding archives since 4.3 #65

Closed
luke-clifton opened this issue Mar 6, 2024 · 24 comments · Fixed by #66

Comments

@luke-clifton

I've been getting this error on some archives that work fine in other tools.

else fail "Content size mismatch in data descriptor record"

This is an example zip file that doesn't work.

https://releases.hashicorp.com/vault/1.15.6/vault_1.15.6_darwin_amd64.zip (sorry about the size).

Everything works in 4.2, which makes me suspect d78c534

@jgm
Owner

jgm commented Mar 6, 2024

Can you bisect to determine the exact commit that causes the issue?
I don't see anything obviously wrong with that commit.

@jgm
Owner

jgm commented Mar 6, 2024

I can reproduce the issue with a fresh build of 0.4.2.2.

@jgm
Owner

jgm commented Mar 6, 2024

I can also reproduce it with a fresh build of 0.4.2.

@jgm
Owner

jgm commented Mar 6, 2024

Also 0.3.2.5. From what I can see, this does not appear to be a regression in this library. Either (a) the problem is due to a change in a dependent library, like binary or zlib, or (b) this has been one of those zips this library could never handle.

@jgm
Owner

jgm commented Mar 6, 2024

Here's the relevant code:

  compressedData <- if bitflag .&. 0O10 == 0
      then getLazyByteString (fromIntegral compressedSize)
      else -- If bit 3 of general purpose bit flag is set,
           -- then we need to read until we get to the
           -- data descriptor record.
           do raw <- getCompressedData
              sig <- lookAhead getWord32le
              when (sig == 0x08074b50) $ skip 4
              skip 4 -- crc32
              cs <- getWord32le  -- compressed size
              skip 4 -- uncompressed size
              if fromIntegral cs == B.length raw
                 then return raw
                 else fail "Content size mismatch in data descriptor record"

Adding some trace on your input, I can see that sig == 0x08074b50 is False, cs == 1587007606, and B.length raw == 4498285.

@jgm
Owner

jgm commented Mar 6, 2024

It's odd because the zip is 133M, which should be almost entirely compressed data for one file. That doesn't seem to correspond to either cs or B.length raw above.

@jgm
Owner

jgm commented Mar 6, 2024

OK, I think I have a clue. This zip contains

0044a3a0: 5eed 1044 3450 4b03 0493 cdf0 94c5 2e21  ^..D4PK........!

Note the sequence 504b0304 which is not aligned. Our library is reading that as a signature for a local file header, which we shouldn't be encountering here, because this zip only contains one file! I suspect that these signatures need to be aligned, and we need to check for that.
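To illustrate (a standalone sketch, not the library's code): the local file header signature 0x04034b50 is stored on disk little-endian as the bytes 50 4b 03 04, and a naive scan will find it at any byte offset, aligned or not:

```haskell
import qualified Data.ByteString as B

-- "PK\x03\x04": the little-endian byte layout of the local file
-- header signature 0x04034b50.
localHeaderSig :: B.ByteString
localHeaderSig = B.pack [0x50, 0x4b, 0x03, 0x04]

-- Every byte offset at which the signature occurs, aligned or not.
sigOffsets :: B.ByteString -> [Int]
sigOffsets bs =
  [ i | i <- [0 .. B.length bs - B.length localHeaderSig]
      , localHeaderSig `B.isPrefixOf` B.drop i bs ]
```

Running sigOffsets over the hex-dump row above reports a match at an odd offset inside the compressed stream, which is exactly what the parser is tripping over.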

@jgm
Owner

jgm commented Mar 6, 2024

EDIT: There is nothing in APPNOTE.TXT about the signature needing to be word-aligned.

To clarify the situation: some zips, like this one, don't encode the length of the compressed data at the beginning, so we have to find where it ends before passing it to zlib's decompress.

We do this by looking for certain signatures. The problem is that the compressed data may, in rare cases, accidentally contain the signature, and I think that's what is happening here.

The docs are a bit unclear but I think maybe you're intended to decompress the data in a streaming way, incrementally, and that THIS process will tell you when the compressed data stops. Unfortunately the public API of zlib doesn't give incremental decompression functions, though there is something in an Internal module we could try.

@luke-clifton
Author

OK, so I can decompress that archive when the only change I make in my cabal freeze file is switching zip-archive from 0.4.3 to 0.4.2.

For what it's worth, my freeze file also has binary ==0.8.9.1 and zlib ==0.6.3.0.

I can carve out some time tomorrow to dig deeper.

@jgm
Owner

jgm commented Mar 7, 2024

Have you tried using 0.4.3 to unpack zips of different versions of vault, presumably created in similar fashion? If I am right in my hypothesis above, then we might expect it to work with similar zips with different content.

I will also try constraining zlib and binary.

@jgm
Owner

jgm commented Mar 7, 2024

I can't reproduce my results from this morning (now on a different machine).

Can confirm that building 0.4.2 with --constraint 'zlib == 0.6.3.0' --constraint 'binary == 0.8.9.1' fixes the issue. Using zlib == 0.7.0.0 also works.

@jgm
Owner

jgm commented Mar 7, 2024

OK, bisection seems to show that d78c534 introduced the problem.

@luke-clifton
Author

Right, so getCompressedData should look for the signature, and then also check the other fields that are checked in getLocalFile (CRC, compressed length), and if they don't match up, assume this was an accidental signature and keep consuming bytes until we find a signature that also matches up with the other data? That would significantly reduce the likelihood of an accidental match.

Not encoding the length of the compressed data, and not being able to "escape" the signatures seems quite broken, and prone to accidental matches by design.
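A standalone sketch of that cross-check idea (hypothetical helper names, not the library's code; the data descriptor layout is signature, CRC-32, compressed size, uncompressed size, four little-endian bytes each): accept a signature match only when its compressed-size field equals the number of bytes preceding it:

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Little-endian byte layout of the data descriptor signature 0x08074b50.
ddSig :: B.ByteString
ddSig = B.pack [0x50, 0x4b, 0x07, 0x08]

-- Read a little-endian Word32 at the given byte offset.
word32At :: B.ByteString -> Int -> Word32
word32At bs i =
      fromIntegral (B.index bs i)
  .|. fromIntegral (B.index bs (i + 1)) `shiftL` 8
  .|. fromIntegral (B.index bs (i + 2)) `shiftL` 16
  .|. fromIntegral (B.index bs (i + 3)) `shiftL` 24

-- First signature offset whose compressed-size field (8 bytes past the
-- start of the descriptor, after the CRC) equals the number of bytes
-- preceding the descriptor; accidental signatures fail this check.
findDescriptor :: B.ByteString -> Maybe Int
findDescriptor bs =
  case [ i | i <- [0 .. B.length bs - 16]
           , ddSig `B.isPrefixOf` B.drop i bs
           , word32At bs (i + 8) == fromIntegral i ] of
    (i : _) -> Just i
    []      -> Nothing
```

This still admits false positives (and the CRC could be cross-checked too), but an accidental signature would also have to be followed by a plausible size field to fool it.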

@jgm
Owner

jgm commented Mar 7, 2024

Yeah, that can't be the design! -- I will see if there's a better way that I am missing.

@jgm
Owner

jgm commented Mar 7, 2024

I think that for now we could try the thing you suggested and confirm the compressed length if we hit the signature, moving on if it doesn't match the length of what we took to be the data. That seems pretty kludgy but if it worked it would at least confirm my hypothesis about what is going on.

@jgm
Owner

jgm commented Mar 7, 2024

Alas, the (perhaps accidental) signature we're getting is not the signature for the "data descriptor" record that MAY follow the compressed data (but not always!) and that contains the size and CRC data. It is the signature for a local file header. We shouldn't have a second local file header -- that's why I think it must be accidental. But it wouldn't contain length or CRC information relating to the preceding compressed data.

@jgm
Owner

jgm commented Mar 7, 2024

According to the spec, the data descriptor MUST exist if bit 3 of the general purpose bit flag is set (as it is in this case -- this is the case that triggers use of getCompressedData; when bit 3 isn't set, we have information about the length of the compressed data, so that's easy).

So, getCompressedData should insist on finding this data descriptor. But how? The spec says:

although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.

Question: how do we recognize the data descriptor in the case where there isn't a signature?
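One common heuristic (sketched here with hypothetical names, not the library's actual code) is to peek at the next four bytes: if they equal 0x08074b50, skip them as the optional signature; either way, read the three fixed fields that follow:

```haskell
import Control.Monad (when)
import Data.Binary.Get (Get, getWord32le, lookAhead, runGet, skip)
import qualified Data.ByteString.Lazy as L
import Data.Word (Word32, Word8)

data DataDescriptor = DataDescriptor
  { ddCRC32            :: Word32
  , ddCompressedSize   :: Word32
  , ddUncompressedSize :: Word32
  } deriving (Eq, Show)

-- Peek at the next word; if it is the optional signature, consume it.
-- Then read crc-32, compressed size, uncompressed size (Word32 LE each).
getDataDescriptor :: Get DataDescriptor
getDataDescriptor = do
  sig <- lookAhead getWord32le
  when (sig == 0x08074b50) $ skip 4
  DataDescriptor <$> getWord32le <*> getWord32le <*> getWord32le

-- Example inputs: the same descriptor fields, with and without the
-- signature, should parse to the same result.
fields :: [Word8]
fields = [0xef,0xbe,0xad,0xde, 0x2a,0,0,0, 0x64,0,0,0]

withSig, withoutSig :: L.ByteString
withSig    = L.pack ([0x50,0x4b,0x07,0x08] ++ fields)
withoutSig = L.pack fields
```

The peek is itself ambiguous: an unsigned descriptor whose CRC-32 happens to be 0x08074b50 would be misread, so this can only ever be a heuristic.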

@jgm
Owner

jgm commented Mar 7, 2024

At https://issues.apache.org/jira/browse/COMPRESS-103 I see:

This is what the InfoZIP appnote.iz has to say
Bit 3: If this bit is set, the fields crc-32, compressed
size and uncompressed size are set to zero in the
local header. The correct values are put in the
data descriptor immediately following the compressed
data. (Note: PKZIP version 2.04g for DOS only
recognizes this bit for method 8 compression, newer
versions of PKZIP recognize this bit for any
compression method.)
[Info-ZIP note: This bit was introduced by PKZIP 2.04 for
DOS. In general, this feature can only be reliably used
together with compression methods that allow intrinsic
detection of the "end-of-compressed-data" condition. From
the set of compression methods described in this Zip archive
specification, only "deflate" and "bzip2" fulfill this
requirement. [boldface added]

This is what I suspected. So the right approach might be to figure out how to use the "Internal" streaming decompression function from zlib, which should tell us where the compressed data stops.

@jgm
Owner

jgm commented Mar 7, 2024

https://hackage.haskell.org/package/zlib-0.7.0.0/docs/Codec-Compression-Zlib-Internal.html#g:4

With decompressAllMembers False the decompressor knows when the data ends and will produce DecompressStreamEnd without you having to supply an empty chunk to indicate the end of the input.

@jgm
Owner

jgm commented Mar 7, 2024

I think foldDecompressStreamWithInput should work; just need to figure out what parameters to use.

We could use a DecompressResult type like

data DecompressResult =
    DecompressSuccess
      { decompressChunks :: [S.ByteString]
      , decompressRemainder :: L.ByteString }
  | DecompressFailure DecompressError

and then something like

foldDecompressStreamWithInput
  (\bs res ->
      case res of
        DecompressSuccess chunks remainder -> DecompressSuccess (bs:chunks) remainder
        DecompressFailure err -> DecompressFailure err)
  (\remainder -> DecompressSuccess [] remainder)
  (\err -> DecompressFailure err)

jgm added a commit that referenced this issue Mar 7, 2024
This fixes a problem that arises for local files with bit 3
of the general purpose bit flag set. In this case, we don't
get information up front about the size of the compressed
data.  So how do we know where the compressed data ends?
Previously, we tried to determine this by looking for the
signature of the data descriptor. But this doesn't always
HAVE a signature, and it is also possible for signatures to
occur accidentally in the compressed data itself.

Here we follow a clue from an Info-ZIP note:
"In general, this feature can only be reliably used
together with compression methods that allow intrinsic
detection of the 'end-of-compressed-data' condition."

We use the streaming decompression interface from
zlib's Internal module.  This tells us, in effect, where
the compressed data ends.

Closes #65.
@jgm
Owner

jgm commented Mar 7, 2024

I have tried this in PR #66.
I can confirm that with this fix, we can decompress the test zip above!
However, some of the tests are failing and I'm not sure yet why.

@jgm
Owner

jgm commented Mar 7, 2024

It is failing on

                             --   , testExtractFilesFailOnEncrypted
                             --   , testPasswordProtectedRead
                             --   , testIncorrectPasswordRead

with

Data.Binary.Get.runGet at position 66: Codec.Compression.Zlib: compressed data stream format error (invalid stored block lengths)

The problem is that when the data is encrypted (password-protected), the streaming decode fails and we don't get an archive. Previously we'd store the compressed data; we just wouldn't be able to decompress it.

We could fix this by going back to the old method when the content is encrypted.
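That fallback could be sketched as a small dispatch (hypothetical names, not necessarily what PR #66 does): stream-decompress only when the entry is Deflate and not encrypted (bit 0 of the general purpose bit flag), otherwise fall back to scanning for the descriptor:

```haskell
import Data.Bits ((.&.))
import Data.Word (Word16)

-- How to delimit the compressed data of a local file entry.
data Delimit = StreamingInflate | ScanForDescriptor
  deriving (Eq, Show)

-- Bit 0 of the general purpose bit flag marks an encrypted entry;
-- compression method 8 is Deflate.  Streaming inflation only makes
-- sense for unencrypted Deflate data.
chooseDelimit :: Word16 -> Word16 -> Delimit
chooseDelimit bitflag method
  | bitflag .&. 1 /= 0 = ScanForDescriptor  -- encrypted
  | method /= 8        = ScanForDescriptor  -- not Deflate
  | otherwise          = StreamingInflate
```

This mirrors the note in the merged commit message that getCompressedData gained a compressionMethod parameter for exactly this reason.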

@jgm jgm closed this as completed in #66 Mar 7, 2024
jgm added a commit that referenced this issue Mar 7, 2024
This fixes a problem that arises for local files with bit 3
of the general purpose bit flag set. In this case, we don't
get information up front about the size of the compressed
data.  So how do we know where the compressed data ends?
Previously, we tried to determine this by looking for the
signature of the data descriptor. But the data descriptor doesn't
always HAVE a signature, and it is also possible for signatures to
occur accidentally in the compressed data itself (#65).

Here we follow a clue from an Info-ZIP note:
"In general, this feature can only be reliably used
together with compression methods that allow intrinsic
detection of the 'end-of-compressed-data' condition."

We use the streaming decompression interface from
zlib's Internal module.  This tells us, in effect, where
the compressed data ends.

A parameter has been added to getCompressedData for
the compressionMethod, since we only want to do streaming
decompression if the data is compressed with Deflate.

Closes #65.
@jgm
Owner

jgm commented Mar 7, 2024

Just a note that fixing this issue also fixed #25 -- so, thanks for reporting!

@luke-clifton
Author

That's awesome. #25 has been around for a while 😄!

Thanks for looking into this. I'd carved out some time today to look into it; I guess I'll use that time to update dependencies instead!
