-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error decoding archives since 4.3 #65
Comments
Can you bisect to determine the exact commit that causes the issue? |
I can reproduce the issue with a fresh build of 0.4.2.2. |
I can also reproduce it with a fresh build of 0.4.2. |
Also 0.3.2.5. From what I can see, this does not appear to be a regression in this library. Either (a) the problem is due to a change in a dependent library, like binary or zlib, or (b) this has been one of those zips this library could never handle. |
Here's the relevant code: compressedData <- if bitflag .&. 0O10 == 0
then getLazyByteString (fromIntegral compressedSize)
else -- If bit 3 of general purpose bit flag is set,
-- then we need to read until we get to the
-- data descriptor record.
do raw <- getCompressedData
sig <- lookAhead getWord32le
when (sig == 0x08074b50) $ skip 4
skip 4 -- crc32
cs <- getWord32le -- compressed size
skip 4 -- uncompressed size
if fromIntegral cs == B.length raw
then return raw
else fail "Content size mismatch in data descriptor record" Adding some trace on your input, I can see that |
It's odd because the zip is 133M which should almost entirely be compressed data for one file. This doesn't seem to correspond to either |
OK, I think I have a clue. This zip contains
Note the sequence 504b0304 which is not aligned. Our library is reading that as a signature for a local file header, which we shouldn't be encountering here, because this zip only contains one file! I suspect that these signatures need to be aligned, and we need to check for that. |
EDIT: There is nothing in APPNOTE.TXT about the signature needing to be word-aligned. To clarify the situation: some zips, like this one, don't encode the length of the compressed data at the beginning, so we have to find where it ends before passing it to zlib's decompress. We do this by looking for certain signatures. The problem is that the compressed data may, in rare cases, accidentally contain the signature, and I think that's what is happening here. The docs are a bit unclear but I think maybe you're intended to decompress the data in a streaming way, incrementally, and that THIS process will tell you when the compressed data stops. Unfortunately the public API of zlib doesn't give incremental decompression functions, though there is something in an Internal module we could try. |
Ok, so, I can decompress that when the only change I make in my cabal freeze file is changing For what its worth, my freeze file also has I can carve out some time tomorrow to dig deeper. |
Have you tried using 0.4.3 to unpack zips of different versions of vault, presumably created in similar fashion? If I am right in my hypothesis above, then we might expect it to work with similar zips with different content. I will also try constraining zlib and binary. |
I can't reproduce my results from this morning (now on a different machine). Can confirm that building 0.4.2 with |
OK, bisection seems to show that d78c534 introduced the problem. |
Right, so Not encoding the length of the compressed data, and not being able to "escape" the signatures seems quite broken, and prone to accidental matches by design. |
Yeah, that can't be the design! -- I will see if there's a better way that I am missing. |
I think that for now we could try the thing you suggested and confirm the compressed length if we hit the signature, moving on if it doesn't match the length of what we took to be the data. That seems pretty kludgy but if it worked it would at least confirm my hypothesis about what is going on. |
Alas, the (perhaps accidental) signature we're getting is not the signature for the "data descriptor" record that MAY follow the compressed data (but not always!) and that contains the size and CRC data. It is the signature for a local file header. We shouldn't have a second local file header -- that's why I think it must be accidental. But it wouldn't contain length or CRC information relating to the preceding compressed data. |
According to the spec, the data descriptor MUST exist if bit 3 of the general purpose bit flag is set (as it is in this case -- this is the case that triggers use of getCompressedData; when bit 3 isn't set, we have information about the length of the compressed data, so that's easy). So, getCompressedData should insist on finding this data descriptor. But how? The spec says:
Question: how do we recognize the data descriptor in the case where there isn't a signature? |
AT https://issues.apache.org/jira/browse/COMPRESS-103 I see:
This is what I suspected. So the right approach might be to figure out how to use the "Internal" streaming decompression function from zlib, which should tell us where the compressed data stops. |
https://hackage.haskell.org/package/zlib-0.7.0.0/docs/Codec-Compression-Zlib-Internal.html#g:4
|
I think We could use for data DecompressResult =
DecompressSuccess
{ decompressChunks :: [S.ByteString]
, decompressRemainder :: L.ByteString }
| DecompressFailure DecompressError and then something like foldDecompressStreamWithInput
(\bs res ->
case res of
DecompressSuccess chunks remainder -> DecompressSuccess (bs:chunks) remainder
DecompressFailure err -> DecompressFailure err)
(\remainder -> DecompressSuccess [] remainder)
(\err -> DecompressFailure err) |
This fixes a problem that arises for local files with bit 3 of the general purpose bit flag set. In this case, we don't get information up front about the size of the compressed data. So how do we know where the compressed data ends? Previously, we tried to determine this by looking for the signature of the data descriptor. But this doesn't always HAVE a signature, and it is also possible for signatures to occur accidentally in the compressed data itself. Here we follow a clue from an Info-ZIP note: "In general, this feature can only be reliably used together with compression methods that allow intrinsic detection of the 'end-of-compressed-data' condition." We use the streaming decompression interface from zlib's Internal module. This tells us, in effect, where the compressed data ends. Closes #65.
I have tried this in PR #66. |
It is failing on
with
The problem is that when the data is compressed or password-protected, the streaming decode fails and we don't get an archive. Previously we'd store the compressed data; we just wouldn't be able to decompress it. We could fix this by going back to the old method when the content is encrypted. |
This fixes a problem that arises for local files with bit 3 of the general purpose bit flag set. In this case, we don't get information up front about the size of the compressed data. So how do we know where the compressed data ends? Previously, we tried to determine this by looking for the signature of the data descriptor. But the data descriptor doesn't always HAVE a signature, and it is also possible for signatures to occur accidentally in the compressed data itself (#65). Here we follow a clue from an Info-ZIP note: "In general, this feature can only be reliably used together with compression methods that allow intrinsic detection of the 'end-of-compressed-data' condition." We use the streaming decompression interface from zlib's Internal module. This tells us, in effect, where the compressed data ends. A parameter has been added to getCompressedData for the compressionMethod, since we only want to do streaming decompression if the data is compressed with Deflate. Closes #65.
Just a note that fixing this issue also fixed #25 -- so, thanks for reporting! |
That's awesome. #25 has been around for a while 😄! Thanks for looking into this. I'd carved out some time today to look into it, I guess I'll use that time to update dependencies instead! |
I've getting this error on some archives that work fine in other tools.
zip-archive/src/Codec/Archive/Zip.hs
Line 796 in fa62708
This is an example zip file that doesn't work.
https://releases.hashicorp.com/vault/1.15.6/vault_1.15.6_darwin_amd64.zip (sorry about the size).
Everything works in 4.2, which makes me suspect d78c534
The text was updated successfully, but these errors were encountered: