zlib.error: Error -3 while decompressing data: incorrect data check #422

Closed
zegrep opened this issue Apr 25, 2018 · 3 comments
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
is-robustness-issue: From a user's perspective, this is about robustness

Comments

zegrep commented Apr 25, 2018

We have to deal with a huge number of broken PDF files. The creator is "jsPDF 1.x-master". These files are not totally corrupted, so it would be nice to get the readable content.
I found a solution on Stack Overflow and it works fine for our needs.

PyPDF2/filters.py

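    import zlib
    from io import BytesIO

    def decompress(data):
        try:
            return zlib.decompress(data)
        except zlib.error:
            # Fall back to a lenient decoder for damaged streams.
            return decompress_corrupted(data)

    def decompress_corrupted(data):
        # MAX_WBITS | 32 auto-detects a zlib or gzip header.
        d = zlib.decompressobj(zlib.MAX_WBITS | 32)
        # zlib works on bytes, so use BytesIO (StringIO only works on Python 2).
        f = BytesIO(data)
        result_str = b''
        buffer = f.read(1)
        try:
            # Feed one byte at a time and keep everything decoded before the error.
            while buffer:
                result_str += d.decompress(buffer)
                buffer = f.read(1)
        except zlib.error:
            pass
        return result_str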
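
To illustrate why the byte-by-byte fallback recovers content, here is a minimal sketch on synthetic data. The test payload and the way the stream is damaged (flipping the last checksum byte) are assumptions made for the example; real broken PDFs may be damaged differently.

```python
import zlib
from io import BytesIO


def decompress_corrupted(data):
    """Decode one byte at a time and keep whatever was produced before an error."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)
    f = BytesIO(data)
    result = b""
    buffer = f.read(1)
    try:
        while buffer:
            result += d.decompress(buffer)
            buffer = f.read(1)
    except zlib.error:
        pass
    return result


# Synthetic payload (assumption: stands in for a PDF content stream).
original = b"Beautiful is better than ugly.\n" * 20
good = zlib.compress(original)

# Damage only the final checksum byte; the compressed body stays intact.
broken = good[:-1] + bytes([good[-1] ^ 0xFF])

try:
    zlib.decompress(broken)
    strict_failed = False
except zlib.error:
    strict_failed = True

recovered = decompress_corrupted(broken)
```

Because the decoded output is returned call by call, the data is already collected when the checksum error is raised on the final byte.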
zegrep commented May 30, 2018

Pragabhava commented

This fix helped me deal with an "Error -5 while decompressing data: incomplete or truncated stream", which I think is caused by improper end-of-line handling of a byte stream (Windows \r\n vs. Linux \n). I only had to replace f = StringIO(data) with f = BytesIO(data).
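
A minimal sketch of the Error -5 case on Python 3, assuming a synthetic stream that is simply cut short (the line-ending cause suspected above is not reproduced here). It also shows why `BytesIO` is needed: `io.StringIO` rejects bytes.

```python
import io
import zlib

# Synthetic payload (assumption: stands in for a PDF content stream).
original = b"Now is better than never.\n" * 10
truncated = zlib.compress(original)[:-10]  # drop the tail of the stream

# One-shot decompression refuses an incomplete stream.
try:
    zlib.decompress(truncated)
    message = ""
except zlib.error as exc:
    message = str(exc)

# On Python 3, StringIO only accepts str, so bytes raise TypeError.
try:
    io.StringIO(truncated)
    stringio_accepts_bytes = True
except TypeError:
    stringio_accepts_bytes = False

# Byte-by-byte decoding returns whatever could be decoded so far.
d = zlib.decompressobj()
f = io.BytesIO(truncated)
partial = b""
chunk = f.read(1)
while chunk:
    partial += d.decompress(chunk)
    chunk = f.read(1)
```

Here `partial` is a prefix of the original data, and `d.eof` stays `False` because the end of the stream was never reached.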

sthenault pushed a commit to sthenault/pdfminer.six that referenced this issue Jul 1, 2021
From time to time we come across files where no text is detected, while readers
like evince read the PDF fine. After some digging, it turned out that this is because the
PDF includes some badly compressed data. This may be fixed by decompressing byte
by byte and ignoring the error on the last check bytes (arbitrarily found to be
the last 3).

This has been largely inspired by py-pdf/pypdf#422
and the test file has been taken from there, so credits to @zegrep.
sthenault pushed a commit to sthenault/pdfminer.six that referenced this issue Sep 20, 2021
From time to time we come across files where no text is detected, while readers
like evince read the PDF fine. After some digging, it turned out that this is because the
PDF includes some badly compressed data (improper checksum). This may be fixed by
decompressing byte by byte and ignoring the error on the checksum bytes (arbitrarily
found to be the last 4, which seems consistent with an int32 checksum).

This has been largely inspired by py-pdf/pypdf#422
and the test file has been taken from there, so credits to @zegrep.
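
The "last 4 bytes" observation matches the zlib container format, where the stream ends with a 4-byte big-endian Adler-32 of the uncompressed data. A small sketch on synthetic data (the payload is an assumption for illustration):

```python
import zlib

# Synthetic payload (assumption: stands in for a PDF content stream).
original = b"Simple is better than complex.\n" * 10
stream = zlib.compress(original)

# The trailer is the Adler-32 of the uncompressed data, big-endian.
trailer = int.from_bytes(stream[-4:], "big")
checksum_matches = trailer == zlib.adler32(original)

# Without the trailer, a decompressobj still yields the full payload;
# it just never reaches the end-of-stream check.
body_only = zlib.decompressobj().decompress(stream[:-4])
```

This is why an error raised only while reading those final bytes does not imply any loss of decoded content.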
TVC-ScS added a commit to TVC-ScS/PyPDF4 that referenced this issue Nov 24, 2021
Zlib decompression fails in some cases (e.g. I have a PDF with a text overlay; it is the same issue as the one documented in py-pdf#422). With this change, decompression works without errors.
pietermarsman added a commit to pdfminer/pdfminer.six that referenced this issue Dec 11, 2021
* Attempt to handle decompression error on some broken PDF files

From time to time we come across files where no text is detected, while readers
like evince read the PDF fine. After some digging, it turned out that this is because the
PDF includes some badly compressed data. This may be fixed by decompressing byte
by byte and ignoring the error on the last check bytes (arbitrarily found to be
the last 3).

This has been largely inspired by py-pdf/pypdf#422
and the test file has been taken from there, so credits to @zegrep.

* Use a warning instead of raising an exception

where a zlib error is detected before the CRC checksum.

* Add line to CHANGELOG.md

* Only try decompressing if not in strict mode

* Change error into warning because warnings.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
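
The behaviour described in these commit notes could look roughly like the sketch below. It is not the pdfminer.six code: the function name `decompress_lenient`, the `strict` parameter, the `CorruptStreamWarning` class, and the 4-byte boundary are all hypothetical choices made for illustration.

```python
import warnings
import zlib


class CorruptStreamWarning(UserWarning):
    """Hypothetical category; warnings.warn requires a Warning subclass."""


def decompress_lenient(data, strict=False):
    """Decompress data; outside strict mode, salvage what a damaged stream allows."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        if strict:
            raise  # strict mode: keep the original exception
    d = zlib.decompressobj()
    out = b""
    i = 0
    try:
        for i in range(len(data)):
            out += d.decompress(data[i:i + 1])
    except zlib.error:
        # Assumption: the last 4 bytes are the checksum, so an error before
        # them means part of the payload may be missing.
        if i < len(data) - 4:
            warnings.warn("data loss while decompressing corrupted stream",
                          CorruptStreamWarning)
    return out


# Usage on a synthetic stream with a damaged checksum byte.
payload = b"abc" * 100
good = zlib.compress(payload)
broken = good[:-1] + bytes([good[-1] ^ 0xFF])
salvaged = decompress_lenient(broken)

try:
    decompress_lenient(broken, strict=True)
    strict_raised = False
except zlib.error:
    strict_raised = True
```

Raising in strict mode keeps the failure visible to callers who want it, while the default path returns the salvaged bytes and warns only when the error occurs before the assumed checksum bytes.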
MartinThoma added the is-bug, Has MCVE, and is-robustness-issue labels Apr 7, 2022
MartinThoma (Member) commented

It seems as if this issue is gone with PyPDF2==2.3.0:

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader("zen_of_python_corrupted.pdf")
>>> reader.metadata
{'/Producer': 'GPL Ghostscript 9.18', '/CreationDate': "D:20180530133107+02'00'", '/ModDate': "D:20180530133107+02'00'", '/Title': 'zen_of_python.txt', '/Author': '', '/Creator': 'a2ps version 4.14'}

Thank you for reporting it!

goffauxs added a commit to odoo-dev/odoo that referenced this issue Mar 22, 2023
PDF files with badly compressed data can throw zlib errors when the Odoo
banner is added in the corner (Original Bills). These files are readable
but cause a traceback when PyPDF2 tries to decompress the data. If we
decompress the blocks byte by byte and ignore the error, the resulting
file seems to be identical to the source file (aside from the Odoo
banner).

Largely inspired by py-pdf/pypdf#422

opw-3151302