zlib.error: Error -3 while decompressing data: incorrect data check #422

zegrep · 2018-04-25T07:32:42Z

We have to deal with a huge number of broken PDF files. The creator is "jsPDF 1.x-master". These files are not totally corrupted, so it would be nice to recover the readable content.
I found a solution on Stack Overflow and it works fine for our needs.

PyPDF2/filters.py

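    import zlib
    try:
        from StringIO import StringIO  # Python 2: str is a byte string
    except ImportError:
        from io import BytesIO as StringIO  # Python 3: data is bytes

    def decompress(data):
        try:
            return zlib.decompress(data)
        except zlib.error:
            # Fall back to a lenient decode for damaged streams.
            return decompress_corrupted(data)

    def decompress_corrupted(data):
        # MAX_WBITS | 32 lets zlib auto-detect a zlib or gzip header.
        d = zlib.decompressobj(zlib.MAX_WBITS | 32)
        f = StringIO(data)
        result_str = b''
        buffer = f.read(1)
        try:
            # Feed one byte at a time so that everything decoded before
            # the error is kept.
            while buffer:
                result_str += d.decompress(buffer)
                buffer = f.read(1)
        except zlib.error:
            pass
        return result_str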

zegrep · 2018-05-30T11:40:53Z

zen_of_python_corrupted.pdf

Pragabhava · 2021-06-30T04:45:17Z

This fix helped me deal with an "Error -5 while decompressing data: incomplete or truncated stream", caused by what I think is improper end-of-line handling of a byte stream (Windows \r\n vs. Linux \n). I only had to replace f = StringIO(data) with f = BytesIO(data).
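For anyone who wants to try the fallback without a broken PDF at hand, here is a minimal sketch for Python 3 using the BytesIO variant. The sample text and the way the stream is damaged (flipping the last byte of the zlib trailer) are assumptions made for illustration; they are not taken from the attached file.

```python
import zlib
from io import BytesIO


def decompress_corrupted(data: bytes) -> bytes:
    """Decode byte by byte and keep whatever was produced before an error."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)
    f = BytesIO(data)
    result = b""
    buffer = f.read(1)
    try:
        while buffer:
            result += d.decompress(buffer)
            buffer = f.read(1)
    except zlib.error:
        pass
    return result


original = b"Beautiful is better than ugly.\n" * 20
good = zlib.compress(original)

# Damage only the final byte of the stream, which belongs to the checksum.
broken = good[:-1] + bytes([good[-1] ^ 0xFF])

try:
    zlib.decompress(broken)
    message = None
except zlib.error as exc:
    message = str(exc)  # the one-shot call rejects the whole stream

recovered = decompress_corrupted(broken)
print(message)
print(recovered == original)
```

The one-shot call fails even though every payload byte is intact, while the byte-by-byte loop has already collected the output by the time the checksum is verified.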

@zegrep

From time to time we come across files where no text is detected, although readers like evince display the PDF nicely. After digging, it turned out that this is because the PDF includes some badly compressed data. This may be fixed by decompressing byte by byte and ignoring the error on the last check bytes (arbitrarily found to be the last 3). This has been largely inspired by py-pdf/pypdf#422 and the test file has been taken from there, so credits to @zegrep.

@zegrep

From time to time we come across files where no text is detected, although readers like evince display the PDF nicely. After digging, it turned out that this is because the PDF includes some badly compressed data (improper checksum). This may be fixed by decompressing byte by byte and ignoring the error on the checksum bytes (arbitrarily found to be the last 4, which seems consistent with an int32 checksum). This has been largely inspired by py-pdf/pypdf#422 and the test file has been taken from there, so credits to @zegrep.
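The "last 4 bytes" observation can be checked with a small experiment. A zlib stream (RFC 1950) ends with a 4-byte big-endian Adler-32 of the uncompressed data, so a stream whose only defect is the checksum should fail exactly when its final byte is fed. The helper name first_error_index and the sample data below are hypothetical, chosen for this sketch only.

```python
import zlib


def first_error_index(data: bytes):
    """Return the index of the byte whose feeding raises zlib.error, or None."""
    d = zlib.decompressobj()
    for i in range(len(data)):
        try:
            d.decompress(data[i:i + 1])
        except zlib.error:
            return i
    return None


original = b"Now is better than never.\n" * 10
good = zlib.compress(original)

# The trailer holds the Adler-32 of the uncompressed data, big-endian.
stored_checksum = int.from_bytes(good[-4:], "big")

# Invert all four trailer bytes so the checksum is guaranteed to be wrong.
broken = good[:-4] + bytes(b ^ 0xFF for b in good[-4:])
error_at = first_error_index(broken)

print(stored_checksum == zlib.adler32(original))
print(error_at, len(broken) - 1)
```

Under these assumptions the error index is the last byte of the stream, which is why a check on the error position can separate a bad checksum from damage inside the compressed payload.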

zlib decompression fails in some cases (e.g. I have a PDF with a text overlay; it is the same issue as the one documented in py-pdf#422). With this change, decompression works without errors.

@zegrep

* Attempt to handle decompression error on some broken PDF files

  From time to time we come across files where no text is detected, although readers like evince display the PDF nicely. After digging, it turned out that this is because the PDF includes some badly compressed data. This may be fixed by decompressing byte by byte and ignoring the error on the last check bytes (arbitrarily found to be the last 3). This has been largely inspired by py-pdf/pypdf#422 and the test file has been taken from there, so credits to @zegrep.
* Use a warning instead of raising an exception where a zlib error is detected before the CRC checksum.
* Add line to CHANGELOG.md
* Only try decompressing if not in strict mode
* Change error into warning because warnings.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
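The policy described in the commit notes above (fallback only outside strict mode, a warning when the error occurs before the checksum) could look roughly like the sketch below. It is not the code that was merged: the names lenient_decompress, strict and CorruptedStreamWarning are hypothetical, and the 4-byte trailer length is an assumption based on the zlib format.

```python
import warnings
import zlib


class CorruptedStreamWarning(UserWarning):
    """Hypothetical category; warnings.warn() needs a Warning subclass."""


def lenient_decompress(data: bytes, strict: bool = False) -> bytes:
    """Decompress zlib data, tolerating damage unless strict is True."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        if strict:
            raise  # strict mode: surface the original error

    d = zlib.decompressobj()
    out = bytearray()
    for i in range(len(data)):
        try:
            out += d.decompress(data[i:i + 1])
        except zlib.error as exc:
            # An error within the last 4 bytes is taken to be a bad checksum;
            # anything earlier means decoded data was probably lost.
            if i < len(data) - 4:
                warnings.warn(
                    f"data loss while decompressing: {exc}",
                    CorruptedStreamWarning,
                )
            break
    return bytes(out)


original = b"Errors should never pass silently.\n" * 10
good = zlib.compress(original)
bad_checksum = good[:-4] + bytes(b ^ 0xFF for b in good[-4:])

# 0x78 0x9c is a valid zlib header; 0xff then announces a reserved block type.
bad_payload = b"\x78\x9c" + b"\xff" * 20

recovered = lenient_decompress(bad_checksum)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    partial = lenient_decompress(bad_payload)
```

Here a wrong checksum is recovered silently, damage inside the payload returns whatever was decoded and emits a warning, and strict=True keeps the original exception.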

MartinThoma · 2022-06-19T12:22:18Z

It seems as if this issue is gone with PyPDF2==2.3.0:

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader("zen_of_python_corrupted.pdf")
>>> reader.metadata
{'/Producer': 'GPL Ghostscript 9.18', '/CreationDate': "D:20180530133107+02'00'", '/ModDate': "D:20180530133107+02'00'", '/Title': 'zen_of_python.txt', '/Author': '', '/Creator': 'a2ps version 4.14'}

Thank you for reporting it!

PDF files with badly compressed data can throw zlib errors when the Odoo banner is added in the corner (Original Bills). These files are readable but cause a traceback when PyPDF2 tries to decompress the data. If we decompress the blocks byte by byte and ignore the error, the resulting file seems to be identical to the source file (aside from the Odoo banner). Largely inspired by py-pdf/pypdf#422 opw-3151302

sthenault mentioned this issue Jul 1, 2021

Attempt to handle decompression error on some broken PDF files pdfminer/pdfminer.six#637

Merged

5 tasks

TVC-ScS mentioned this issue Nov 24, 2021

Update filters.py according to pypdf2 (issue #422) claird/PyPDF4#101

Open

MartinThoma closed this as completed Jun 19, 2022

goffauxs mentioned this issue Mar 23, 2023

[FIX] odoo: PyPDF2 zlib incorrect data check odoo/odoo#116300

Closed
