zlib.error: Error -3 while decompressing data: incorrect data check #422
Labels
Has MCVE
A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug
From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
is-robustness-issue
From a users perspective, this is about robustness
Comments
This fix helped me deal with an |
sthenault
pushed a commit
to sthenault/pdfminer.six
that referenced
this issue
Jul 1, 2021
From time to time we come across files where no text is detected, although readers such as evince display the PDF fine. After some digging, it turned out that this is because the PDF includes some badly compressed data. This can be worked around by decompressing byte by byte and ignoring the error on the last check bytes (arbitrarily found to be the last 3). This was largely inspired by py-pdf/pypdf#422 and the test file was taken from there, so credits to @zegrep.
sthenault
pushed a commit
to sthenault/pdfminer.six
that referenced
this issue
Sep 20, 2021
From time to time we come across files where no text is detected, although readers such as evince display the PDF fine. After some digging, it turned out that this is because the PDF includes some badly compressed data (improper checksum). This can be worked around by decompressing byte by byte and ignoring the error on the checksum bytes (arbitrarily found to be the last 4, which seems consistent with an int32 checksum). This was largely inspired by py-pdf/pypdf#422 and the test file was taken from there, so credits to @zegrep.
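The "last 4 bytes" observation matches the zlib container format, in which the compressed data is followed by a 4-byte big-endian Adler-32 checksum of the uncompressed data. A minimal check using only the standard library (the `payload` value is made up for illustration):

```python
import struct
import zlib

# Hypothetical payload; any bytes would do.
payload = b"BT /F1 12 Tf (Hello) Tj ET\n" * 20
stream = zlib.compress(payload)

# The zlib trailer is the last 4 bytes, stored big-endian.
(stored_checksum,) = struct.unpack(">I", stream[-4:])

# It is the Adler-32 of the *uncompressed* data.
computed_checksum = zlib.adler32(payload)
```

If those 4 bytes are wrong, `zlib.decompress` raises the "incorrect data check" error even though the deflate data itself inflated correctly.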
sthenault
pushed a commit
to sthenault/pdfminer.six
that referenced
this issue
Sep 20, 2021
From time to time we come across files where no text is detected, although readers such as evince display the PDF fine. After some digging, it turned out that this is because the PDF includes some badly compressed data. This can be worked around by decompressing byte by byte and ignoring the error on the last check bytes (arbitrarily found to be the last 3). This was largely inspired by py-pdf/pypdf#422 and the test file was taken from there, so credits to @zegrep.
TVC-ScS
added a commit
to TVC-ScS/PyPDF4
that referenced
this issue
Nov 24, 2021
zlib decompression fails in some cases (e.g. I have a PDF with a text overlay; it is the same issue documented in py-pdf#422). With this change, decompression works without errors.
pietermarsman
added a commit
to pdfminer/pdfminer.six
that referenced
this issue
Dec 11, 2021
* Attempt to handle decompression error on some broken PDF files

  From time to time we come across files where no text is detected, although readers such as evince display the PDF fine. After some digging, it turned out that this is because the PDF includes some badly compressed data. This can be worked around by decompressing byte by byte and ignoring the error on the last check bytes (arbitrarily found to be the last 3). This was largely inspired by py-pdf/pypdf#422 and the test file was taken from there, so credits to @zegrep.
* Use a warning instead of raising an exception where a zlib error is detected before the CRC checksum.
* Add line to CHANGELOG.md
* Only try decompressing if not in strict mode
* Change error into warning because warnings.warn needs a subclass of Warning

Co-authored-by: Sylvain Thénault <sylvain.thenault@lowatt.fr>
Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
It seems as if this issue is gone with
Thank you for reporting it! |
goffauxs
added a commit
to odoo-dev/odoo
that referenced
this issue
Mar 22, 2023
PDF files with badly compressed data can throw zlib errors when the Odoo banner is added in the corner (Original Bills). These files are readable but cause a traceback when PyPDF2 tries to decompress the data. If we decompress the blocks byte by byte and ignore the error, the resulting file seems to be identical to the source file (aside from the Odoo banner). Largely inspired by py-pdf/pypdf#422 opw-3151302
We have to deal with a huge number of broken PDF files. The creator is "jsPDF 1.x-master". These files are not totally corrupted. It would be nice to get the readable content.
I found a solution on Stack Overflow, and it works fine for our needs.
PyPDF2/filters.py
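The snippet originally posted here did not survive extraction. As a rough sketch of the byte-by-byte approach described in this thread (not the exact Stack Overflow code, and not the actual PyPDF2 patch), something like the following could be used; the function name `decompress_lenient` is hypothetical:

```python
import zlib


def decompress_lenient(data: bytes) -> bytes:
    """Inflate a zlib stream, tolerating a bad trailing checksum.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    try:
        # Fast path: well-formed streams decompress in one call.
        return zlib.decompress(data)
    except zlib.error:
        pass

    # Fallback: feed the stream one byte at a time so that everything
    # inflated before the failure is kept.
    decompressor = zlib.decompressobj()
    chunks = []
    for i in range(len(data)):
        try:
            chunks.append(decompressor.decompress(data[i:i + 1]))
        except zlib.error:
            # Stop at the first error; later bytes cannot be trusted.
            break
    return b"".join(chunks)
```

Note that this silently returns partial output when the stream is damaged before the checksum, so a caller may want to log a warning or restrict it to a non-strict mode.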