Attempt to handle decompression error on some broken PDF files #637
Conversation
From time to time we come across files where no text is detected, even though readers like evince render the PDF fine. After digging, it turned out this is because the PDF includes some badly compressed data. This can be fixed by decompressing byte per byte and ignoring the error on the last check bytes (empirically found to be the last 3). This has been largely inspired by py-pdf/pypdf#422 and the test file has been taken from there, so credits to @zegrep.
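The approach described above can be sketched as a minimal standalone function (this is an illustration of the technique from py-pdf/pypdf#422, not the exact patch; only the name `decompress_corrupted` comes from the PR):

```python
import zlib

def decompress_corrupted(data: bytes) -> bytes:
    """Feed a zlib stream to the decompressor one byte at a time and
    ignore an error raised in the trailing check bytes of a corrupted
    FlateDecode stream: by then the payload has already been emitted."""
    d = zlib.decompressobj()
    result = b""
    try:
        for i in range(len(data)):
            result += d.decompress(data[i:i + 1])
    except zlib.error:
        # Only swallow the error if it happened in the last few bytes,
        # which hold the checksum rather than payload.
        if i < len(data) - 3:
            raise
    return result
```

Because the checksum sits after all of the compressed payload, an error raised while consuming those trailing bytes no longer costs any decompressed data.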
Looks good. Like the test case.
Suggested some small changes to improve code quality.
pdfminer/pdftypes.py (outdated)

    if i < len(data) - 3:
        raise
This `3` seems to be cherry-picked specifically for the PDF in question. Maybe just always ignore the error if not in `settings.STRICT` mode?
Well, it has been empirically found using this file and one other :) The idea is that the error fixed here comes from a CRC check at the end of the data, which has been found to live in the last 4 bytes. This could probably be confirmed by the zlib spec; https://docs.python.org/3/library/zlib.html#zlib.adler32 describes a 32-bit checksum, which seems consistent.
If we just plain raise the error in STRICT mode, the corrupted file won't be readable in that mode, same as before this patch. This may be fine, but it could even be done before calling decompress_corrupted.
Will add a comment about this in the code.
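The "last 4 bytes are a checksum" claim is easy to verify with Python's zlib: per the zlib format (RFC 1950), a stream ends with a 4-byte big-endian Adler-32 checksum of the uncompressed payload:

```python
import zlib

payload = b"some uncompressed data"
stream = zlib.compress(payload)

# RFC 1950: the last 4 bytes of a zlib stream are the Adler-32
# checksum of the uncompressed payload, stored big-endian.
trailer = int.from_bytes(stream[-4:], "big")
assert trailer == zlib.adler32(payload)
```

This is why an error confined to those trailing bytes can be ignored without losing any payload.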
If it is in `settings.STRICT` mode it should raise the error ASAP and not try `decompress_corrupted()` at all. That's how things are right now, so no changes needed there.
Since `decompress_corrupted()` only deals with the non-strict mode, I think it's fair to just extract as much as possible and never raise the error.
But I can also see your argument that when it raises an error before the last 4 bytes there is data missing. And we should make the user aware of that. Maybe raise a warning in that case? See pdfpage.py:140 for an example on how to do that.
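A minimal sketch of the suggested warning, following the `logger.warning` pattern the reviewer points to at pdfpage.py:140 (the helper name and message wording here are hypothetical, not from the codebase):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_truncated(i: int, total: int) -> None:
    # Hypothetical helper: if decompression failed before reaching the
    # trailing 4 checksum bytes, some payload is likely missing, so tell
    # the user instead of failing silently.
    if i < total - 3:
        logger.warning(
            "Data loss while decompressing corrupted stream "
            "(error at byte %d of %d)", i, total,
        )
```

The point is that the user still gets their partial text, but with an explicit signal that the extraction may be incomplete.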
done
> If it is in settings.STRICT mode it should raise the error asap and not try the decompress_corrupted() at all.

It is still trying `decompress_corrupted()` in strict mode. That should not happen.
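What the reviewer asks for can be sketched as a guard at the call site; the `settings` stand-in class and the function bodies here are illustrative, not the actual pdfminer code:

```python
import zlib

class settings:
    STRICT = False  # stand-in for pdfminer.settings

def apply_flate(data: bytes) -> bytes:
    """In strict mode, re-raise immediately and never fall back to the
    byte-per-byte recovery path; otherwise try to salvage the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        if settings.STRICT:
            raise  # strict mode: fail fast, no recovery attempt
        return decompress_corrupted(data)

def decompress_corrupted(data: bytes) -> bytes:
    # Recovery path (non-strict only): feed zlib one byte at a time and
    # keep whatever could be decompressed before the error.
    d = zlib.decompressobj()
    out = b""
    try:
        for i in range(len(data)):
            out += d.decompress(data[i:i + 1])
    except zlib.error:
        pass
    return out
```

With this shape, `decompress_corrupted()` is unreachable in strict mode, which is the behavior being requested.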
Force-pushed from 0433a55 to c92e3a9 ("… where zlib error is detected before the CRC checksum").
Force-pushed from 8c3a4fc to 6cce5ce.
Merged branch `develop`:
- Check blackness in github actions (pdfminer#711)
- Changed `log.info` to `log.debug` in six files (pdfminer#690)
- Update README.md batch for Continuous integration
- Update actions.yml so that it will run for all PR's
- Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
- Added feature: page labels (pdfminer#680)
- Remove obsolete returns (pdfminer#707)
- Revert "Remove obsolete returns"
- Remove obsolete returns
- Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
- Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
- Change log.info into log.debug to make pdfinterp.py less verbose
- Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
- export type annotations in package (pdfminer#679)
- fix typos in PR template (pdfminer#681)
- pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
- Fixes jbig2 writer to write valid jb2 files
- Add support for JPEG2000 image encoding
- Added test case for CCITTFaxDecoder (pdfminer#700)
- Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Fix #636