Attempt to handle decompression error on some broken PDF files #637
Conversation
From time to time we come across files where no text is detected, even though readers like evince render the PDF fine. After digging, it turned out this is because the PDF includes some badly compressed data. This can be fixed by decompressing byte per byte and ignoring the error on the last check bytes (empirically found to be the last 3). This has been largely inspired by py-pdf/pypdf#422 and the test file has been taken from there, so credits to @zegrep.
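The approach described above can be sketched as a minimal standalone function (this is an illustration of the technique from py-pdf/pypdf#422, not the exact patch; only the name `decompress_corrupted` comes from the PR):

```python
import zlib

def decompress_corrupted(data: bytes) -> bytes:
    """Feed a zlib stream to the decompressor one byte at a time and
    ignore an error raised in the trailing check bytes of a corrupted
    FlateDecode stream: by then the payload has already been emitted."""
    d = zlib.decompressobj()
    result = b""
    try:
        for i in range(len(data)):
            result += d.decompress(data[i:i + 1])
    except zlib.error:
        # Only swallow the error if it happened in the last few bytes,
        # which hold the checksum rather than payload.
        if i < len(data) - 3:
            raise
    return result
```

Because the checksum sits after all of the compressed payload, an error raised while consuming those trailing bytes no longer costs any decompressed data.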
Looks good. Like the test case.
Suggested some small changes to improve code quality.
pdfminer/pdftypes.py (outdated)

    if i < len(data) - 3:
        raise
This `3` seems to be cherry-picked specifically for the PDF in question. Maybe just always ignore the error if not in `settings.STRICT` mode?
Well, it has been empirically found using this file and one other :) The idea is that the error fixed here comes from a CRC check at the end of the data, which has been found to live in the last 4 bytes. This could probably be confirmed by the zlib spec; https://docs.python.org/3/library/zlib.html#zlib.adler32 describes a 32-bit checksum, which seems consistent.
If we just plain raise the error in STRICT mode, the corrupted file won't be readable in that mode, same as before this patch. This may be fine, but it could even be done before calling decompress_corrupted.
Will add a comment about this in the code.
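The "last 4 bytes are a checksum" claim is easy to verify with Python's zlib: per the zlib format (RFC 1950), a stream ends with a 4-byte big-endian Adler-32 checksum of the uncompressed payload:

```python
import zlib

payload = b"some uncompressed data"
stream = zlib.compress(payload)

# RFC 1950: the last 4 bytes of a zlib stream are the Adler-32
# checksum of the uncompressed payload, stored big-endian.
trailer = int.from_bytes(stream[-4:], "big")
assert trailer == zlib.adler32(payload)
```

This is why an error confined to those trailing bytes can be ignored without losing any payload.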
If it is in `settings.STRICT` mode it should raise the error ASAP and not try `decompress_corrupted()` at all. That's how things are right now, so no changes needed there.
Since `decompress_corrupted()` only deals with the non-strict mode, I think it's fair to just extract as much as possible and never raise the error.
But I can also see your argument that when it raises an error before the last 4 bytes there is data missing. And we should make the user aware of that. Maybe raise a warning in that case? See pdfpage.py:140 for an example on how to do that.
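A minimal sketch of the suggested warning, following the `logger.warning` pattern the reviewer points to at pdfpage.py:140 (the helper name and message wording here are hypothetical, not from the codebase):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_truncated(i: int, total: int) -> None:
    # Hypothetical helper: if decompression failed before reaching the
    # trailing 4 checksum bytes, some payload is likely missing, so tell
    # the user instead of failing silently.
    if i < total - 3:
        logger.warning(
            "Data loss while decompressing corrupted stream "
            "(error at byte %d of %d)", i, total,
        )
```

The point is that the user still gets their partial text, but with an explicit signal that the extraction may be incomplete.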
done
> If it is in settings.STRICT mode it should raise the error asap and not try the decompress_corrupted() at all.

It is still trying `decompress_corrupted()` in strict mode. That should not happen.
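What the reviewer asks for can be sketched as a guard at the call site; the `settings` stand-in class and the function bodies here are illustrative, not the actual pdfminer code:

```python
import zlib

class settings:
    STRICT = False  # stand-in for pdfminer.settings

def apply_flate(data: bytes) -> bytes:
    """In strict mode, re-raise immediately and never fall back to the
    byte-per-byte recovery path; otherwise try to salvage the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        if settings.STRICT:
            raise  # strict mode: fail fast, no recovery attempt
        return decompress_corrupted(data)

def decompress_corrupted(data: bytes) -> bytes:
    # Recovery path (non-strict only): feed zlib one byte at a time and
    # keep whatever could be decompressed before the error.
    d = zlib.decompressobj()
    out = b""
    try:
        for i in range(len(data)):
            out += d.decompress(data[i:i + 1])
    except zlib.error:
        pass
    return out
```

With this shape, `decompress_corrupted()` is unreachable in strict mode, which is the behavior being requested.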
Force-pushed from 0433a55 to c92e3a9 ("… where zlib error is detected before the CRC checksum").
Force-pushed from 8c3a4fc to 6cce5ce.
Merged branch `develop`:
- Check blackness in github actions (pdfminer#711)
- Changed `log.info` to `log.debug` in six files (pdfminer#690)
- Update README.md batch for Continuous integration
- Update actions.yml so that it will run for all PR's
- Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
- Added feature: page labels (pdfminer#680)
- Remove obsolete returns (pdfminer#707)
- Revert "Remove obsolete returns"
- Remove obsolete returns
- Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
- Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
- Change log.info into log.debug to make pdfinterp.py less verbose
- Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
- export type annotations in package (pdfminer#679)
- fix typos in PR template (pdfminer#681)
- pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
- Fixes jbig2 writer to write valid jb2 files
- Add support for JPEG2000 image encoding
- Added test case for CCITTFaxDecoder (pdfminer#700)
- Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Fix #636