ROB : cope with invalid length in streams #861

pubpub-zz · 2022-05-06T16:18:14Z

Addresses issue #301.

When a stream's declared length is invalid, extract the stream data by looking for the endstream marker.


codecov · 2022-05-06T16:32:59Z

Codecov Report

Merging #861 (d4932a8) into 2.0.0-dev (b580a45) will increase coverage by 0.13%.
The diff coverage is 90.90%.

@@              Coverage Diff              @@
##           2.0.0-dev     #861      +/-   ##
=============================================
+ Coverage      82.25%   82.38%   +0.13%     
=============================================
  Files             15       15              
  Lines           3640     3662      +22     
  Branches         781      787       +6     
=============================================
+ Hits            2994     3017      +23     
+ Misses           477      476       -1     
  Partials         169      169
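The figures in the coverage diff can be cross-checked by hand. The sketch below is only an illustration: it assumes that the line count is the sum of hits, misses, and partials, and that the displayed percentage is truncated rather than rounded to two decimals. Both assumptions happen to match the numbers shown, but they are not taken from Codecov's documentation, and `coverage_percent` is a hypothetical helper.

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals.

    Truncation is an assumption made to match the report's figures.
    """
    return math.floor(hits / lines * 10_000) / 100


# (hits, misses, partials) copied from the coverage diff above
base = (2994, 477, 169)
head = (3017, 476, 169)

base_lines = sum(base)  # total lines before the change
head_lines = sum(head)  # total lines after the change

base_cov = coverage_percent(base[0], base_lines)
head_cov = coverage_percent(head[0], head_lines)

# The reported change is the difference of the unrounded percentages.
delta = round(head[0] / head_lines * 100 - base[0] / base_lines * 100, 2)
```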

| Impacted Files | Coverage Δ |
|---|---|
| PyPDF2/generic.py | `86.15% <90.90%> (+0.12%)` ⬆️ |
| PyPDF2/_reader.py | `81.79% <0.00%> (+0.39%)` ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b580a45...d4932a8.

MartinThoma · 2022-05-06T19:41:18Z

PyPDF2/generic.py

+
+        def readUnsizedFromSteam(stream, pdf):
+            # we are just pointing at beginning of the stream
+            eon = getNextObjPos(stream.tell(), 2**32, [g for g in pdf.xref], pdf) - 1


What does "eon" mean?

"eon" stands for "end before next object": we have to find where to stop looking for endstream.
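The idea behind that bound can be sketched as follows. This is a simplified illustration, not the PR's `getNextObjPos`: it assumes the known object offsets are already available as a flat list of integers, whereas the real code walks the reader's xref table. The name `next_object_offset` and the sample offsets are hypothetical.

```python
from typing import Iterable


def next_object_offset(pos: int, offsets: Iterable[int], upper_bound: int) -> int:
    """Return the smallest known object offset strictly after `pos`.

    Falls back to `upper_bound` when no later object is known.
    """
    later = [off for off in offsets if pos < off < upper_bound]
    return min(later, default=upper_bound)


# Hypothetical byte offsets of objects, as an xref table might list them.
offsets = [15, 120, 480, 910]

# The stream starts somewhere inside the object at offset 120,
# so the search for "endstream" must stop before the object at 480.
eon = next_object_offset(130, offsets, 2**32) - 1
```

With no later object in the list, the function returns the upper bound itself, which mirrors the `2**32` ceiling used in the reviewed line.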

MartinThoma · 2022-05-06T19:45:47Z

PyPDF2/generic.py

+                elif not pdf.strict:
+                    stream.seek(pstart, 0)
+                    data["__streamdata__"] = readUnsizedFromSteam(stream, pdf)
+                    pos = stream.tell()


Let me check my understanding: If this part is reached, then we are in a situation in which the "endstream" marker was not found where it should be. As we are in best effort mode, we go back to the very beginning of the stream. From there, we read as much as necessary. Right?

Yes, we read up to the endstream tag.
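A minimal sketch of that fallback, assuming the file position already points at the first byte of the stream data and that an upper limit (such as the position before the next object) is known. The function name `read_unsized_stream`, the trailing end-of-line handling, and the sample bytes are assumptions for illustration; the PR's `readUnsizedFromSteam` may differ in its details.

```python
import io


def read_unsized_stream(stream: io.BytesIO, limit: int) -> bytes:
    """Read from the current position up to the first b"endstream" marker.

    Never reads past `limit`. Raises ValueError if the marker is missing.
    """
    start = stream.tell()
    chunk = stream.read(max(limit - start, 0))
    idx = chunk.find(b"endstream")
    if idx < 0:
        raise ValueError("endstream marker not found before the limit")
    # Leave the position at the marker so the caller can consume it.
    stream.seek(start + idx)
    data = chunk[:idx]
    # Drop the single end-of-line that separates the data from the marker.
    if data.endswith(b"\r\n"):
        data = data[:-2]
    elif data.endswith((b"\n", b"\r")):
        data = data[:-1]
    return data


# A stream whose /Length entry (9999) is clearly wrong.
raw = b"<< /Length 9999 >>\nstream\nHello, PDF!\nendstream\nendobj\n"
buf = io.BytesIO(raw)
buf.seek(raw.index(b"stream\n") + len(b"stream\n"))
payload = read_unsized_stream(buf, len(raw))
```

Taking the first match is a heuristic: binary stream data could contain the bytes `endstream` by chance, which is one reason to use this path only for corrupted files in non-strict mode.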

MartinThoma · 2022-05-06T19:47:34Z

PyPDF2/generic.py

+            else:
+                return getNextObjPos(p, p1, remGens[1:], pdf)
+
+        def readUnsizedFromSteam(stream, pdf):


I don't understand this function. How does it know where the stream ends? Couldn't there be multiple objects in a stream?

Can you point me to a resource where I can read up on PDF streams in general? I would appreciate that a lot 🙏

From my analysis and tests, the streams were compressed/encoded most of the time.
Also, we have to remember that this function will only be used to load some data when the PDF file is corrupted.

I picked up some information from the standard:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
and also from these:
https://www.adobe.com/technology/pdfs/presentations/KingPDFTutorial.pdf
https://www.oreilly.com/library/view/developing-with-pdf/9781449327903/ch01.html

For reference, here are also some test files:
https://github.com/pdf-association/pdf20examples

MartinThoma · 2022-05-06T19:55:59Z

You're on fire! Pretty nice! I'll try to understand it and merge it into the 2.0 version tomorrow (edit: 2.0 I mean)

The 2.0.0 release of PyPDF2 includes three core changes:

1. Dropping support for Python 3.5 and older.
2. Introducing type annotations.
3. Interface changes, mostly to have PEP8-compliant names.

We introduced a [deprecation process](#930) that hopefully helps users to avoid unexpected breaking changes.

Breaking Changes (DEP):

- PyPDF2 2.0 requires Python 3.6+. Python 2.7 and 3.5 support were dropped.
- PdfFileReader: The "warndest" parameter was removed.
- PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings` parameter. The new behavior is `overwriteWarnings=False`.
- merger: OutlinesObject was removed without replacement.
- merger.py ➔ _merger.py: You must import PdfFileMerger from PyPDF2 directly.
- utils:
  * `ConvertFunctionsToVirtualList` was removed
  * `formatWarning` was removed
  * `isInt(obj)`: Use `isinstance(obj, int)` instead
  * `u_(s)`: Use `s` directly
  * `chr_(c)`: Use `chr(c)` instead
  * `barray(b)`: Use `bytearray(b)` instead
  * `isBytes(b)`: Use `isinstance(b, type(bytes()))` instead
  * `xrange_fn`: Use `range` instead
  * `string_type`: Use `str` instead
  * `isString(s)`: Use `isinstance(s, str)` instead
  * `_basestring`: Use `str` instead
  * All Exceptions are now in `PyPDF2.errors`:
    - PageSizeNotDefinedError
    - PdfReadError
    - PdfReadWarning
    - PyPdfError
- `PyPDF2.pdf` (the `pdf` module) no longer exists. The contents were moved within the library. You should most likely import directly from `PyPDF2` instead. The `RectangleObject` is in `PyPDF2.generic`.
- The `Resources`, `Scripts`, and `Tests` will no longer be part of the distribution files on PyPI. This should have little to no impact on most people. The `Tests` are renamed to `tests`, the `Resources` are renamed to `resources`. Both are still in the git repository. The `Scripts` are now in https://github.com/py-pdf/cpdf. `Sample_Code` was moved to the `docs`.

For a full list of deprecated functions, please see the changelog of version 1.28.0.

New Features (ENH):

- Improve space setting for text extraction (#922)
- Allow setting the decryption password in PdfReader.__init__ (#920)
- Add Page.add_transformation (#883)

Bug Fixes (BUG):

- Fix error adding transformation to page without /Contents (#908)

Robustness (ROB):

- Cope with invalid length in streams (#861)

Documentation (DOC):

- Fix style of 1.25 and 1.27 patch notes (#927)
- Transformation (#907)

Developer Experience (DEV):

- Create flake8 config file (#916)
- Use relative imports (#875)

Maintenance (MAINT):

- Use Python 3.6 language features (#849)
- Add wrapper function for PendingDeprecationWarnings (#928)
- Use new PEP8 compliant names (#884)
- Explicitly represent transformation matrix (#878)
- Inline PAGE_RANGE_HELP string (#874)
- Remove unnecessary generics imports (#873)
- Remove star imports (#865)
- merger.py ➔ _merger.py (#864)
- Type annotations for all functions/methods (#854)
- Add initial type support with mypy (#853)

Testing (TST):

- Regression test for xmp_metadata converter (#923)
- Checkout submodule sample-files for benchmark
- Add text extracting performance benchmark
- Use new PyPDF2 API in benchmark (#902)
- Make test suite fail for uncaught warnings (#892)
- Remove -OO testrun from CI (#901)
- Improve tests for convert_to_int (#899)

Full Changelog: 1.28.4...2.0.0
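For code that relied on the removed `utils` helpers listed under the breaking changes, the replacements are plain Python 3 built-ins. The snippet below is a small illustration using only the standard library; the variable names and sample values are made up, and the old helper each line stands in for is named in the comment.

```python
# Type checks: the removed helpers map onto isinstance().
is_int = isinstance(42, int)               # was: isInt(42)
is_bytes = isinstance(b"%PDF", bytes)      # was: isBytes(b"%PDF")
is_str = isinstance("endstream", str)      # was: isString("endstream")

# Conversions and iteration: use the built-ins directly.
char = chr(65)                             # was: chr_(65)
buf = bytearray(b"abc")                    # was: barray(b"abc")
indices = list(range(3))                   # was: xrange_fn(3)
text = "already text"                      # was: u_("already text")
```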

MartinThoma and others added 11 commits May 2, 2022 13:32

DEP: Drop pre-Python 3.6 support (py-pdf#842)

e056e46

Most importantly, this means Python 2.7 no longer needs to be supported.

DEP: Remove PyPDF2.pdf module (py-pdf#844)

0bc9023

DEP: Remove Scripts, Resources, Tests, Sample_Code (py-pdf#845)

f4d86b9

* All of them are removed from the package distributions
* Scripts is additionally moved to the cpdf project
* Sample_Code is moved to the docs

DEP: overwriteWarnings parameter of reader/merger (py-pdf#846)

50d091a

PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings` parameter. The new behavior is `overwriteWarnings=False`. Additionally, PyPDF2.utils.formatWarning was removed.

DEP: Remove merger.OutlinesObject (py-pdf#847)

9b5e25d

Closes py-pdf#829

Change default to strict=False in PdfFileReader/PdfFileMerger (py-pdf#848)

7ed2708

It's expected that this is a more sensible default for most users.

MAINT: Use Python 3.6 language features (py-pdf#849)

116f631

As support for Python 3.5 and lower was dropped, we can use more modern syntax

MAINT: Add initial type support with mypy (py-pdf#853)

30211d9

This includes adding a type m

ROB: cope with invalid length in streams

9c0b549

Issue py-pdf#301: when a stream's declared length is invalid, extract the stream data by looking for the endstream marker.

Merge branch '2.0.0-dev' into iss301a

d22c0d6

Update test_generic.py

d4932a8

MartinThoma reviewed May 6, 2022


MartinThoma mentioned this pull request May 7, 2022

ROB : cope with invalid length in streams #862

Closed

MartinThoma merged commit e48bc6d into py-pdf:2.0.0-dev May 7, 2022

MartinThoma mentioned this pull request Jun 19, 2022

Try to handle pdf files with invalid stream length #250

Closed

pubpub-zz deleted the iss301a branch August 8, 2022 06:56
