ROB : cope with invalid length in streams #861

Merged
merged 11 commits into from
May 7, 2022

Conversation

pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented May 6, 2022

Addresses issue #301.

If a stream's declared length is invalid, extract the stream data by looking for the endstream marker.

MartinThoma and others added 11 commits May 2, 2022 13:32
Most importantly, this means Python 2.7 no longer needs to be supported
* All of them are removed from the package distributions
* Scripts is additionally moved to the cpdf project
* Sample_Code is moved to the docs
PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings` parameter.
The new behavior is `overwriteWarnings=False`.

Additionally, PyPDF2.utils.formatWarning was removed
…#848)

It's expected that this is a more sensible default for most users.
As support for Python 3.5 and lower was dropped, we can use more modern syntax
issue py-pdf#301

in case of invalid extract stream data looking for endstream

codecov bot commented May 6, 2022

Codecov Report

Merging #861 (d4932a8) into 2.0.0-dev (b580a45) will increase coverage by 0.13%.
The diff coverage is 90.90%.

@@              Coverage Diff              @@
##           2.0.0-dev     #861      +/-   ##
=============================================
+ Coverage      82.25%   82.38%   +0.13%     
=============================================
  Files             15       15              
  Lines           3640     3662      +22     
  Branches         781      787       +6     
=============================================
+ Hits            2994     3017      +23     
+ Misses           477      476       -1     
  Partials         169      169              
| Impacted Files | Coverage Δ |
|---|---|
| PyPDF2/generic.py | 86.15% <90.90%> (+0.12%) ⬆️ |
| PyPDF2/_reader.py | 81.79% <0.00%> (+0.39%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


def readUnsizedFromSteam(stream, pdf):
    # we are just pointing at beginning of the stream
    eon = getNextObjPos(stream.tell(), 2**32, [g for g in pdf.xref], pdf) - 1
Member

what does "eon" mean?

Collaborator Author

"eon" stands for "end before next object": we have to know where to stop looking for endstream.
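
The idea behind `eon` can be sketched as follows. This is a minimal illustration, assuming the object offsets from the cross-reference table are available as a plain list of integers. The helper `end_before_next_object` and its parameters are hypothetical; it is not PyPDF2's `getNextObjPos`.

```python
from bisect import bisect_right


def end_before_next_object(pos, object_offsets, fallback_end):
    """Return the last byte offset that can still belong to the object at pos.

    object_offsets: byte offsets of all objects (hypothetical input, for
    example collected from the cross-reference table).
    fallback_end: bound to use when no object starts after pos.
    """
    offsets = sorted(object_offsets)
    # Index of the first offset strictly greater than pos.
    i = bisect_right(offsets, pos)
    next_start = offsets[i] if i < len(offsets) else fallback_end
    # Stop one byte before the next object begins.
    return next_start - 1


offsets = [15, 120, 480, 900]
print(end_before_next_object(130, offsets, 2**32))  # 479
print(end_before_next_object(950, offsets, 2**32))  # 4294967295
```

Bounding the search this way keeps a scan for `endstream` from running into the data of a later object.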

Comment on lines +684 to +687
elif not pdf.strict:
    stream.seek(pstart, 0)
    data["__streamdata__"] = readUnsizedFromSteam(stream, pdf)
    pos = stream.tell()
Member

Let me check my understanding: If this part is reached, then we are in a situation in which the "endstream" marker was not found where it should be. As we are in best effort mode, we go back to the very beginning of the stream. From there, we read as much as necessary. Right?

Collaborator Author

we read up to the endstream tag
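
The fallback discussed in this thread can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the function `read_stream_data`, its parameters `declared_length`, `strict` and `search_end`, and the end-of-line handling are all assumptions made for the example.

```python
import io


def read_stream_data(stream, declared_length, strict, search_end):
    """Read stream data, positioned right after the `stream` keyword line.

    search_end is the last offset that may be scanned for `endstream`
    (hypothetical parameter, playing the role of `eon`).
    """
    start = stream.tell()
    data = stream.read(declared_length)
    # Skip end-of-line bytes and check for the marker where /Length says.
    tail = stream.read(20).lstrip(b"\r\n")
    if tail.startswith(b"endstream"):
        return data
    if strict:
        raise ValueError("endstream not found at the declared length")
    # Best effort: rewind to the start of the data and scan for the marker.
    stream.seek(start, 0)
    window = stream.read(max(search_end - start + 1, 0))
    end = window.find(b"endstream")
    if end < 0:
        raise ValueError("endstream not found before the next object")
    # Simplification: drop the end-of-line bytes in front of the marker.
    return window[:end].rstrip(b"\r\n")


body = b"hello world\nendstream\nendobj\n"
# The declared length (5) is wrong; the real data is 11 bytes long.
print(read_stream_data(io.BytesIO(body), 5, False, len(body) - 1))
```

With `strict=True` the same input raises an error instead of guessing, which mirrors the `elif not pdf.strict` guard in the reviewed lines.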

else:
    return getNextObjPos(p, p1, remGens[1:], pdf)

def readUnsizedFromSteam(stream, pdf):
Member

I don't understand this function. How does it know where the stream ends? Couldn't there be multiple objects in a stream?

Can you point me to a resource where I can read up on PDF streams in general? I would appreciate that a lot 🙏

Collaborator Author

From my analysis and tests, the streams were compressed or encoded most of the time.
Also, we have to remember that this function will only be used to load some data when the PDF file is corrupted.

I picked up some information from the standard:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
and also from these:
https://www.adobe.com/technology/pdfs/presentations/KingPDFTutorial.pdf
https://www.oreilly.com/library/view/developing-with-pdf/9781449327903/ch01.html

For reference, here are also some test files:
https://github.com/pdf-association/pdf20examples

Member

MartinThoma commented May 6, 2022

You're on fire! Pretty nice! I'll try to understand it and merge it into the 2.0 version tomorrow (edit: 2.0 I mean)

@MartinThoma MartinThoma merged commit e48bc6d into py-pdf:2.0.0-dev May 7, 2022
MartinThoma added a commit that referenced this pull request Jun 1, 2022
The 2.0.0 release of PyPDF2 includes three core changes:

1. Dropping support for Python 3.5 and older.
2. Introducing type annotations.
3. Interface changes, mostly to have PEP8-compliant names

We introduced a [deprecation process](#930)
that hopefully helps users to avoid unexpected breaking changes.

Breaking Changes (DEP):
- PyPDF2 2.0 requires Python 3.6+. Python 2.7 and 3.5 support were dropped.
- PdfFileReader: The "warndest" parameter was removed
- PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings`
  parameter. The new behavior is `overwriteWarnings=False`.
- merger: OutlinesObject was removed without replacement.
- merger.py ➔ _merger.py: You must import PdfFileMerger from PyPDF2 directly.
- utils:
  * `ConvertFunctionsToVirtualList` was removed
  * `formatWarning` was removed
  * `isInt(obj)`: Use `isinstance(obj, int)` instead
  * `u_(s)`: Use `s` directly
  * `chr_(c)`: Use `chr(c)` instead
  * `barray(b)`: Use `bytearray(b)` instead
  * `isBytes(b)`: Use `isinstance(b, bytes)` instead
  * `xrange_fn`: Use `range` instead
  * `string_type`: Use `str` instead
  * `isString(s)`: Use `isinstance(s, str)` instead
  * `_basestring`: Use `str` instead
  * All Exceptions are now in `PyPDF2.errors`:
    - PageSizeNotDefinedError
    - PdfReadError
    - PdfReadWarning
    - PyPdfError
- `PyPDF2.pdf` (the `pdf` module) no longer exists. The contents were moved within
  the library. You should most likely import directly from `PyPDF2` instead.
  The `RectangleObject` is in `PyPDF2.generic`.
- The `Resources`, `Scripts`, and `Tests` will no longer be part of the distribution
  files on PyPI. This should have little to no impact on most people. The
  `Tests` are renamed to `tests`, the `Resources` are renamed to `resources`.
  Both are still in the git repository. The `Scripts` are now in
  https://github.com/py-pdf/cpdf. `Sample_Code` was moved to the `docs`.
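
The replacements for the removed `utils` helpers in the list above are all Python 3 built-ins. A brief sketch with illustrative values (the variable names and sample data are assumptions, not taken from PyPDF2):

```python
# Built-in replacements for helpers removed from PyPDF2.utils.
obj, c, b, s = 42, 65, b"%PDF", "text"

print(isinstance(obj, int))   # was isInt(obj)
print(chr(c))                 # was chr_(c)
print(bytearray(b))           # was barray(b)
print(isinstance(b, bytes))   # was isBytes(b)
print(list(range(3)))         # was xrange_fn(3)
print(isinstance(s, str))     # was isString(s); string_type and _basestring are str
```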

For a full list of deprecated functions, please see the changelog of version
1.28.0.

New Features (ENH):
-  Improve space setting for text extraction (#922)
-  Allow setting the decryption password in PdfReader.__init__ (#920)
-  Add Page.add_transformation (#883)

Bug Fixes (BUG):
-  Fix error adding transformation to page without /Contents (#908)

Robustness (ROB):
-  Cope with invalid length in streams (#861)

Documentation (DOC):
-  Fix style of 1.25 and 1.27 patch notes (#927)
-  Transformation (#907)

Developer Experience (DEV):
-  Create flake8 config file (#916)
-  Use relative imports (#875)

Maintenance (MAINT):
-  Use Python 3.6 language features (#849)
-  Add wrapper function for PendingDeprecationWarnings (#928)
-  Use new PEP8 compliant names (#884)
-  Explicitly represent transformation matrix (#878)
-  Inline PAGE_RANGE_HELP string (#874)
-  Remove unnecessary generics imports (#873)
-  Remove star imports (#865)
-  merger.py ➔ _merger.py (#864)
-  Type annotations for all functions/methods (#854)
-  Add initial type support with mypy (#853)

Testing (TST):
-  Regression test for xmp_metadata converter (#923)
-  Checkout submodule sample-files for benchmark
-  Add text extracting performance benchmark
-  Use new PyPDF2 API in benchmark (#902)
-  Make test suite fail for uncaught warnings (#892)
-  Remove -OO testrun from CI (#901)
-  Improve tests for convert_to_int (#899)

Full Changelog: 1.28.4...2.0.0
@pubpub-zz pubpub-zz deleted the iss301a branch August 8, 2022 06:56