ROB : cope with invalid length in streams #861
Most importantly, this means Python 2.7 no longer needs to be supported
* All of them are removed from the package distributions
* Scripts is additionally moved to the cpdf project
* Sample_Code is moved to the docs
PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings` parameter. The new behavior is `overwriteWarnings=False`. Additionally, PyPDF2.utils.formatWarning was removed
…#848) It's expected that this is a more sensible default for most users.
As support for Python 3.5 and lower was dropped, we can use more modern syntax
This includes adding a type m
issue py-pdf#301: in case of an invalid length, extract the stream data by looking for endstream
Codecov Report
@@ Coverage Diff @@
## 2.0.0-dev #861 +/- ##
=============================================
+ Coverage 82.25% 82.38% +0.13%
=============================================
Files 15 15
Lines 3640 3662 +22
Branches 781 787 +6
=============================================
+ Hits 2994 3017 +23
+ Misses 477 476 -1
Partials 169 169
    def readUnsizedFromSteam(stream, pdf):
        # we are just pointing at beginning of the stream
        eon = getNextObjPos(stream.tell(), 2**32, [g for g in pdf.xref], pdf) - 1
What does "eon" mean?
"eon" stands for "end before next object": we have to know where to stop looking for `endstream`.
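To illustrate the idea of an "end before next object" bound, here is a minimal sketch. It is not the code from this PR: `end_before_next_object` and the flat `object_offsets` list are hypothetical, and PyPDF2's real xref table is organized differently.

```python
from bisect import bisect_right


def end_before_next_object(pos, object_offsets, file_size):
    """Return the last byte offset that can still belong to the current object.

    `object_offsets` is a hypothetical flat list of object start offsets;
    it is an assumption for this sketch, not PyPDF2's internal xref layout.
    """
    offsets = sorted(object_offsets)
    # Index of the first object that starts strictly after `pos`.
    i = bisect_right(offsets, pos)
    # If no object follows, the end of the file is the only bound we have.
    next_start = offsets[i] if i < len(offsets) else file_size
    return next_start - 1


offsets = [15, 120, 480, 900]
print(end_before_next_object(130, offsets, 1024))  # 479
print(end_before_next_object(950, offsets, 1024))  # 1023
```

The search for `endstream` can then be limited to the bytes up to that offset, so a missing marker does not cause a scan through the rest of the file.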
        elif not pdf.strict:
            stream.seek(pstart, 0)
            data["__streamdata__"] = readUnsizedFromSteam(stream, pdf)
            pos = stream.tell()
Let me check my understanding: If this part is reached, then we are in a situation in which the "endstream" marker was not found where it should be. As we are in best effort mode, we go back to the very beginning of the stream. From there, we read as much as necessary. Right?
We read up to the `endstream` tag.
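The fallback discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `read_stream_data` and its `limit` argument (the offset where the scan must stop) are hypothetical names.

```python
import io


def read_stream_data(stream, declared_length, limit):
    """Read stream data, falling back to a scan when /Length is wrong.

    Hypothetical helper: `limit` is the offset at which the scan stops,
    for example the start of the next object.
    """
    start = stream.tell()
    data = stream.read(declared_length)
    # A correct length is followed by an optional EOL and then `endstream`.
    trailer = stream.read(11).lstrip(b"\r\n")
    if trailer.startswith(b"endstream"):
        return data

    # Best-effort mode: go back to the start of the data and scan forward.
    stream.seek(start)
    raw = stream.read(max(limit - start, 0))
    end = raw.find(b"endstream")
    if end < 0:
        raise ValueError("no endstream marker before the next object")
    stream.seek(start + end + len(b"endstream"))

    payload = raw[:end]
    # Drop the single end-of-line marker that precedes `endstream`.
    for eol in (b"\r\n", b"\n", b"\r"):
        if payload.endswith(eol):
            payload = payload[: -len(eol)]
            break
    return payload


body = b"hello stream"
pdf_bytes = b"stream\n" + body + b"\nendstream\nendobj\n"
buf = io.BytesIO(pdf_bytes)
buf.seek(len(b"stream\n"))
# The declared length 5 is wrong, so the scan for `endstream` takes over.
print(read_stream_data(buf, 5, len(pdf_bytes)))
```

A scan like this can be fooled if the raw data itself contains the bytes `endstream`, which is one reason to restrict it to non-strict mode.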
        else:
            return getNextObjPos(p, p1, remGens[1:], pdf)


    def readUnsizedFromSteam(stream, pdf):
I don't understand this function. How does it know where the stream ends? Couldn't there be multiple objects in a stream?
Can you point me to a resource where I can read up on PDF streams in general? I would appreciate that a lot 🙏
From my analysis and tests, the streams were compressed/encoded most of the time.
Also, we have to remember that this function will only be used to load some data when the PDF file is corrupted.
I picked up some information from the standard:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
and also from these:
https://www.adobe.com/technology/pdfs/presentations/KingPDFTutorial.pdf
https://www.oreilly.com/library/view/developing-with-pdf/9781449327903/ch01.html
For reference, here are also some test files:
https://github.com/pdf-association/pdf20examples
You're on fire! Pretty nice! I'll try to understand and merge it to the 2.0 version tomorrow (edit: 2.0 I mean)
The 2.0.0 release of PyPDF2 includes three core changes:

1. Dropping support for Python 3.5 and older.
2. Introducing type annotations.
3. Interface changes, mostly to have PEP8-compliant names.

We introduced a [deprecation process](#930) that hopefully helps users to avoid unexpected breaking changes.

Breaking Changes (DEP):

- PyPDF2 2.0 requires Python 3.6+. Python 2.7 and 3.5 support were dropped.
- PdfFileReader: The "warndest" parameter was removed
- PdfFileReader and PdfFileMerger no longer have the `overwriteWarnings` parameter. The new behavior is `overwriteWarnings=False`.
- merger: OutlinesObject was removed without replacement.
- merger.py ➔ _merger.py: You must import PdfFileMerger from PyPDF2 directly.
- utils:
  * `ConvertFunctionsToVirtualList` was removed
  * `formatWarning` was removed
  * `isInt(obj)`: Use `isinstance(obj, int)` instead
  * `u_(s)`: Use `s` directly
  * `chr_(c)`: Use `chr(c)` instead
  * `barray(b)`: Use `bytearray(b)` instead
  * `isBytes(b)`: Use `isinstance(b, type(bytes()))` instead
  * `xrange_fn`: Use `range` instead
  * `string_type`: Use `str` instead
  * `isString(s)`: Use `isinstance(s, str)` instead
  * `_basestring`: Use `str` instead
  * All Exceptions are now in `PyPDF2.errors`:
    - PageSizeNotDefinedError
    - PdfReadError
    - PdfReadWarning
    - PyPdfError
- `PyPDF2.pdf` (the `pdf` module) no longer exists. The contents were moved within the library. You should most likely import directly from `PyPDF2` instead. The `RectangleObject` is in `PyPDF2.generic`.
- The `Resources`, `Scripts`, and `Tests` will no longer be part of the distribution files on PyPI. This should have little to no impact on most people. The `Tests` are renamed to `tests`, the `Resources` are renamed to `resources`. Both are still in the git repository. The `Scripts` are now in https://github.com/py-pdf/cpdf. `Sample_Code` was moved to the `docs`.

For a full list of deprecated functions, please see the changelog of version 1.28.0.
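As a quick illustration of the `utils` replacements listed above, the following sketch uses only Python 3 built-ins. The sample values are made up, and the import lines are left as comments because they need PyPDF2 2.0 installed.

```python
# Import locations named in the release notes (require PyPDF2 2.0):
#   from PyPDF2 import PdfFileMerger
#   from PyPDF2.errors import PdfReadError
#   from PyPDF2.generic import RectangleObject

value = 42      # made-up sample values for the checks below
text = "abc"
raw = b"abc"

is_int = isinstance(value, int)        # was: isInt(value)
is_bytes = isinstance(raw, bytes)      # was: isBytes(raw)
is_string = isinstance(text, str)      # was: isString(text)
letter = chr(65)                       # was: chr_(65)
buffer = bytearray(raw)                # was: barray(raw)
indices = list(range(3))               # was: xrange_fn(3)

print(is_int, is_bytes, is_string, letter, buffer, indices)
```

`isinstance(b, type(bytes()))` from the list is equivalent to `isinstance(b, bytes)`, since `type(bytes())` is simply `bytes`.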
New Features (ENH):

- Improve space setting for text extraction (#922)
- Allow setting the decryption password in PdfReader.__init__ (#920)
- Add Page.add_transformation (#883)

Bug Fixes (BUG):

- Fix error adding transformation to page without /Contents (#908)

Robustness (ROB):

- Cope with invalid length in streams (#861)

Documentation (DOC):

- Fix style of 1.25 and 1.27 patch notes (#927)
- Transformation (#907)

Developer Experience (DEV):

- Create flake8 config file (#916)
- Use relative imports (#875)

Maintenance (MAINT):

- Use Python 3.6 language features (#849)
- Add wrapper function for PendingDeprecationWarnings (#928)
- Use new PEP8 compliant names (#884)
- Explicitly represent transformation matrix (#878)
- Inline PAGE_RANGE_HELP string (#874)
- Remove unnecessary generics imports (#873)
- Remove star imports (#865)
- merger.py ➔ _merger.py (#864)
- Type annotations for all functions/methods (#854)
- Add initial type support with mypy (#853)

Testing (TST):

- Regression test for xmp_metadata converter (#923)
- Checkout submodule sample-files for benchmark
- Add text extracting performance benchmark
- Use new PyPDF2 API in benchmark (#902)
- Make test suite fail for uncaught warnings (#892)
- Remove -OO testrun from CI (#901)
- Improve tests for convert_to_int (#899)

Full Changelog: 1.28.4...2.0.0
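The release notes mention a wrapper for PendingDeprecationWarnings and the move to PEP8-compliant names. One way such a pattern could look is sketched below. The `deprecate` helper and the toy `Page` class are hypothetical stand-ins and do not reproduce PyPDF2's actual code.

```python
import warnings


def deprecate(old_name, new_name):
    """Hypothetical helper: warn that `old_name` will go away."""
    warnings.warn(
        f"{old_name} is deprecated; use {new_name} instead.",
        PendingDeprecationWarning,
        stacklevel=3,  # attribute the warning to the caller of the old alias
    )


class Page:
    """Toy stand-in for a page object, used only for this sketch."""

    def add_transformation(self, ctm):
        # New PEP8-compliant name holds the real logic.
        return f"applied {ctm}"

    def addTransformation(self, ctm):
        # Old camelCase alias kept alive during the deprecation period.
        deprecate("addTransformation", "add_transformation")
        return self.add_transformation(ctm)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = Page().addTransformation((1, 0, 0, 1, 0, 0))

print(result)
print(caught[0].category.__name__, caught[0].message)
```

Keeping the old name as a thin alias lets existing callers keep working while the warning points them at the replacement.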
issue #301
in case of an invalid length, extract the stream data by looking for endstream