ROB : cope with invalid length in streams #861
Changes from all commits
e056e46
0bc9023
f4d86b9
50d091a
9b5e25d
7ed2708
116f631
30211d9
9c0b549
d22c0d6
d4932a8
@@ -578,6 +578,29 @@ def writeToStream(self, stream, encryption_key):

     @staticmethod
     def readFromStream(stream, pdf):
+        def getNextObjPos(p, p1, remGens, pdf):
+            l = pdf.xref[remGens[0]]
+            for o in l:
+                if p1 > l[o] and p < l[o]:
+                    p1 = l[o]
+            if len(remGens) == 1:
+                return p1
+            else:
+                return getNextObjPos(p, p1, remGens[1:], pdf)
+
+        def readUnsizedFromSteam(stream, pdf):
+            # we are just pointing at beginning of the stream
+            eon = getNextObjPos(stream.tell(), 2**32, [g for g in pdf.xref], pdf) - 1
+            curr = stream.tell()
+            rw = stream.read(eon - stream.tell())
+            p = rw.find(b_("endstream"))
+            if p < 0:
+                raise PdfReadError(
+                    f"Unable to find 'endstream' marker for obj starting at {curr}."
+                )
+            stream.seek(curr + p + 9)
+            return rw[: p - 1]
+
         tmp = stream.read(2)
         if tmp != b_("<<"):
             raise PdfReadError(
Review comment (on the line assigning eon): What does "eon" mean?
Reply: "End before next object": we have to find when to stop looking for endstream.
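The "end before next object" idea from the thread above can be sketched outside PyPDF2 as follows. This is only an illustration under stated assumptions: `next_object_offset` and `read_until_endstream` are hypothetical names, not library functions, and the xref table is assumed to be a nested mapping of generation to {object id: byte offset}, which is how the diff iterates over `pdf.xref`.

```python
import io


def next_object_offset(pos, xref, default=2**32):
    """Return the smallest known object offset strictly greater than pos.

    xref is assumed to look like {generation: {object_id: byte_offset}}.
    If no object starts after pos, fall back to a large default.
    """
    candidates = [
        offset
        for objects in xref.values()
        for offset in objects.values()
        if offset > pos
    ]
    return min(candidates, default=default)


def read_until_endstream(stream, limit):
    """Read from the current position up to the 'endstream' keyword.

    The search is bounded by limit (the offset of the next object), so a
    missing keyword cannot make us run into an unrelated object.
    """
    start = stream.tell()
    raw = stream.read(max(limit - start, 0))
    idx = raw.find(b"endstream")
    if idx < 0:
        raise ValueError(f"no 'endstream' marker after offset {start}")
    # Leave the stream positioned just after the keyword.
    stream.seek(start + idx + len(b"endstream"))
    data = raw[:idx]
    # Drop the single end-of-line sequence that precedes the keyword, if any.
    for eol in (b"\r\n", b"\n", b"\r"):
        if data.endswith(eol):
            data = data[: -len(eol)]
            break
    return data


# A tiny hand-made file whose /Length entry is wrong on purpose.
pdf = (
    b"1 0 obj\n<< /Length 999 >>\nstream\nHELLO\nendstream\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
xref = {0: {1: 0, 2: pdf.index(b"2 0 obj")}}

stream = io.BytesIO(pdf)
stream.seek(pdf.index(b"stream\n") + len(b"stream\n"))
limit = next_object_offset(stream.tell(), xref)
payload = read_until_endstream(stream, limit)
```

The diff drops exactly one byte before the marker (`rw[: p - 1]`); the sketch instead strips one trailing end-of-line sequence, which is a design choice of this illustration rather than the behaviour of the PR.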
@@ -641,6 +664,7 @@ def readFromStream(stream, pdf):
                 t = stream.tell()
                 length = pdf.getObject(length)
                 stream.seek(t, 0)
+            pstart = stream.tell()
             data["__streamdata__"] = stream.read(length)
             e = readNonWhitespace(stream)
             ndstream = stream.read(8)
@@ -657,6 +681,10 @@ def readFromStream(stream, pdf):
                 if end == b_("endstream"):
                     # we found it by looking back one character further.
                     data["__streamdata__"] = data["__streamdata__"][:-1]
+                elif not pdf.strict:
+                    stream.seek(pstart, 0)
+                    data["__streamdata__"] = readUnsizedFromSteam(stream, pdf)
+                    pos = stream.tell()
                 else:
                     stream.seek(pos, 0)
                     raise PdfReadError(
Comment on lines +684 to +687
Review comment: Let me check my understanding: if this part is reached, then we are in a situation in which the "endstream" marker was not found where it should be. As we are in best-effort mode, we go back to the very beginning of the stream. From there, we read as much as necessary. Right?
Reply: We read up to the endstream tag.
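The control flow discussed in the comment on lines +684 to +687 can be illustrated with a self-contained sketch. It is a simplified stand-in, not the PR's code: `read_stream_payload` is a hypothetical name, the "look back one character" branch from the diff is omitted, and the fallback scans to the end of the input instead of stopping before the next object.

```python
import io

WHITESPACE = b" \t\r\n\x0c\x00"


def read_stream_payload(stream, length, strict=False):
    """Read a stream body, trusting length first and scanning as a fallback."""
    start = stream.tell()  # remembered so we can rewind (pstart in the diff)
    data = stream.read(length)

    # Skip whitespace, then expect the 'endstream' keyword.
    tok = stream.read(1)
    while tok and tok in WHITESPACE:
        tok = stream.read(1)
    if tok + stream.read(8) == b"endstream":
        return data  # the declared length was correct

    if strict:
        raise ValueError("declared length does not lead to 'endstream'")

    # Best-effort mode: rewind to the start of the body and search for the
    # keyword instead of trusting the declared length.
    stream.seek(start)
    raw = stream.read()
    idx = raw.find(b"endstream")
    if idx < 0:
        raise ValueError(f"no 'endstream' marker after offset {start}")
    stream.seek(start + idx + len(b"endstream"))
    # Simplification: strip every trailing CR/LF before the keyword.
    return raw[:idx].rstrip(b"\r\n")


body = b"HELLO\nendstream\nendobj\n"
good = read_stream_payload(io.BytesIO(body), length=5)
too_short = read_stream_payload(io.BytesIO(body), length=3)
too_long = read_stream_payload(io.BytesIO(body), length=500)
```

With a correct length the first branch returns immediately; with a wrong length and `strict=False` the function rewinds and recovers the same payload, which mirrors the "go back to the very beginning of the stream" step described in the review comment.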
Review comment:
I don't understand this function. How does it know where the stream ends? Couldn't there be multiple objects in a stream?
Can you point me to a resource where I can read up on PDF streams in general? I would appreciate that a lot 🙏
Reply:
From my analysis and tests, the streams were compressed/encoded most of the time.
Also, we have to remember that this function is only used to load some data when the PDF file is corrupted.
I picked up some information from the standard:
https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
and also from these:
https://www.adobe.com/technology/pdfs/presentations/KingPDFTutorial.pdf
https://www.oreilly.com/library/view/developing-with-pdf/9781449327903/ch01.html
For reference, here are also some test files:
https://github.com/pdf-association/pdf20examples