pdf set to None followed by an attribute check on pdf #1107

Closed
jlshin opened this issue Jul 14, 2022 · 1 comment · Fixed by #1113

Comments

@jlshin
Contributor

jlshin commented Jul 14, 2022

Environment

Python 3.8.13 with PyPDF2==2.5.0

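import PyPDF2

# `file` is a path or binary file object for the affected PDF (not attached).
reader = PyPDF2.PdfFileReader(file)
number_of_pages = reader.numPages
for page_number in range(number_of_pages):
    page = reader.getPage(page_number)
    page_content = page.extractText()  # raises the AttributeError shown below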

Traceback

    page_content = page.extractText()
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_page.py", line 1340, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_page.py", line 1317, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_page.py", line 1139, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1196, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1226, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1329, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic.py", line 808, in read_from_stream
    if pdf.strict:
AttributeError: 'NoneType' object has no attribute 'strict'

I cannot attach the PDF I am using, but I can explain what I think the bug is:

In ContentStream.__parse_content_stream (generic.py), read_object is called with None in place of pdf:
https://github.com/py-pdf/PyPDF2/blob/1e4c2c9b4649449241b0ae166e7e90f6bc61596d/PyPDF2/generic.py#L1226

So by the time execution reaches DictionaryObject.read_from_stream:
https://github.com/py-pdf/PyPDF2/blob/1e4c2c9b4649449241b0ae166e7e90f6bc61596d/PyPDF2/generic.py#L808-L811

the error above is raised, because pdf is None and therefore has no attribute strict.
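
To make the failure mode concrete without the PDF, here is a minimal sketch of the same call pattern. The names `FakeReader`, `read_dictionary`, and `parse_content_stream` are hypothetical stand-ins I made up for illustration; this is not PyPDF2's actual code, only the shape of the problem as I understand it.

```python
# Hypothetical stand-ins for illustration; not PyPDF2's implementation.
class FakeReader:
    strict = False


def read_dictionary(stream, pdf):
    # Unguarded attribute access, analogous to `if pdf.strict:`.
    if pdf.strict:
        raise ValueError("strict mode: malformed dictionary")
    return {}


def parse_content_stream(stream):
    # The content-stream parser passes None where a reader is expected.
    return read_dictionary(stream, None)


# With a real reader object the check works.
ok_result = read_dictionary(b"<< >>", FakeReader())

# With None, the attribute lookup itself fails.
try:
    parse_content_stream(b"<< >>")
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The point is that the exception comes from the attribute lookup on None, before any strictness logic runs.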

I have worked around it by changing line 808 to:

if pdf is not None and pdf.strict:
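
For reference, this is roughly how the guarded check behaves in isolation. `FakeReader` and `is_strict` are hypothetical names used only for this sketch (they are not part of PyPDF2), and it assumes that a missing reader should be treated as non-strict.

```python
from typing import Optional


class FakeReader:
    """Hypothetical stand-in for a reader object with a `strict` flag."""

    def __init__(self, strict: bool) -> None:
        self.strict = strict


def is_strict(pdf: Optional[FakeReader]) -> bool:
    # Short-circuit: `pdf.strict` is only evaluated when pdf is not None.
    return pdf is not None and pdf.strict


print(is_strict(None))
print(is_strict(FakeReader(strict=True)))
print(is_strict(FakeReader(strict=False)))
```

With this guard, a None reader falls through to the non-strict branch instead of raising.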
jlshin changed the title from "pdf set to None following by an attribute check on pdf" to "pdf set to None followed by an attribute check on pdf" on Jul 14, 2022
@MartinThoma
Member

As a quick fix this makes sense. I'm not sure how to make a clean solution.

Do you want to open a PR with your fix?

jlshin added a commit to jlshin/PyPDF2 that referenced this issue Jul 14, 2022
MartinThoma pushed a commit that referenced this issue Jul 15, 2022
Guard pdf.strict with check if pdf is None in DictionaryObject.read_from_stream

Closes #1107
mtd91429 pushed a commit to mtd91429/PyPDF2 that referenced this issue Jul 15, 2022
Guard pdf.strict with check if pdf is None in DictionaryObject.read_from_stream

Closes py-pdf#1107