'PdfReadError: File has not been decrypted' for unencrypted file #991

Closed
MartinThoma opened this issue Jun 14, 2022 · 3 comments

MartinThoma commented Jun 14, 2022

When trying to extract the text from a PDF, I get an exception.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.2.0

MCVE: Code and PDF

Using this PDF: https://corpora.tika.apache.org/base/docs/govdocs1/976/976028.pdf

from io import BytesIO

from PyPDF2 import PdfReader
# get_pdf_from_url comes from the PyPDF2 repository's own tests package,
# so run this from a checkout of the repository.
from tests import get_pdf_from_url

url = "https://corpora.tika.apache.org/base/docs/govdocs1/976/976028.pdf"
reader = PdfReader(BytesIO(get_pdf_from_url(url, "tika-976028.pdf")))  # PdfReadWarning: incorrect startxref pointer(1)
reader.pages[0].extract_text()

I get:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 354, in _get_num_pages
    self.decrypt("")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1617, in decrypt
    return self._decrypt(password)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1657, in _decrypt
    raise NotImplementedError(
NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1462, in __getitem__
    len_self = len(self)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1453, in __len__
    return self.length_function()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 357, in _get_num_pages
    raise PdfReadError("File has not been decrypted")
PyPDF2.errors.PdfReadError: File has not been decrypted
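
The traceback above is a chained one: the `NotImplementedError` about algorithm code 4 is raised first, and the `PdfReadError` is raised while that exception is being handled, so the message the user sees hides the more informative one. As a minimal sketch of how to surface the underlying error when debugging, the helper below walks Python's standard `__cause__` / `__context__` links. `root_cause`, `PdfReadErrorStandIn`, and `simulate_get_num_pages` are hypothetical names for illustration only; the simulation mimics the shape of the traceback and is not PyPDF2's actual code.

```python
class PdfReadErrorStandIn(Exception):
    """Hypothetical stand-in for PyPDF2.errors.PdfReadError (illustration only)."""


def root_cause(exc):
    """Follow explicit (__cause__) or implicit (__context__) chaining back to
    the first exception that was raised."""
    seen = {id(exc)}
    while True:
        nxt = exc.__cause__ if exc.__cause__ is not None else exc.__context__
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(nxt))
        exc = nxt


def simulate_get_num_pages():
    """Mimic the shape of the traceback: a second exception raised inside an
    except block, which sets __context__ on the new exception implicitly."""
    try:
        raise NotImplementedError(
            "only algorithm code 1 and 2 are supported. This PDF uses code 4"
        )
    except NotImplementedError:
        raise PdfReadErrorStandIn("File has not been decrypted")


def demo():
    """Return the original exception hidden behind the stand-in error."""
    try:
        simulate_get_num_pages()
    except PdfReadErrorStandIn as exc:
        return root_cause(exc)


if __name__ == "__main__":
    cause = demo()
    print(type(cause).__name__, "-", cause)
```

Applied to the real failure, catching `PdfReadError` and passing it to such a helper would be expected to return the `NotImplementedError`, which points at the unsupported encryption algorithm rather than at a missing decryption step.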

MartinThoma added the is-bug label Jun 14, 2022
MartinThoma self-assigned this Jun 14, 2022
MartinThoma added the workflow-text-extraction and Has MCVE labels Jun 14, 2022

MartinThoma commented

Might be related to #416

MartinThoma commented

Might change with #749

MartinThoma commented

This issue no longer occurs 🎉
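
For anyone re-checking this on their own install, a hedged sketch of the workflow the traceback implies (the library trying the empty password before reading pages) might look like the following. It takes any reader object exposing `is_encrypted`, `decrypt(password)`, and `pages`, as `PyPDF2.PdfReader` does; `first_page_text` is a hypothetical helper name, and treating a falsy return value from `decrypt` as "password not accepted" is an assumption that may not hold for every PyPDF2 version.

```python
def first_page_text(reader, password=""):
    """Return the text of page 0, trying `password` first if the reader
    reports the file as encrypted.

    `reader` is expected to behave like PyPDF2.PdfReader (assumption:
    it exposes is_encrypted, decrypt(password) and pages).
    """
    if getattr(reader, "is_encrypted", False):
        # Assumption: decrypt() returns a falsy value when the password
        # is rejected. Unsupported algorithms may raise instead, and that
        # exception is deliberately left to propagate.
        if not reader.decrypt(password):
            raise ValueError("password was not accepted")
    return reader.pages[0].extract_text()


# Usage with PyPDF2 (not executed here, requires the package and the PDF):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("tika-976028.pdf")
#   print(first_page_text(reader))
```

Letting exceptions from `decrypt` propagate keeps the real cause visible, which is the information that was hidden in the original report.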
