process_operation raises "TypeError: a bytes-like object is required, not 'dict'" #953

Closed
MartinThoma opened this issue Jun 6, 2022 · 5 comments
Assignees: MartinThoma
Labels: Has MCVE, is-bug, workflow-text-extraction

Comments

@MartinThoma
Member

When I try to extract the text from the PDF below, I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1263, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1245, in _extract_text
    process_operation(operator, operands)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1197, in process_operation
    text += operands[0].translate(cmap)
TypeError: a bytes-like object is required, not 'dict'

Fixing this issue would likely also fix #523
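
The error message itself can be reproduced without PyPDF2. The sketch below is only one plausible reading of the traceback, not something confirmed in this thread: it assumes `operands[0]` was a bytes-like string and `cmap` a dict. Plain `str` and `bytes` stand in for PyPDF2's string objects, and the `cmap` contents and the helper `try_translate` are made up for illustration.

```python
# Illustration only: plain str/bytes stand in for PyPDF2's string objects,
# and this cmap is a made-up mapping, not one taken from the PDF.
cmap = {ord("a"): "A"}


def try_translate(operand, table):
    """Return the translated value, or the TypeError message if it fails."""
    try:
        return operand.translate(table)
    except TypeError as exc:
        return f"TypeError: {exc}"


# str.translate accepts a dict that maps code points to replacements.
str_result = try_translate("abc", cmap)

# bytes.translate expects a bytes-like table, so a dict is rejected.
bytes_result = try_translate(b"abc", cmap)

print(str_result)
print(bytes_result)
```

If that reading is right, the operand would need to be decoded to `str` (or handled separately) before the cmap lookup; how #969 actually resolves it is not shown here.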

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-debian-bullseye-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.0.0 (current main - 2.1.0)

Code

PDF: https://github.com/mstamy2/PyPDF2/files/3796761/17343_2008_Order_09-Jan-2019.pdf

from PyPDF2 import PdfReader

reader = PdfReader('17343_2008_Order_09-Jan-2019.pdf')
page = reader.pages[0]
page.extract_text()
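
Until a fixed release is available, a caller could keep one failing page from aborting a whole extraction run. This is a minimal sketch, not part of PyPDF2: the helper name `extract_text_safely` is hypothetical, and it only assumes each page object has an `extract_text()` method, as `reader.pages[0]` does above.

```python
def extract_text_safely(pages):
    """Try extract_text() on every page and collect failures instead of raising.

    Returns a list of (page_index, text_or_None, error_message_or_None).
    Only TypeError is caught, since that is the exception reported here.
    """
    results = []
    for index, page in enumerate(pages):
        try:
            results.append((index, page.extract_text(), None))
        except TypeError as exc:
            # Record the failure and carry on with the remaining pages.
            results.append((index, None, str(exc)))
    return results


# Usage with the reader from the example above:
#   from PyPDF2 import PdfReader
#   reader = PdfReader('17343_2008_Order_09-Jan-2019.pdf')
#   for index, text, error in extract_text_safely(reader.pages):
#       print(index, error or text)
```

Catching only `TypeError` keeps unrelated failures visible; the affected pages still yield no text, so this is a stopgap rather than a fix.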
@MartinThoma added the is-bug label and self-assigned this on Jun 6, 2022
@MartinThoma
Member Author

@pubpub-zz Did I mess up something in my refactoring of the code you've contributed?

@MartinThoma added the workflow-text-extraction and Has MCVE labels on Jun 6, 2022
@pubpub-zz
Collaborator

I found the issue. A fix is being finalised.

@MartinThoma
Member Author

I was just pointed to another example where this issue occurs: seleniumbase/SeleniumBase#431 (comment)

@MartinThoma
Member Author

This issue is fixed in PyPDF2==2.2.0 via #969
