process_operation raises "TypeError: a bytes-like object is required, not 'dict'" #953

MartinThoma · 2022-06-06T12:24:44Z

When I try to extract the text from the PDF below, I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1263, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1245, in _extract_text
    process_operation(operator, operands)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1197, in process_operation
    text += operands[0].translate(cmap)
TypeError: a bytes-like object is required, not 'dict'

Fixing this issue would likely also fix #523
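
For readers unfamiliar with this error: the message is what Python's built-in bytes.translate() reports when it is handed a dict, whereas str.translate() accepts one. That suggests operands[0] was a bytes-like object rather than a text string at this point. The sketch below reproduces the message without PyPDF2. The names cmap, error_message and translate_operand are hypothetical, the latin-1 decoding is only an assumption for illustration, and this is not the fix that was later applied in the library.

```python
# Minimal sketch, independent of PyPDF2: why .translate(cmap) can fail.
# `cmap` is a hypothetical character map in the form str.translate expects.
cmap = {ord("A"): "a"}


def error_message(operand, cmap):
    """Return the TypeError message raised by operand.translate(cmap), or None."""
    try:
        operand.translate(cmap)
    except TypeError as exc:
        return str(exc)
    return None


def translate_operand(operand, cmap):
    """Hypothetical guard: decode bytes operands before translating.

    Assumption: latin-1 is used only because it maps every byte to exactly
    one code point; a real PDF string may need a font-specific decoding.
    """
    if isinstance(operand, bytes):
        operand = operand.decode("latin-1")
    return operand.translate(cmap)


print(error_message("AB", cmap))    # str accepts a dict, so no error
print(error_message(b"AB", cmap))   # bytes rejects a dict
print(translate_operand(b"AB", cmap))
```

The guard only shows where the type mismatch arises; whether decoding at that point is appropriate depends on how the operand was parsed from the content stream.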

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-113-generic-x86_64-with-debian-bullseye-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.0.0 (current main - 2.1.0)

Code

PDF: https://github.com/mstamy2/PyPDF2/files/3796761/17343_2008_Order_09-Jan-2019.pdf

from PyPDF2 import PdfReader

reader = PdfReader('17343_2008_Order_09-Jan-2019.pdf')
page = reader.pages[0]
page.extract_text()


MartinThoma · 2022-06-06T12:25:42Z

@pubpub-zz Did I mess up something in my refactoring of the code you've contributed?

pubpub-zz · 2022-06-08T20:03:36Z

Note: the PDF content is publicly available here:
https://indiankanoon.org/doc/184987075/?__cf_chl_tk=EllJYQaV.OcCEE.yZ_47lZ7mAOutPkoK5LWuA4IGtSY-1654718391-0-gaNycGzNCFE

pubpub-zz · 2022-06-08T20:04:05Z

I found the issue. A fix is being finalised.

MartinThoma · 2022-06-09T19:38:33Z

I was just pointed to another example where this issue occurs: seleniumbase/SeleniumBase#431 (comment)

MartinThoma · 2022-06-16T10:14:18Z

This issue is fixed in PyPDF2==2.2.0 via #969

MartinThoma added the is-bug label Jun 6, 2022

MartinThoma self-assigned this Jun 6, 2022

MartinThoma added the workflow-text-extraction and Has MCVE labels Jun 6, 2022

pubpub-zz mentioned this issue Jun 10, 2022

improved ExtractText(3) #969

Merged

MartinThoma closed this as completed Jun 16, 2022
