PyPDF2 throws exception during extract_text() #1533

lenemeth · 2023-01-06T11:00:53Z

I'm working on a script that parses PDF invoices, and I'm getting an exception while reading a PDF. This happens only with one specific type of PDF, which comes from a tap water utility provider. However, every PDF from that provider fails to parse with the same error.

Environment

Windows 10

c:\>python --version
Python 3.11.1

c:\>pip show pyPdf2
Name: PyPDF2
Version: 3.0.1
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page:
Author:
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License:
Location: C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires:
Required-by:

Code + PDF

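from PyPDF2 import PdfReader

reader = PdfReader(filePath)  # filePath: path to one of the affected invoice PDFs

for page in reader.pages:
    text = page.extract_text()  # raises the IndexError shown in the traceback below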

I can only share the PDF by email, as it contains personal data (it is an invoice). Let me know where to send it.

Traceback

Traceback (most recent call last):
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 63, in <module>
    em.parse_invoices()
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\EstateManager.py", line 22, in parse_invoices
    self.ip.parse_invoices(self.config['input_data']['invoices']['directory_path'])
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 47, in parse_invoices
    self.extract_pdf(os.path.join(directory, file))
  File "C:\Users\lenemeth\Documents\AOB\doc\Erőmű 8\albérlet\szamla_parser\InvoiceParser.py", line 63, in extract_pdf
    text = page.extract_text()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
                                             ^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lenemeth\AppData\Local\Programs\Python\Python311\Lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
                               ~~~^^^
IndexError: list index out of range


pubpub-zz · 2023-01-06T17:59:02Z

At first glance, this looks like a duplicate of #1091.
A PR with a fix has been proposed. Can you try it?

lenemeth · 2023-01-06T19:38:47Z

Thanks! I've tried it and it seems to work. However, there is now another issue: the character set of the returned text seems to be messed up, as Hungarian letters (ISO-8859-2 / "Latin-2") are unreadable:

I got this: sz♥mlakibocs♥t♦hoz t♣rt☺n☻ regisztr♥ci♦
Should look like this: számlakibocsátóhoz történő regisztráció

I'm not sure whether this is specific to this particular PDF type, but the rest of the invoices, which use a similar alphabet, look fine :)
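One possible reading of the garbled output, offered as an assumption rather than something confirmed in this thread: the symbols ♥ ♦ ♣ ☺ ☻ are the glyphs that a console using code page 437 shows for the control characters U+0003, U+0004, U+0005, U+0001 and U+0002, which would mean the raw character codes reached the output without being mapped to Unicode. A minimal sketch for checking extracted text for such characters (the helper name `find_unmapped_codes` is hypothetical, not part of PyPDF2):

```python
import unicodedata


def find_unmapped_codes(text):
    """Return (index, code point) pairs for control characters in text.

    Newline, carriage return and tab are treated as legitimate layout
    characters and are not reported.
    """
    allowed = {"\n", "\r", "\t"}
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in allowed
    ]


# Hypothetical sample: control characters standing in for accented letters.
sample = "sz\x03mlakibocs\x03t\x04hoz"
print(find_unmapped_codes(sample))
```

An empty result means no stray control characters were found; a non-empty result gives the positions worth comparing against the rendered PDF.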

pubpub-zz · 2023-01-08T08:37:46Z

@lenemeth can you please provide your PDF for review?


lenemeth · 2023-01-08T16:13:36Z

> @lenemeth can you please provide your PDF for review?

@pubpub-zz please provide an email address so that I can send it. It contains personal data (invoice) so I don't want to publicly share it. Thanks for your understanding.

MartinThoma · 2023-01-08T16:30:18Z

@lenemeth I know that @pubpub-zz values privacy and I could imagine that he wants to keep his email address private. If you want, you can send it to me and I can forward it: info@martin-thoma.de

lenemeth · 2023-01-09T11:42:32Z

@MartinThoma sent via email. Please share with @pubpub-zz privately.

MartinThoma · 2023-01-09T17:22:17Z

I did. Thanks for sharing :-)


pubpub-zz · 2023-01-09T18:30:52Z

@lenemeth,
thanks for your contribution. The extraction was buggy for Unicode cmaps whose ranges were spread over multiple lines.

Can you check that the PR now works for you? I will add a test for coverage.
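To illustrate the kind of input involved, here is a minimal sketch of reading a `bfrange` section of a ToUnicode CMap in which one entry continues on the next line. It is not the code from the PR: the function name `parse_bfrange_section` and the sample data are made up for illustration, and it assumes every destination fits in a single UTF-16 code unit for the non-array form. Tokenizing the whole section, instead of processing it line by line, makes line breaks inside an entry irrelevant.

```python
import re

# Matches a hex string such as <00E9>, or an array bracket.
TOKEN = re.compile(r"<([0-9A-Fa-f]+)>|(\[)|(\])")


def parse_bfrange_section(section):
    """Map character codes to Unicode strings for one bfrange section body."""
    tokens = [m.group(1) or m.group(2) or m.group(3) for m in TOKEN.finditer(section)]
    mapping = {}
    i = 0
    while i + 2 < len(tokens):
        low, high = int(tokens[i], 16), int(tokens[i + 1], 16)
        i += 2
        if tokens[i] == "[":
            # Array form: one destination per source code, possibly
            # continued over several lines.
            i += 1
            code = low
            while i < len(tokens) and tokens[i] != "]":
                mapping[code] = bytes.fromhex(tokens[i]).decode("utf-16-be")
                code += 1
                i += 1
            i += 1  # skip the closing bracket
        else:
            # Range form: consecutive destinations starting at one value
            # (assumed here to be a single BMP code point).
            start = int(tokens[i], 16)
            for offset in range(high - low + 1):
                mapping[low + offset] = chr(start + offset)
            i += 1
    return mapping


# Hypothetical section body; the first entry spans two lines.
section = """
<0001> <0002> [<00E9>
<0151>]
<0003> <0005> <0041>
"""
print(parse_bfrange_section(section))
```

A line-by-line parser that expects all three parts of an entry on one line would index past the end of its token list on the continuation line, which matches the `IndexError` in the traceback above.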

pubpub-zz · 2023-01-09T19:11:54Z

test file for test coverage
iss1533.pdf

lenemeth · 2023-01-09T19:59:14Z

@pubpub-zz I've checked with all of my invoice types and it works well. Thanks for the fix!

MartinThoma · 2023-01-09T20:45:39Z

Thank you for confirming that it works and thank you for sharing the PDF for investigation. We will close this issue once the PR is merged :-) I guess we will have a fixed version on PyPI on Sunday.

@pubpub-zz Thank you so much for taking care of this again 🙏


gzeng11 · 2023-07-29T02:09:29Z

I have tried to use PyPDF2 to chat with PDFs using OpenAI and LangChain. For any PDF file from which text cannot be copied, it throws "IndexError: list index out of range".

If I run the following code:

from PyPDF2 import PdfReader

reader = PdfReader(filePath)

for page in reader.pages:
    text = page.extract_text()
    print(text)

For this type of PDF file, it prints nothing.

Thanks.

Guoping

MartinThoma · 2023-07-29T05:33:13Z

PyPDF2 is deprecated. Use pypdf.
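A minimal sketch of the same loop on top of pypdf, assuming `pip install pypdf`. The helper names `extract_page_texts` and `read_pdf` are hypothetical. Catching every exception per page is a design choice for batch jobs, so that one bad page does not abort the run, not something pypdf requires. An empty string for a page may mean it has no text layer (for example, a scanned image), which is only a guess about the files described above.

```python
def extract_page_texts(reader):
    """Return one string per page; pages that fail or hold no text give ''."""
    texts = []
    for page in reader.pages:
        try:
            texts.append(page.extract_text() or "")
        except Exception as exc:  # broad on purpose: keep the batch running
            print(f"text extraction failed: {exc!r}")
            texts.append("")
    return texts


def read_pdf(path):
    """Open a PDF with pypdf and extract the text of every page."""
    from pypdf import PdfReader  # imported lazily so the helper above has no dependency

    return extract_page_texts(PdfReader(path))
```

Usage would be `texts = read_pdf("invoice.pdf")`, where the file name is a placeholder.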

MartinThoma added the workflow-text-extraction label (From a user's perspective, text extraction is the affected feature/workflow) Jan 6, 2023

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 8, 2023

FIX : fixes indexerror in cmap

395f588

First Part fixing py-pdf#1091 (late) Analysis of 'Hungarian' py-pdf#1533 still in progress

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 9, 2023

fix py-pdf#1533

ea2aec5

error with multiple lines

pubpub-zz mentioned this issue Jan 9, 2023

BUG: Fix error in cmap extraction #1544

Merged

lenemeth closed this as completed Jan 9, 2023

MartinThoma reopened this Jan 9, 2023

MartinThoma closed this as completed in #1544 Jan 21, 2023

MartinThoma pushed a commit that referenced this issue Jan 21, 2023

BUG: Fix error in cmap extraction (#1544)

c1f8742

Fixes #1533 and late #1091

MartinThoma added the is-bug label (From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) Mar 25, 2023
