Unknown widths when reading PDFs? #1714

code-me-seymour · 2023-03-15T13:55:10Z

Hi all, I'm trying to write a script that will let me find every instance of a term (from a list of terms) in each PDF in a directory. To do this, I'm using the pypdf, re, os, and glob packages, co-opting code from this post. The code (below) needs some refining so that it outputs something I can actually use, but it is otherwise working as intended. However, when it reaches one PDF, it prints out a series of messages like the following:

unknown widths : [0, IndirectObject(261, 0, 2529565096976)]

I've looked through the pypdf/PyPDF2 documentation, Stack Overflow, and this GitHub repository for details on what this means, but haven't found a clear answer. When I used print(file), the terminal printed most of the PDF, interspersed with the above messages. I should also note that the PDF in question is computer-generated (i.e., I can copy text from it), is only 35 pages long, and does not contain my search terms.

Can anyone help me understand what this message means? Ultimately I'd like to take the data this code generates and use it to build a dataframe, and I suspect this message may interfere with that. Thanks!

import glob
import os
import re
import time

from pypdf import PdfReader

start = time.time()
search_terms = ['term1', 'term2', 'term3']
# recursive=True only has an effect when the pattern contains '**';
# 'C:\\my_filepath' is a placeholder for the directory holding the PDFs
pdf_filepaths = glob.glob(os.path.join('C:\\my_filepath', '**', '*.pdf'), recursive=True)

for filepath in pdf_filepaths:
    filename = os.path.basename(filepath)
    # Open in a with-block so the file handle is closed after each PDF
    with open(filepath, "rb") as file:
        reader = PdfReader(file)
        # Count pages from 1 so the output matches what a PDF viewer shows
        for page_number, page in enumerate(reader.pages, start=1):
            page_content = page.extract_text().lower()
            for term in search_terms:
                # re.escape treats the term as literal text, not as a regex
                if re.search(re.escape(term.lower()), page_content):
                    print(f"Matched '{term}' on page {page_number} in document {filename} in {filepath}")

print(f"Program took {time.time() - start} seconds.")
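Since the goal stated above is to turn the matches into a dataframe, here is a minimal sketch of one way to collect them as rows instead of printing them. The helper name find_term_rows, the column names, and the sample page texts are all hypothetical; in the real script the page texts would come from page.extract_text().

```python
import re


def find_term_rows(filename, page_texts, search_terms):
    """Return one dict per (page, term) match; page numbers start at 1."""
    rows = []
    for page_number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        for term in search_terms:
            # re.escape keeps the term literal; findall counts every hit on the page
            count = len(re.findall(re.escape(term.lower()), lowered))
            if count:
                rows.append(
                    {"file": filename, "page": page_number, "term": term, "count": count}
                )
    return rows


# Made-up page texts standing in for the output of page.extract_text()
sample_pages = ["Term1 appears here, and term1 again.", "Nothing relevant.", "term3 only."]
rows = find_term_rows("example.pdf", sample_pages, ["term1", "term2", "term3"])
for row in rows:
    print(row)
```

A list of dicts like this can be passed straight to pandas.DataFrame(rows) if pandas is installed, with each key becoming a column.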

pubpub-zz · 2023-03-15T18:29:29Z

Maintainer

Can you share the file that triggers this issue, please?

1 reply

code-me-seymour Mar 15, 2023
Author

Sure, here's the offending document:
Ballinasloe_WS.pdf

MartinThoma · 2023-03-15T20:58:05Z

Maintainer

If you just want to get rid of those messages, see https://pypdf.readthedocs.io/en/latest/user/suppress-warnings.html
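For reference, a minimal sketch of the logging-based approach, assuming the "unknown widths" message is emitted as a warning through pypdf's logger named "pypdf". If the installed version prints the message directly instead, this will not silence it.

```python
import logging

# Assumption: pypdf emits its warnings through the logger named "pypdf"
logger = logging.getLogger("pypdf")
# Only ERROR and above get through; WARNING-level messages are dropped
logger.setLevel(logging.ERROR)
```

This has to run before the PDFs are read, and it hides every pypdf warning, not just this one, so it is worth leaving enabled while debugging.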

0 replies

pubpub-zz · 2023-03-15T21:23:15Z

Maintainer

Thanks for reporting.
This is actually a small problem. I've converted this thread into an issue (#1718) for traceability and will close this discussion.

You can have a look at the PR; you may be able to apply the change as a patch if you need a quick fix.

0 replies
