extract_text(layout=True) fails if PDF page contains no text #658

Closed
ethanscorey opened this issue May 27, 2022 · 1 comment

Describe the bug

When extracting text from a PDF page that contains no text, Page.extract_text typically returns an empty string. However, if it's run with the keyword argument layout=True, I get an IndexError.

Code to reproduce the problem

import pdfplumber

# test_blank_pdf.pdf is any PDF whose first page contains no text
with pdfplumber.open("test_blank_pdf.pdf") as pdf:
    print(pdf.pages[0].extract_text(layout=True))

PDF file

This error seems to occur with any PDF page that doesn't contain any text, so any text-less PDF file will do.

Expected behavior

Page.extract_text should return an empty string if the page contains no text, regardless of whether the layout keyword argument is True or False.

Actual behavior

Without layout=True, you get an empty string; with layout=True, you get an IndexError.
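
Until a fix is released, one possible workaround is to skip the layout call when the page has no characters. The sketch below is only an illustration: the helper name extract_layout_text_safely is hypothetical, and it assumes that a page whose chars list is empty has no words to lay out.

```python
def extract_layout_text_safely(page):
    """Return layout-preserving text, or "" for a page with no characters.

    `page` is expected to behave like a pdfplumber Page: it exposes a
    `chars` list and an `extract_text(layout=...)` method.
    """
    # Assumption: an empty `chars` list means there are no words to lay out,
    # so extract_text(layout=True) would raise the IndexError described above.
    if not page.chars:
        return ""
    return page.extract_text(layout=True)
```

With a real file this would be called as extract_layout_text_safely(pdf.pages[0]) inside the pdfplumber.open block.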

Environment

  • pdfplumber version: [0.6.1]
  • Python version: [3.10.4]
  • OS: [Ubuntu 20.04.4 on Windows Subsystem for Linux]

Additional context

This seems to be the relevant part of the traceback:

File ~/miniconda3/envs/foo/lib/python3.10/site-packages/pdfplumber/utils.py:430, in words_to_layout(words, x_density, y_density, x_shift, y_shift, y_tolerance, presorted)
    428 rendered = ""
    429 words_sorted = words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
--> 430 doctop_start = words_sorted[0]["doctop"] - words_sorted[0]["top"]
    431 for ws in cluster_objects(words_sorted, "doctop", y_tolerance):
    432     y_dist = (ws[0]["doctop"] - (doctop_start + y_shift)) / y_density

IndexError: list index out of range

I think adding a check of whether the words list in words_to_layout contains any elements should fix the error. Happy to do a pull request if you think the solution makes sense:

rendered = ""
if not words:
    return rendered
words_sorted = words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
...
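
To show the guard in isolation, here is a minimal self-contained sketch. It is not pdfplumber's implementation: words_to_layout_sketch is a hypothetical, heavily simplified stand-in that only joins words sharing the same doctop into one line, and it ignores the density, shift, and tolerance parameters. The point is just that returning early keeps words_sorted[0] from being evaluated on an empty list.

```python
from operator import itemgetter


def words_to_layout_sketch(words, presorted=False):
    """Simplified stand-in for words_to_layout (hypothetical, for illustration).

    Each word is a dict with "text", "doctop", "top", and "x0" keys.
    """
    rendered = ""
    if not words:
        # Guard: nothing to lay out, so skip the words_sorted[0] lookup below.
        return rendered
    words_sorted = words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
    doctop_start = words_sorted[0]["doctop"] - words_sorted[0]["top"]
    # Group words by their vertical offset; dicts preserve insertion order,
    # so lines come out top to bottom because the words are already sorted.
    lines = {}
    for word in words_sorted:
        lines.setdefault(word["doctop"] - doctop_start, []).append(word["text"])
    return "\n".join(" ".join(texts) for texts in lines.values())
```

Calling words_to_layout_sketch([]) returns an empty string instead of raising IndexError.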
jsvine added a commit that referenced this issue May 27, 2022

jsvine (Owner) commented May 27, 2022

Great catch! Thank you, @ethanscorey. Now fixed on develop, including a new test. (Your suggestion was a good one, but the relevant code has changed a bit since then.) Should be available in a new release soon.

jsvine closed this as completed May 27, 2022