`extract_text(layout=True)` fails if PDF page contains no text #658

ethanscorey · 2022-05-27T03:32:48Z

Describe the bug

When extracting text from a PDF page that contains no text, Page.extract_text typically returns an empty string. However, if it's run with the keyword argument layout=True, I get an IndexError.

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("test_blank_pdf.pdf") as pdf:
    # Raises IndexError when the page contains no text
    print(pdf.pages[0].extract_text(layout=True))

PDF file

This error seems to occur with any PDF page that doesn't contain any text, so any text-less PDF file will do.

Expected behavior

Page.extract_text should return an empty string if the page contains no text, regardless of whether the layout keyword argument is True or False.

Actual behavior

Without layout=True, you get an empty string; with layout=True, you get an IndexError.
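Until a release containing a fix is available, one possible caller-side workaround is to wrap the call and treat the IndexError as "no text". This is only a sketch: `extract_text_safe` is a hypothetical helper name, not part of pdfplumber, and catching IndexError this broadly could hide unrelated bugs, so it is best kept temporary.

```python
def extract_text_safe(page, **kwargs):
    """Hypothetical wrapper (not a pdfplumber API).

    Calls page.extract_text(**kwargs) and returns an empty string
    if the call raises IndexError, as it does for text-less pages
    when layout=True in the version reported in this issue.
    """
    try:
        return page.extract_text(**kwargs)
    except IndexError:
        # Assumption: the IndexError comes from the empty-words case
        # described in this issue, so "no text" is the right answer.
        return ""
```

With a real page this would be called as `extract_text_safe(pdf.pages[0], layout=True)`.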

Environment

pdfplumber version: 0.6.1
Python version: 3.10.4
OS: Ubuntu 20.04.4 on Windows Subsystem for Linux

Additional context

This seems to be the relevant part of the traceback:

File ~/miniconda3/envs/foo/lib/python3.10/site-packages/pdfplumber/utils.py:430, in words_to_layout(words, x_density, y_density, x_shift, y_shift, y_tolerance, presorted)
    428 rendered = ""
    429 words_sorted = words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
--> 430 doctop_start = words_sorted[0]["doctop"] - words_sorted[0]["top"]
    431 for ws in cluster_objects(words_sorted, "doctop", y_tolerance):
    432     y_dist = (ws[0]["doctop"] - (doctop_start + y_shift)) / y_density

IndexError: list index out of range

I think adding a check of whether the words list in words_to_layout contains any elements should fix the error. Happy to do a pull request if you think the solution makes sense:

rendered = ""
if not words:
    return rendered
words_sorted = words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
...
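To make the proposed guard concrete, here is a self-contained sketch of it in a simplified stand-in function. `words_to_layout_sketch` is a hypothetical name, and the final join is a placeholder for the real rendering logic, which is not reproduced here; only the early return and the sort mirror the snippet above.

```python
from operator import itemgetter


def words_to_layout_sketch(words, presorted=False):
    """Simplified stand-in for the start of words_to_layout.

    Not pdfplumber's actual implementation; it only illustrates
    the empty-list guard proposed in this issue.
    """
    rendered = ""
    # Proposed guard: with no words there is nothing to lay out.
    if not words:
        return rendered
    words_sorted = (
        words if presorted else sorted(words, key=itemgetter("doctop", "x0"))
    )
    # Indexing [0] is safe here because the list is non-empty.
    # Computed only to show that the access no longer raises.
    doctop_start = words_sorted[0]["doctop"] - words_sorted[0]["top"]
    # Placeholder for the real layout rendering: join the word texts.
    rendered = " ".join(w["text"] for w in words_sorted)
    return rendered
```

Calling `words_to_layout_sketch([])` returns an empty string instead of raising, which matches the expected behavior described above.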

jsvine · 2022-05-27T18:44:40Z

Great catch! Thank you, @ethanscorey. Now fixed on develop, including a new test. (Your suggestion was a good one, but the relevant code has changed a bit since then.) Should be available in a new release soon.

ethanscorey added the bug label May 27, 2022

jsvine added a commit that referenced this issue May 27, 2022

Fix .extract_text(layout=True) for text-less pages

ad3df11

Fixes #658 Thanks to @ethanscorey for flagging!

jsvine closed this as completed May 27, 2022
