Can't I split a two-column document using extract_words? #1247

waterfert · 2025-01-06T08:07:11Z

Hello,
I found that when a PDF document has two columns, I can keep the columns separate in the text output by using page.extract_text(layout=True).
However, in this case I parsed the PDF document with extract_words, table.bbox, and so on, because I also need the text of the tables.
With extract_words, the two columns come back merged, as if each row across both columns were one sentence.
So I thought I could use the deviation of the x0 value of the word between the columns. However, I found out that the x0 value of the word in the left and right columns is larger than the x0 deviation value between the columns.
Doesn't extract_text use the coordinate values of the words when it builds the text? Or am I mistaken?

jsvine · 2025-01-10T04:02:06Z

Maintainer

Unfortunately, I'm not sure I understand the question posed here. In particular, could you explain more about what you mean by this sentence?

However, I found out that the x0 value of the word in the left and right columns is larger than the x0 deviation value between the columns.

petermr · 2025-01-23T10:30:10Z

(Please correct me if I've misunderstood.)
I think the questioner wants PDFPlumber to detect automatically that this is a two-column PDF and to extract the text in reading order: the left column as flowing text, then the right column as flowing text. In some PDFs the generator already emits the text in this order, but in others the text is output row by row (where a row has constant Y). In that case the output contains

..."buoyant", "levels," , "suppliers", "in" ...

in that order.
The only way that the words can be put in the correct reading order is by:

detecting the 2-column format
creating bounding boxes for the columns
processing them in reading order
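
The three steps above can be sketched roughly as follows. This is a minimal heuristic, not part of PDFPlumber: the helper names find_gutter and reading_order, the min_gap and line_tol thresholds, and the sample words are all hypothetical. It assumes word dicts with the "x0", "x1", "top" and "text" keys that extract_words returns, and it assumes that no word spans the gutter (a full-width title or an embedded table would defeat it).

```python
def find_gutter(words, min_gap=10.0):
    """Return the x midpoint of the widest empty vertical band, or None.

    min_gap is an assumed threshold (in PDF points) for calling a band a gutter.
    """
    spans = sorted((w["x0"], w["x1"]) for w in words)
    if not spans:
        return None
    best_gap, best_mid = 0.0, None
    covered_to = spans[0][1]  # right edge of the x-range covered so far
    for x0, x1 in spans[1:]:
        gap = x0 - covered_to
        if gap > best_gap:
            best_gap, best_mid = gap, (x0 + covered_to) / 2
        covered_to = max(covered_to, x1)
    return best_mid if best_gap >= min_gap else None


def reading_order(words, min_gap=10.0, line_tol=3.0):
    """Reorder words as left column top to bottom, then right column."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        columns = [words]  # treat the page as a single column
    else:
        # Assign each word to a column by the x position of its centre.
        columns = [
            [w for w in words if (w["x0"] + w["x1"]) / 2 < gutter],
            [w for w in words if (w["x0"] + w["x1"]) / 2 >= gutter],
        ]
    ordered = []
    for col in columns:
        # Bucket "top" into coarse lines, then sort left to right within a line.
        ordered.extend(
            sorted(col, key=lambda w: (round(w["top"] / line_tol), w["x0"]))
        )
    return ordered


# Hypothetical words in row-by-row order, as a two-column page might yield them.
sample = [
    {"text": "Demand", "x0": 50, "x1": 90, "top": 100},
    {"text": "remained", "x0": 94, "x1": 140, "top": 100},
    {"text": "levels,", "x0": 300, "x1": 335, "top": 100},
    {"text": "suppliers", "x0": 339, "x1": 390, "top": 100},
    {"text": "buoyant", "x0": 50, "x1": 92, "top": 112},
    {"text": "in", "x0": 300, "x1": 310, "top": 112},
]
print(" ".join(w["text"] for w in reading_order(sample)))
```

With a real document, the list returned by page.extract_words() could be passed to reading_order in place of sample. An alternative under the same assumptions would be to crop the page to each column's bounding box with page.crop and extract each crop separately.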

I have frequently had to do this. It's heuristic. It's a similar problem to table detection and extraction. It is compounded if there are embedded tables without explicit boxes.

Lists are also a similar and difficult problem.

Extraction of text from images is very similar. (Coordinates, but no reading order).

I think this should be a tool downstream from PDFPlumber, because it could also be used for images and hOCR output.
