Can't I split a two-column document using extract_words? #1247
Replies: 2 comments
-
Unfortunately, I'm not sure I understand the question posed here. In particular, could you explain more about what you mean here?:
-
(please correct me if I've misunderstood).
in that order.
I have frequently had to do this. It's heuristic. It's a similar problem to table detection and extraction, and it is compounded if there are embedded tables without explicit boxes. Lists are a similar and difficult problem. Extracting text from images is also very similar: you get coordinates, but no reading order. I think this should be a tool downstream from pdfplumber, because it could also be used for images and hOCR output.
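To illustrate the "coordinates, but no reading order" point, here is a minimal sketch of one common heuristic: group words into lines by their vertical position, then read each line left to right. It assumes each word is a dict with `text`, `x0` and `top` keys, which is the shape `extract_words` returns and which could equally be built from OCR boxes. The function name `words_to_lines` and the `y_tolerance` default are made up for this example, and the tolerance would need tuning per document.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word dicts into text lines using only their coordinates.

    Hypothetical helper: words whose `top` lies within `y_tolerance`
    of the first word of the current line are treated as the same line.
    """
    lines = []
    # Visit words top to bottom so that each line is built up in order.
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```

On a multi-column page this would have to be applied to each column separately, otherwise words from both columns that share a baseline end up on the same line.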
-
Hello
I found out that if a PDF document consists of two columns, I can separate it into a text file using page.extract_text(layout=True).
However, in my case I parse the PDF using extract_words together with table.bbox and so on, because I also need the text of the tables.
With extract_words, the two columns come back run together as if they were one sentence.
So I thought I could separate the columns using the difference in the words' x0 values between the columns. However, I found that the x0 values of words within the left and right columns vary by more than the x0 difference between the columns.
Doesn't extract_text use the words' coordinate values when it extracts the text? Or am I mistaken?
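One possible way around the x0 problem described above is to stop comparing x0 values and instead look at the full horizontal extent (x0 to x1) of every word, then search for the widest strip of the page that no word covers. The sketch below does that on plain word dicts with `text`, `x0` and `x1` keys, as returned by `extract_words`. The names `find_gutter` and `split_columns` and the `min_gap` threshold are assumptions made for this example, not part of pdfplumber.

```python
def find_gutter(words, min_gap=10.0):
    """Return the midpoint of the widest horizontal gap that no word covers.

    Hypothetical helper. Only gaps between words are considered, so the
    page margins are ignored. Returns None if no gap reaches `min_gap`.
    """
    best_gap, best_mid = 0.0, None
    covered_to = None  # right edge of everything seen so far
    for x0, x1 in sorted((w["x0"], w["x1"]) for w in words):
        if covered_to is not None and x0 - covered_to > best_gap:
            best_gap = x0 - covered_to
            best_mid = (x0 + covered_to) / 2
        covered_to = x1 if covered_to is None else max(covered_to, x1)
    return best_mid if best_gap >= min_gap else None


def split_columns(words, min_gap=10.0):
    """Split words into [left, right] at the gutter, or [words] if none is found."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        return [words]
    left = [w for w in words if w["x1"] <= gutter]
    right = [w for w in words if w["x0"] >= gutter]
    return [left, right]
```

This is only a heuristic: a full-width heading or a table that spans both columns covers the gutter and hides it, so words inside each table.bbox would probably need to be filtered out first, and the threshold depends on the layout.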