Text extraction ignores different kinds of white spaces #107

hhaensel · 2022-10-17T15:20:38Z

Currently, all white space characters in a textbox are merged into a single space character (' ').
This makes it very difficult to extract tabular data.

In #106, I propose to introduce an extraction mode parameter that allows the user to choose between three extraction modes.

:spaces (default): all white spaces are handled as a single space character
:tabs: non-space white spaces are handled as tab characters
:boxes: text between non-space white spaces is split into several textboxes with respective coordinates

For this purpose, get_TextBox() no longer returns a tuple (text, w, h) but a vector of tuples (text, w, h, offset).
During evalContent!(), the vector is iterated to return a TextLayout for each set of box parameters.
For the modes :spaces and :tabs, get_TextBox() always returns a single-element vector, whereas in :boxes mode more than one TextLayout might be added to the output.
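
To make the three modes concrete, here is a minimal sketch of the splitting logic on a plain string. It is not the code from #106: split_textbox is a hypothetical name, width and height are left out, and offset is just the string index where a box starts, standing in for a real coordinate. It also assumes that a run of consecutive non-space white spaces counts as one separator.

```julia
# Hypothetical sketch, not PDFIO's get_TextBox(). Width and height are omitted;
# `offset` is the string index at which a box starts, standing in for a coordinate.

# Assumption: a run of non-space whitespace (tab, newline, ...) is one separator.
const NONSPACE_WS = r"[^\S ]+"

function split_textbox(text::AbstractString, mode::Symbol = :spaces)
    if mode == :spaces
        # Any run of whitespace collapses to a single space.
        return [(replace(text, r"\s+" => " "), 1)]
    elseif mode == :tabs
        # Non-space whitespace becomes a tab; runs of spaces collapse to one space.
        tabbed = replace(text, NONSPACE_WS => "\t")
        return [(replace(tabbed, r" +" => " "), 1)]
    elseif mode == :boxes
        # Cut the text at every separator and record where each piece starts.
        boxes = Tuple{String, Int}[]
        start = 1
        for m in eachmatch(NONSPACE_WS, text)
            piece = text[start:prevind(text, m.offset)]
            isempty(piece) || push!(boxes, (replace(piece, r" +" => " "), start))
            start = m.offset + ncodeunits(m.match)
        end
        piece = text[start:end]
        isempty(piece) || push!(boxes, (replace(piece, r" +" => " "), start))
        return boxes
    else
        throw(ArgumentError("unknown extraction mode: $mode"))
    end
end
```

With this sketch, split_textbox("a  b\tc", :boxes) yields two entries, ("a b", 1) and ("c", 6), while :spaces and :tabs each yield a single entry. A caller in the role of evalContent!() would loop over the returned vector and build one layout object per entry.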

The :spaces mode reproduces the current extraction behavior.
The :tabs mode is suited for extraction of "well-behaved" tabular data, i.e. tables with no empty cells, or whose empty cells contain at least a space character.
The :boxes mode is essential to extract tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.
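
As a rough illustration of the :tabs case, a row extracted in that mode could be turned into cells by splitting on the tab character. The row strings below are made up and row_cells is a hypothetical helper, not part of the package. The second row assumes an "empty" cell that still carries a space, which is the situation described above as still workable.

```julia
# Hypothetical helper: split one :tabs-mode row into trimmed cells.
row_cells(row::AbstractString) = strip.(split(row, '\t'))

# A well-behaved row: every cell has content.
full_row = row_cells("Item\tQty\tPrice")

# A row whose middle cell holds only a space; the column is preserved.
padded_row = row_cells("Item\t \tPrice")
```

Here full_row has three cells and padded_row keeps its three columns, with an empty string in the middle. A truly empty cell leaves nothing between the separators to anchor the column, which is where the coordinates provided by :boxes mode would be needed.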

@sambitdash Please comment if this sounds like a desired feature to you.
If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword arg which is passed through the text extraction function chain.
