Currently, all whitespace characters in a textbox are merged into a single space character (' ').
This makes it very difficult to extract tabular data.
In #106, I propose introducing an extraction mode parameter that lets the user choose between three extraction modes.
:spaces (default)
all white spaces are handled as a single space character
:tabs
non-space white spaces are handled as tab characters
:boxes
text between non-space white spaces is split into several textboxes with respective coordinates
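To make the difference between the three modes concrete, here is a minimal sketch on a plain string. It is not the code from #106: `split_by_mode` is a hypothetical helper, "non-space white space" is assumed to mean any whitespace character other than ' ', and the coordinates that :boxes would attach to each box are left out.

```julia
# Hypothetical helper, not part of the package: mimics the three modes on a plain string.
# Assumption: a "non-space white space" is any whitespace character other than ' '.
function split_by_mode(text::AbstractString, mode::Symbol = :spaces)
    gap = r" *[^\S ]\s*"   # a whitespace run containing at least one non-space character
    if mode === :spaces
        # Every whitespace run collapses to a single space.
        return [replace(text, r"\s+" => " ")]
    elseif mode === :tabs
        # Gaps become tabs; remaining runs of plain spaces collapse to one space.
        return [replace(replace(text, gap => "\t"), r" +" => " ")]
    elseif mode === :boxes
        # Gaps split the text into several pieces (coordinates omitted here).
        return [replace(part, r" +" => " ") for part in split(text, gap)]
    else
        throw(ArgumentError("unknown extraction mode: $mode"))
    end
end

split_by_mode("a  b\tc", :spaces)  # one string, all whitespace merged
split_by_mode("a  b\tc", :tabs)    # one string, the tab is preserved
split_by_mode("a  b\tc", :boxes)   # two strings, split at the tab
```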
For this purpose, get_TextBox() no longer returns a tuple (text, w, h) but a vector of tuples (text, w, h, offset).
During evalContent!(), the vector is iterated to produce a TextLayout for each set of box parameters.
For the modes :spaces and :tabs, get_TextBox() always returns a single-element vector, whereas in :boxes mode more than one TextLayout might be added to the output.
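The changed return shape could look roughly like the following. This is only a sketch under stated assumptions: `BoxLayout`, `text_boxes` and `add_layouts!` are hypothetical stand-ins for TextLayout, get_TextBox() and the loop inside evalContent!(), and the width arithmetic (fixed character width and gap) is invented purely for illustration.

```julia
# Hypothetical stand-in for TextLayout; the real type has different fields.
struct BoxLayout
    text::String
    x::Float64
    w::Float64
    h::Float64
end

# Stand-in for get_TextBox(): returns a vector of (text, w, h, offset) tuples.
# The fixed charwidth, height and gap values are assumptions made for this sketch.
function text_boxes(cells::Vector{String}, mode::Symbol;
                    charwidth = 5.0, height = 10.0, gap = 20.0)
    if mode === :boxes
        boxes = Tuple{String,Float64,Float64,Float64}[]
        offset = 0.0
        for c in cells
            w = charwidth * length(c)
            push!(boxes, (c, w, height, offset))
            offset += w + gap
        end
        return boxes
    end
    # :spaces and :tabs always yield a single-element vector.
    text = join(cells, mode === :tabs ? "\t" : " ")
    return [(text, charwidth * length(text), height, 0.0)]
end

# Stand-in for the iteration in evalContent!(): one layout per tuple.
function add_layouts!(out::Vector{BoxLayout}, x0::Float64, boxes)
    for (text, w, h, offset) in boxes
        push!(out, BoxLayout(text, x0 + offset, w, h))
    end
    return out
end

layouts = add_layouts!(BoxLayout[], 100.0, text_boxes(["ab", "c"], :boxes))
```

Because the caller always iterates over a vector, the :spaces and :tabs paths need no special casing; they simply contribute one layout.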
The :spaces mode reproduces the current extraction behavior.
The :tabs mode is suited to extracting "well-behaved" tabular data, i.e. tables with no empty cells, or whose empty cells contain at least a space character.
The :boxes mode is essential to extract tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.
@sambitdash Please comment if this sounds like a desired feature to you.
If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword arg which is passed through the text extraction function chain.
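For the sake of discussion, the two control options might be sketched as follows. All names here (`EXTRACTION_MODE`, `render`, `extract_global`, `extract_page`, `extract_line`) are hypothetical and only illustrate the trade-off; they are not existing functions of the package.

```julia
# Toy renderer used by both options; only distinguishes :spaces from the rest.
render(text::AbstractString, mode::Symbol) =
    mode === :spaces ? replace(text, r"\s+" => " ") : String(text)

# Option A: a global setting. No signatures change, but the state is shared
# by every caller and has to be reset by hand.
const EXTRACTION_MODE = Ref(:spaces)
extract_global(text) = render(text, EXTRACTION_MODE[])

# Option B: a keyword argument passed down the call chain. Every function in
# the chain needs the extra keyword, but each call is self-describing.
extract_line(line; mode::Symbol = :spaces) = render(line, mode)
extract_page(lines; mode::Symbol = :spaces) = [extract_line(l; mode = mode) for l in lines]

extract_page(["a\tb"]; mode = :tabs)   # the mode is visible at the call site
```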