Design proposal for inline text components #894
dolfim-ibm
started this conversation in
Ideas
Replies: 2 comments
-
Just for context, MS Word uses the following terminology: a paragraph has one or more runs, and each run is a contiguous subsequence of the paragraph with a single style. We may want to reflect that in the |
-
Inline groups have meanwhile been introduced with DS4SD/docling-core#156.
-
Which problem do we want to solve?
Inline text components (inline formulas, inline code, text styling, etc.) are not yet well defined in the `DoclingDocument`. For example, inline formulas are often broken by the text cleanup methods (e.g. escaping underscores `_` and more).
In general, knowing which inline text components are present allows us to apply specialized export functions that take care of the specific details.
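To make the problem concrete, here is a minimal sketch in plain Python. It does not use the docling API: `escape_markdown` and `export_inline` are hypothetical helpers, and the lowercase label strings are assumptions made for illustration only.

```python
def escape_markdown(text: str) -> str:
    """Hypothetical cleanup step: escape underscores so that Markdown
    does not interpret them as emphasis markers."""
    return text.replace("_", "\\_")


def export_inline(label: str, text: str) -> str:
    """Hypothetical label-aware export: formulas and code are emitted
    verbatim inside their delimiters, plain text is escaped."""
    if label == "formula":
        return f"${text}$"
    if label == "code":
        return f"`{text}`"
    return escape_markdown(text)


# Without label knowledge, the subscript in the formula is corrupted.
print(escape_markdown("x_i"))           # x\_i
# With label knowledge, the formula survives untouched.
print(export_inline("formula", "x_i"))  # $x_i$
```

The point is only that the export function needs to know the label of each inline piece before it can decide which cleanup to apply.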
Proposed approach
The proposed approach is to break the text cluster into smaller split items with homogeneous style. See the following sketch.
The single `DocItem` (e.g. with the Text label) will be split into 5 items, where the green one will carry the other label, e.g. Formula.
Wrapping these items in a new GroupItem signals to the processing code that they must be treated as inline items, e.g. merged without new lines.
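The splitting and grouping idea could look roughly like the sketch below. The classes `SketchItem` and `SketchInlineGroup` are hypothetical stand-ins, not the docling-core data model, and the example uses three items instead of five to keep it short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchItem:
    """Hypothetical split item: one label, one homogeneous piece of text."""
    label: str
    text: str


@dataclass
class SketchInlineGroup:
    """Hypothetical group whose children must be rendered inline."""
    children: List[SketchItem] = field(default_factory=list)


def render_item(item: SketchItem) -> str:
    # Formulas are kept verbatim; plain text gets underscore escaping.
    if item.label == "formula":
        return f"${item.text}$"
    return item.text.replace("_", "\\_")


def render_inline_group(group: SketchInlineGroup) -> str:
    # Children of an inline group are concatenated without new lines.
    return "".join(render_item(child) for child in group.children)


group = SketchInlineGroup(children=[
    SketchItem("text", "The value "),
    SketchItem("formula", "x_i"),
    SketchItem("text", " is below the max_len limit."),
])
print(render_inline_group(group))
# The value $x_i$ is below the max\_len limit.
```

Items outside such a group would still be separated by new lines on export; only the group wrapper changes how its children are joined.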
What about the `prov` field?
All the split items will have the same provenance bbox. Potentially we could exploit the charspan to differentiate them.
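As a rough illustration of the charspan idea, the sketch below splits one text into items that share a bounding box but carry distinct character spans. `SketchProv`, `SketchItem`, and `split_item` are hypothetical names, and the field layout (`page_no`, `bbox`, `charspan`) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SketchProv:
    """Hypothetical provenance record."""
    page_no: int
    bbox: Tuple[float, float, float, float]  # shared by all split items
    charspan: Tuple[int, int]                # [start, end) in the original text


@dataclass(frozen=True)
class SketchItem:
    label: str
    text: str
    prov: SketchProv


def split_item(
    text: str,
    page_no: int,
    bbox: Tuple[float, float, float, float],
    spans: List[Tuple[str, int, int]],
) -> List[SketchItem]:
    """Cut `text` at the given (label, start, end) spans. Every resulting
    item reuses the same bbox and differs only in its charspan."""
    return [
        SketchItem(label, text[start:end], SketchProv(page_no, bbox, (start, end)))
        for label, start, end in spans
    ]


full_text = "The value x_i is small."
spans = [("text", 0, 10), ("formula", 10, 13), ("text", 13, 23)]
items = split_item(full_text, page_no=1, bbox=(10.0, 20.0, 200.0, 32.0), spans=spans)

for item in items:
    print(item.label, repr(item.text), item.prov.charspan)
```

With this layout, the original text can be recovered by concatenating the items in charspan order, while the bbox still points at the whole text cluster.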
Do we need new labels?
For code and formula we don't need extra labels. The different styling (e.g. bold, italic, etc) will need new labels, but we can introduce them in a later step.