What should the final transformed parse structure look like? #31

schoettl · 2021-05-13T17:05:09Z

According to the worg spec, the (transformed) parse tree should look like this:

(document
 (section)
 (headline
  (section)
  (headline)
  (headline
   (headline))))

As far as I remember, this differs from the Organice parser in two ways:

The Organice parser keeps a flat list of headlines
The Organice parser does not allow a section (aka "content") above the first headline

I suggest that we stick to the orgmode spec, i.e. allow a section above the first headline and keep a hierarchical structure of headlines. document would then be our start symbol (S).

That could be implemented in the transformers in PR #27.

It will be more work later to integrate org-parser into Organice, but we get a general orgmode parser :-)

schoettl · 2021-05-13T17:11:15Z

On the other hand, a flat list of headlines and sections seems to be very pragmatic. It makes it easier to change headline level and order.

[[:section]
 [:headline]
 [:section]
 [:headline]
 …
 ]

But how do we keep a headline and its section together here? In that respect it is not pragmatic: a section should belong to its headline.

How about this?

[[:section]
 [:headline
  [:section]]
 [:headline
  [:section]]
 …
 ]

branch14 · 2021-05-17T07:08:05Z

A couple of months ago, I had a look with @munen at what organice expects as a data structure.

From what I recall, and based on the discussion in #27, I want to suggest the following (at least for depth 1 and 2):

{;; "In-buffer Settings", see https://orgmode.org/manual/In_002dbuffer-Settings.html
 :settings ...
 ;; Let's call text before the first headline the preamble. As each headline introduces a
 ;; new section the content before the first headline is a section that does not belong
 ;; to any headline.
 :preamble
 {:section {:raw ...
            :ast ...}}
 ;; a flat list of headlines with their associated sections
 :headlines
 [{:headline {:level 1
              :title "hello world"
              ...}
   :section
   {:raw "this is the first section\nthis line has *bold text*\n"
    :ast [[:text [:text-normal "this is the first section"]]
          [:text
           [:text-normal "this line has "]
           [:text-styled
            [:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}
  ...]}

munen · 2021-05-17T08:43:12Z

branch14 and I just double-checked this suggestion. It looks fine to me.

As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need or want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference which of the two org-parser actually provides.
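
To illustrate the "simple reduce" mentioned above, here is a minimal sketch of a flat-to-nested transformation. It assumes the flat shape suggested earlier in this thread (a vector of maps with `:headline {:level ...}`). The function name `nest-headlines` and the `:children` key are hypothetical and not part of org-parser.

```clojure
(defn nest-headlines
  "Turn a flat, ordered vector of headline entries into a tree.
  Each entry gains a :children vector with its direct sub-headlines."
  [headlines]
  (let [step (fn [{:keys [root path]} entry]
               (let [level     (get-in entry [:headline :level])
                     ;; keep only the open ancestors that are strictly shallower
                     ancestors (vec (take-while #(< (:level %) level) path))
                     ;; key path to the :children vector the entry is appended to
                     ks        (into [:children]
                                     (mapcat (fn [a] [(:index a) :children]))
                                     ancestors)
                     index     (count (get-in root ks))]
                 {:root (update-in root ks conj (assoc entry :children []))
                  :path (conj ancestors {:level level :index index})}))]
    (:children (reduce step {:root {:children []} :path []} headlines))))

(def flat
  [{:headline {:level 1 :title "a"}}
   {:headline {:level 2 :title "a1"}}
   {:headline {:level 2 :title "a2"}}
   {:headline {:level 1 :title "b"}}])

(def nested (nest-headlines flat))
```

Here `nested` has two top-level entries, and the first one carries "a1" and "a2" as children. A headline that skips a level (for example level 3 directly under level 1) simply becomes a child of the nearest shallower headline.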

schoettl · 2021-05-17T10:56:54Z

> branch14 and I just double-checked this suggestion. It looks fine to me.
>
> As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need or want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference which of the two org-parser actually provides.

I agree. A flat list for headlines is fine. And having the same structure (:section) in the preamble is good.

It's possible that, for some element transformations, we should keep the transformed "sub-ast" and a "sub-raw" form. Or maybe better a pair of indexes pointing to the position in the section raw string? Anyway, it might make sense in some cases, to allow re-export without discarding whitespace.

branch14 · 2021-05-27T07:01:14Z

> It's possible that, for some element transformations, we should keep the transformed "sub-ast" and a "sub-raw" form. Or maybe better a pair of indexes pointing to the position in the section raw string? Anyway, it might make sense in some cases, to allow re-export without discarding whitespace.

In order to preserve whitespace we should either (a) include whitespace in the parsed text or (b) retain whitespace in the AST as we do with empty lines. (a) is how it is currently done.

Example input: *bold text* text

Example ast (a): [:text [:text-sty-bold "bold text"] [:text-normal " text"]]

Example ast (b): [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]]

@schoettl Do you have examples for "discarding" whitespace?

Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.
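
As a rough check that both options keep the whitespace, the following sketch re-exports the two example ASTs above to Org text. The function `ast->org` is hypothetical; it only knows the tags used in the examples and assumes that `:text-sty-bold` is written with surrounding `*` markers.

```clojure
(defn ast->org
  "Concatenate the string leaves of a hiccup-style node in document order,
  re-adding the * markers around bold text."
  [node]
  (if (string? node)
    node
    (let [[tag & children] node
          inner (apply str (map ast->org children))]
      (case tag
        :text-sty-bold (str "*" inner "*")
        ;; every other tag, including :whitespace, contributes its text as-is
        inner))))

(def ast-a [:text [:text-sty-bold "bold text"] [:text-normal " text"]])
(def ast-b [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]])
```

Both `(ast->org ast-a)` and `(ast->org ast-b)` reproduce the example input, so the choice between (a) and (b) mainly affects how consumers see whitespace, not whether it survives.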

schoettl · 2021-05-27T12:19:39Z

If you search the EBNF for the regex \bs\b, you will find examples where whitespace is parsed but not stored in the AST. It's mostly leading or trailing whitespace. Leading whitespace can often be discarded and re-computed for export/rendering. Trailing whitespace can often be omitted.

list-item-line currently has problems because leading whitespace is discarded, although it is significant in nested lists.
block-begin-line discards whitespace between the begin marker and arguments
headline discards whitespace between its components
clock lines discard whitespace between their components

I've gone through it and I think that only list-item-line must be fixed. For the rest, we can discard the whitespace or just re-compute it for export. Verbatim blocks are already parsed verbatim, i.e. without discarding trailing whitespace.

> Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.

Maybe the instaparse meta information about position/span can still be used in the resulting transformed structure? Then we don't need any additional raw values and can still provide all original information.
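
A minimal sketch of the span idea, using plain Clojure metadata so that it runs without instaparse. The keys `:start` and `:end`, the function `node-raw`, and the hand-attached offsets are assumptions for illustration only. As far as I know, instaparse exposes comparable information through `instaparse.core/span`, but whether it survives our transformers would need to be checked.

```clojure
;; Raw line of a list item, including leading and trailing whitespace.
(def raw "  - item one  ")

;; Hypothetical transformed node. In this sketch the offsets are attached
;; by hand; in the parser they would have to come from the parse tree.
(def node
  (with-meta [:list-item-line "item one"]
             {:start 0 :end 14}))

(defn node-raw
  "Return the slice of raw covered by node, or nil if node has no span."
  [raw node]
  (let [{:keys [start end]} (meta node)]
    (when (and start end)
      (subs raw start end))))
```

With this, `(node-raw raw node)` gives back the original line including its whitespace, while the AST itself stays free of extra raw values.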

schoettl added the "question" label (Further information is requested) on May 13, 2021

schoettl mentioned this issue on May 14, 2021 in "[WIP] Parse content-line as text instead of just .*" #27 (closed, 4 tasks)

schoettl added the "documentation" label (Improvements or additions to documentation) on May 20, 2021

schoettl mentioned this issue on May 23, 2021 in "[wip] Add: first draft of parsing basic structure elements" #7 (closed)
