How should the final transformed parse structure look like? #31

Open
schoettl opened this issue May 13, 2021 · 6 comments
Labels
documentation Improvements or additions to documentation question Further information is requested

Comments

@schoettl
Collaborator

Hi @branch14 @munen,

According to the worg spec, the (transformed) parse tree should look like this:

(document
 (section)
 (headline
  (section)
  (headline)
  (headline
   (headline))))

As far as I remember, this is different from the Organice parser:

  • The Organice parser keeps a flat list of headlines
  • The Organice parser does not allow a section (aka "content") above the first headline

I suggest that we stick to the Org mode spec, i.e. allow a section above the first headline and keep a hierarchical structure of headlines. document would then be our start symbol (S).

That could be implemented in the transformers in PR #27.

It will be more work later to use org-parser in Organice, but we get a general Org mode parser :-)

@schoettl schoettl added the question Further information is requested label May 13, 2021
@schoettl
Collaborator Author

On the other hand, a flat list of headlines and sections seems to be very pragmatic. It makes it easier to change headline level and order.

[[:section]
 [:headline]
 [:section]
 [:headline]
 …
 ]

But how do we keep a headline and its section together here? That part is not pragmatic: a section should belong to its headline.

How about this?

[[:section]
 [:headline
  [:section]]
 [:headline
  [:section]]
 …
 ]

@branch14
Member

branch14 commented May 17, 2021

A couple of months ago, I had a look with @munen at what organice expects as a data structure.

From what I recall, and based on the discussion in #27, I want to suggest the following (at least for depth 1 and 2):

{;; "In-buffer Settings", see https://orgmode.org/manual/In_002dbuffer-Settings.html
 :settings ...
 ;; Let's call text before the first headline the preamble. As each headline introduces a
 ;; new section the content before the first headline is a section that does not belong
 ;; to any headline.
 :preamble
 {:section {:raw ...
            :ast ...}}
 ;; a flat list of headlines with their associated sections
 :headlines
 [{:headline {:level 1
              :title "hello world"
              ...}
   :section
   {:raw "this is the first section\nthis line has *bold text*\n"
    :ast [[:text [:text-normal "this is the first section"]]
          [:text
           [:text-normal "this line has "]
           [:text-styled
            [:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}
  ...]}

@munen
Contributor

munen commented May 17, 2021

branch14 and I just double checked this suggestion. It looks fine to me.

As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need or want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference which of the two org-parser actually provides.
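
To illustrate the "simple reduce" mentioned above, here is a minimal sketch in Python rather than Clojure. It assumes each flat headline is a dict with a `level` and a `title` key; the function name `nest_headlines` and the `children` key are hypothetical and not part of org-parser.

```python
from functools import reduce


def nest_headlines(flat):
    """Turn a flat list of headline dicts into a nested tree (sketch)."""
    # Hypothetical root at level 0, so every real headline has an ancestor.
    root = {"level": 0, "title": None, "children": []}

    def step(stack, headline):
        node = {**headline, "children": []}
        # Pop until the top of the stack is a strict ancestor of this headline.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
        return stack

    reduce(step, flat, [root])
    return root["children"]


flat = [
    {"level": 1, "title": "a"},
    {"level": 2, "title": "b"},
    {"level": 3, "title": "c"},
    {"level": 1, "title": "d"},
]
tree = nest_headlines(flat)
```

The accumulator is the stack of currently open ancestors, which is why a single pass over the flat list is enough. It assumes all headline levels are at least 1.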

@schoettl
Collaborator Author

> branch14 and I just double checked this suggestion. It looks fine to me.
>
> As for the 'flat' vs 'nested' structure of headlines: We think that a flat list is easier for the consumer to work with. For those who need or want a nested structure, transforming from flat to nested is a simple reduce, so it shouldn't make a big difference which of the two org-parser actually provides.

I agree. A flat list for headlines is fine. And having the same structure (:section) in the preamble is good.

It's possible that, for some element transformations, we should keep both the transformed "sub-ast" and a "sub-raw" form. Or maybe better, a pair of indexes pointing to the position in the section's raw string? Anyway, it might make sense in some cases to allow re-export without discarding whitespace.
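
The index-pair idea could look roughly like the following Python sketch. The `(tag, start, end)` span shape with an exclusive end, and the helper names `sub_raw` and `reexport`, are assumptions for illustration only and do not reflect what org-parser emits.

```python
# Section raw string with extra whitespace that a re-export should preserve.
raw = "this line has  *bold text*  \n"

# Compute the span of the bold markup instead of hard-coding offsets.
bold = "*bold text*"
start = raw.index(bold)
end = start + len(bold)

# Hypothetical span records: (tag, start, end), end exclusive.
spans = [
    ("text-normal", 0, start),
    ("text-sty-bold", start, end),
    ("text-normal", end, len(raw)),
]


def sub_raw(raw, span):
    """Recover the original text of one element from its index pair."""
    _, s, e = span
    return raw[s:e]


def reexport(raw, spans):
    """Rebuild the section from contiguous spans; whitespace is untouched."""
    return "".join(sub_raw(raw, span) for span in spans)
```

Because each element only stores two integers, no "sub-raw" string has to be duplicated, and the original whitespace stays available as long as the section's raw string is kept.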

@schoettl schoettl added the documentation Improvements or additions to documentation label May 20, 2021
@branch14
Member

> It's possible that, for some element transformations, we should keep both the transformed "sub-ast" and a "sub-raw" form. Or maybe better, a pair of indexes pointing to the position in the section's raw string? Anyway, it might make sense in some cases to allow re-export without discarding whitespace.

In order to preserve whitespace we should either (a) include whitespace in the parsed text or (b) retain whitespace in the AST as we do with empty lines. (a) is how it is currently done.

Example input: *bold text* text

Example ast (a): [:text [:text-sty-bold "bold text"] [:text-normal " text"]]

Example ast (b): [:text [:text-sty-bold "bold text"] [:whitespace " "] [:text-normal "text"]]
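
As a quick check that both variants keep the whitespace, here is a small Python sketch that renders each example AST back to Org text. The ASTs are written as nested Python lists mirroring the two examples above; the `render` function is hypothetical and only knows the three tags used here.

```python
def render(node):
    """Render a hiccup-like node (list of tag + children) back to Org text."""
    if isinstance(node, str):
        return node
    tag, *children = node
    inner = "".join(render(child) for child in children)
    # Only bold needs markers; other tags pass their content through.
    return f"*{inner}*" if tag == "text-sty-bold" else inner


# Variant (a): whitespace is part of the parsed text.
ast_a = ["text", ["text-sty-bold", "bold text"], ["text-normal", " text"]]

# Variant (b): whitespace is a separate node.
ast_b = ["text", ["text-sty-bold", "bold text"], ["whitespace", " "], ["text-normal", "text"]]

rendered_a = render(ast_a)
rendered_b = render(ast_b)
```

Both variants round-trip to the example input, so the choice between them mostly affects how convenient the AST is for consumers, not whether information is lost.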

@schoettl Do you have examples for "discarding" whitespace?

Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.

@schoettl
Collaborator Author

If you search the EBNF for the regex \bs\b, the matches are examples where whitespace is parsed but not stored in the AST. It's mostly leading or trailing whitespace. Leading whitespace can often be discarded and re-computed for export/rendering. Trailing whitespace can often be omitted.

  • list-item-line currently has problems because leading whitespace is discarded but it's important in nested lists.
  • block-begin-line discards whitespace between the begin marker and arguments
  • headline discards whitespace between its components
  • clock lines discard whitespace between their components

I've gone through them and I think that only the list-item-line must be fixed. For the rest, we can discard the whitespace or just re-compute it for export. Verbatim blocks are already parsed verbatim, i.e. not discarding trailing whitespace.
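
To show why the leading whitespace of a list-item-line matters, here is a rough Python sketch that keeps the indentation width next to the item text. The regex is a simplification and not the org-parser grammar: it only handles `-`, `+`, and numbered bullets, and it counts a tab as one character. The names `LIST_ITEM` and `parse_list_item_line` are hypothetical.

```python
import re

# Simplified list-item pattern; an assumption, not the real EBNF rule.
LIST_ITEM = re.compile(
    r"^(?P<indent>[ \t]*)(?P<bullet>[-+]|\d+[.)])\s+(?P<text>.*)$"
)


def parse_list_item_line(line):
    """Return indent width, bullet, and text, or None if not a list item."""
    m = LIST_ITEM.match(line)
    if m is None:
        return None
    return {
        "indent": len(m.group("indent")),  # kept so nesting can be rebuilt
        "bullet": m.group("bullet"),
        "text": m.group("text"),
    }


lines = ["- fruit", "  - apple", "  - pear", "- bread"]
items = [parse_list_item_line(line) for line in lines]
indents = [item["indent"] for item in items]
```

Without the `indent` value, the four items above would be indistinguishable from a flat list, which is the problem described for nested lists.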

> Passing raw for some elements is IMHO a convenience for consumers that cannot handle all elements, but it will be tricky to balance, as we cannot account for future use cases.

Maybe the instaparse meta information about position/span can still be used in the resulting transformed structure? Then we don't need any additional raw values and can still provide all original information.


3 participants