[wip] Add: first draft of parsing basic structure elements #7

gcentauri · 2020-02-10T02:54:26Z

I realized that the existing org.ebnf file is just aiming at parsing individual lines. I'm sure this will be useful, but as I started looking at property drawers I decided to try working top-down based on the org specification.

This is just a start, but it is one step on the way to building the tree structure. Currently, it does not properly switch heading levels while reading the tree. I kept everything in separate files for the time being, as it was easier for me to deal with while learning how everything works.

munen · 2020-02-10T05:22:18Z

Looks like a great start! 👍🙏🏻

gcentauri · 2020-02-18T19:46:14Z

I haven't had a lot of time to work on this recently, but I was getting stuck on how to properly define the nested structure of an org file with instaparse. Should it be possible to describe this with EBNF? I was assuming it can be due to this part of the syntax specification:

> A core concept in this syntax is that only headlines, sections, planning lines and property drawers are context-free. Every other syntactical part only exists within specific environments.

But that may assume we just take the headlines and count stars, and perhaps process the data structure into a tree after parsing the document. I was having trouble figuring out how to determine whether we're jumping up just one level when the next headline comes or coming out of multiple levels of nesting, for example:

* One
** Two
*** Three
* One, Again

My original attempt worked to go from one to two and back, but adding a third level subheading showed me it was naive. The parser just goes up one level.
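One way to handle the multi-level jump is to count stars in a first pass and nest the headlines in a second pass with a stack. The following is only a minimal sketch in Python (not the project's Clojure/instaparse code); the names `parse_headlines` and `build_tree` and the dict layout are assumptions made for illustration.

```python
import re

# A headline is one or more stars, a space, then the title.
HEADLINE = re.compile(r"^(\*+) (.*)$")

def parse_headlines(text):
    """First pass: one (level, title) pair per headline line."""
    headlines = []
    for line in text.splitlines():
        match = HEADLINE.match(line)
        if match:
            headlines.append((len(match.group(1)), match.group(2)))
    return headlines

def build_tree(headlines):
    """Second pass: nest the flat list using a stack of open headlines."""
    root = {"level": 0, "title": None, "children": []}
    stack = [root]
    for level, title in headlines:
        node = {"level": level, "title": title, "children": []}
        # Close every open headline at the same or a deeper level,
        # however many levels that is (this is the multi-level jump).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

sample = "* One\n** Two\n*** Three\n* One, Again\n"
tree = build_tree(parse_headlines(sample))
```

With the sample above, "One, Again" ends up as a sibling of "One" because the loop pops "Three", "Two", and "One" before attaching the new node.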

I don't want to invest too much time into it if it isn't going to work, so if anyone here has thoughts I'd love to hear them :) Thank you!

schoettl · 2020-04-17T20:29:51Z

Hi @gcentauri,
I did not look into your PR in great detail, but I think the missing link is the transform function from instaparse.

http://xahlee.info/clojure/clojure_instaparse.html -> Function: transform

Haven't tried it yet, but it seems like the map argument to this function is the place where we transform the very basic parsed structure to a higher-level structure.

E.g. from

[:headline [:stars "**"] [:title "test"]]

to

{title: "test", level: 2} // JS hash, I don't know yet the corresponding clojure syntax ^^
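For reference, the Clojure map literal for that target would be `{:title "test" :level 2}`. The node-to-map step itself can be sketched as follows. This is an illustrative Python version, not instaparse code: the parse node is modelled as a nested list with string tags, and `headline_to_map` is a hypothetical helper name.

```python
def headline_to_map(node):
    """Turn ["headline", ["stars", "**"], ["title", "test"]] into a dict."""
    tag, *children = node
    if tag != "headline":
        raise ValueError(f"expected a headline node, got {tag!r}")
    # Each child is a [tag, text] pair; index them by tag.
    fields = {child[0]: child[1] for child in children}
    # The heading level is simply the number of stars.
    return {"title": fields["title"], "level": len(fields["stars"])}

result = headline_to_map(["headline", ["stars", "**"], ["title", "test"]])
```

Here `result` holds the title "test" and level 2, matching the target structure shown above.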

Currently I'm working on the timestamps PR. I probably need to transform them to a higher-level structure, too.

gcentauri · 2020-04-20T18:00:26Z

@schoettl - thanks for the insight! i was getting that feeling too, i'm just very new to parsing stuff.

so it seems like indeed the line-based approach might be the first pass, and then we have another pass to take the structure generated by that and turn it into the proper tree structure?

i'd like to get back to this soon. i just felt stuck.

schoettl · 2020-04-20T18:10:23Z

I think you're right about that second transforming pass on the parse tree. I opened #8 to discuss to what extent this transformation is in scope of this project.

Anyway, I think the plain parsing to a flat list of headers is a very important first step!

branch14 · 2020-05-13T07:45:34Z

@gcentauri, @schoettl First of all, let me thank you for your interest in and work on this project. Your thoughts and efforts are greatly appreciated!

In my first attempts I actually did follow the idea of identifying semantic blocks rather than "only" lines. But it got tricky rather quickly. While I didn't encounter any formal reason not to continue with semantic blocks, I felt that it would make it really hard for others to contribute. Hence I decided to proceed with the much simpler line-based approach (or, as I put it in 4a4563f, "the sane way"). Org-mode is a line-based format where greater blocks and other semantic units are made up of lines, after all. I expect that following this observation when building a parser will keep things simple.

As the parse tree that results from a line-based approach does not yield a data structure that resembles the document nicely (i.e. the structure one would like to work with), a second "transform" step will be required, much like @schoettl pointed out.

While Instaparse's transform function is nice, I don't feel there is much gain in using it on a line-based parse tree. Instead, a couple of classic map and reduce calls should do the job. I'll be happy to provide an example of how to do that.
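To illustrate the kind of reduce step meant here, the following is a rough Python sketch (not the code later proposed in #15). It folds a flat list of already-parsed lines into sections, one per headline. The `(kind, value)` line shape and the `add_line` name are assumptions made for the example.

```python
from functools import reduce

def add_line(sections, line):
    """Reducer: start a new section on a headline, else extend the last one."""
    kind, value = line
    if kind == "headline":
        return sections + [{"headline": value, "content": []}]
    if not sections:
        # Lines before the first headline go into an untitled section.
        sections = [{"headline": None, "content": []}]
    *done, last = sections
    return done + [{**last, "content": last["content"] + [value]}]

lines = [
    ("content", "preamble"),
    ("headline", "One"),
    ("content", "first line"),
    ("content", "second line"),
    ("headline", "Two"),
]
sections = reduce(add_line, lines, [])
```

The reducer never mutates its input, so the same function works unchanged with any fold implementation.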

Having said that and looking at the progress you made with #11, I'm totally open to other approaches. The PR reminds me of my attempts just before I gave up on the idea to have the grammar do the heavy lifting and decided to go with the simpler line based and a subsequent transformation. So I wonder where you're at.

branch14 · 2020-05-13T10:09:48Z

Here I laid out what the code for the transformation could look like: #15

schoettl · 2020-05-13T13:31:23Z

> Having said that and looking at the progress you made with #11, I'm totally open to other approaches. The PR reminds me of my attempts just before I gave up on the idea to have the grammar do the heavy lifting and decided to go with the simpler line based and a subsequent transformation. So I wonder where you're at.

I think that it's good to combine both approaches: parsing of semantic blocks where possible, and line-based parsing where it gets messy with EBNF.

I already wrote EBNF for property-drawers and I think it's pretty clean. For example, tables should be easy to implement as semantic objects in EBNF, too. Same goes for "verbatim containers" like #+BEGIN_EXAMPLE. Here it makes sense to parse the contents directly as raw text.

On the other hand, if we do that stuff in the transformation step it's much more coding with conditionals, map/reduce, ... Similar to what is implemented in organice or other orgmode parser libraries.

But I agree, using EBNF can get messy or impractical. One example is #+BEGIN_xxx and #+END_xxx, where the xxx can be anything but must match. AFAIK this cannot be accomplished in our EBNF unless we hardcode all possible xxx (src, example, center, quote, ...).
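Outside the grammar, the "xxx must match" constraint can be checked with a regular-expression backreference. The following is a minimal Python sketch under that assumption; the pattern, the `find_blocks` name, and the decision to lowercase the block name are illustrative choices, not part of this project.

```python
import re

# (?P=name) is a backreference: the END_ name must repeat the BEGIN_ name.
BLOCK = re.compile(
    r"^#\+BEGIN_(?P<name>\w+)\b[^\n]*\n(?P<body>.*?)^#\+END_(?P=name)[ \t]*$",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)

def find_blocks(text):
    """Return (name, body) for every block whose BEGIN/END names match."""
    return [(m.group("name").lower(), m.group("body")) for m in BLOCK.finditer(text)]

text = (
    "#+BEGIN_SRC clojure\n"
    "(+ 1 2)\n"
    "#+END_SRC\n"
    "#+BEGIN_QUOTE\n"
    "unterminated\n"
    "#+END_EXAMPLE\n"
)
blocks = find_blocks(text)
```

In this input only the SRC block is returned; the QUOTE block is skipped because its closing line names a different block type.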

So I'd vote for putting as much "syntax comprehension" in the EBNF as possible, as long as it can be expressed cleanly. The rest can be done in the transformation.

munen · 2020-05-17T08:37:10Z

> I think that it's good to combine both approaches: parsing of semantic blocks where possible, and line-based parsing where it gets messy with EBNF.
>
> ...
>
> So I'd vote for putting as much "syntax comprehension" in the EBNF as long as it can be expressed cleanly. The rest can be done in the transformation.

I discussed this with branch14 and we both agree. This is a sane and pragmatic approach. Let's continue like this.

It's also very nice that there are good examples for both options now^^

gcentauri · 2020-06-17T18:23:50Z

I'd like to get back to this sometime :) been busy with the crazy year that has been 2020. But Lisp keeps coming back to me and org mode has always been a love of mine too. i'll keep watching the repo and see where i can help. it was probably a bit impetuous of me to think i could figure out how to do the top down parsing over the line-based approach already begun :)

schoettl · 2020-06-17T21:37:26Z

You triggered a very good discussion, @gcentauri :) A lot has happened since then. I'll get back to #11 as soon as I can. It probably makes sense to build on that one to prevent conflicts.

munen · 2020-06-18T07:16:34Z

Thank you for your contribution, @gcentauri! All the best to you and your family 🙏

schoettl · 2021-05-23T21:12:11Z

Hey @gcentauri,

I suggest we close this PR.

A lot has changed since last year. We have now laid out a structure for the parse result (#31). I've also implemented parsing of some block-like elements as semantic units (instead of line-based parsing). This semantic parsing has to be enabled step by step (#32). I'll start with that after other open PRs are merged.

branch14 · 2021-05-27T07:27:35Z

🙏

Add: first draft of parsing basic structure elements

1c42342

gcentauri mentioned this pull request Feb 10, 2020

[wip] Add: sample.org file and basic regression test setup #5

Closed

schoettl mentioned this pull request Apr 17, 2020

Question on this project's goal #8

Closed

munen mentioned this pull request May 17, 2020

Transform parse tree #15

Merged

munen force-pushed the master branch from 836aced to 6b988ba Compare January 4, 2021 16:03

schoettl mentioned this pull request May 15, 2021

Parse semantic blocks where appropriate #32

Open

9 tasks

gcentauri closed this May 27, 2021
