Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract directive parsing into subparser #3020

Merged
merged 22 commits into from
Sep 29, 2021

Conversation

mvriel
Copy link
Member

@mvriel mvriel commented Sep 24, 2021

No description provided.

As a first step into migrating Directive handling to a subparser, I have
created the Subparser but limited its interactions to the creation/init
part.

Since the implementation currently creates a Directive Handler and
assigns that to a property in the DocumentParser, I can get away with
this slim step. Subsequent commits will migrate more into the Subparser.
ContentBlocks

In this change, I had to make a couple more extensive changes in order
for Directives to work. Directives are treated as a sort of pushdown
system where as a property directiveParser has a non-null value; then
the next Code node is added as a ContentBlock.

This too should be changed in a future commit to properly support nested
/ tree parsing but we need to take small steps.
After having reworked a series of Node types into substates, I have been
able to get a glimpse of the larger system. It seems, the original
author attempted a line-based parser that could backtrack (hence, why it
could re-parse every line until a parse was successful).

During research on how others have approached and fixed this issue, I
found references to LL(1) or Recursive Descend Parsers without
backtracking and with 1 token look ahead. When I compare that method, it
seems very suitable in this situation and requires a bit of rework to
introduce "productions". This term is taken from LL(1) parser theory to
indicate a means to start at a given token (line in our case) and
construct a product or AST node by moving the cursor forward.

This deviates from the original implementation where the movement of the
cursor was driven by a foreach statement. By changing this to a while
statement it is not a problem is productions themselves move the cursor
forward.
The title is tightly coupled to sections, which are structural
elements, and through this it means that we have to rework the parser a
bit. In this change, I have promoted the DocumentParser to a Production
Rule as a first step towards nested production rules and I have reworked
parts so that section handling is actually done through the TitleRule.

This is not the final phase, but I had to reduce the complexity of the
refactoring.

After this commit, the parser is in a broken state unfortunately because
it is no longer possible to use a strangler pattern to keep it working
with old and new components. Subsequent commits will restore
functionality by moving more and more pieces of the parser into
production rules.
During implementation, I found out I had to tweak parts of the
recognition of characters for quote rules because of issues with
whitelines. Since a QuoteBlock recreates a parser and passes its
contents through it (to be changed), it ended up in an infinite loop
when the content merely contained an empty string.
As the next step in finalizing the design of the parser for Restructured
Text, I am migrating directive handling to a production rule. This is
also a good example how we can re-use other Production Rules to get a
compound rule.

The implementation is a bit rough, and there are one or two bugs
remaining (I lost the last line of text in a note?) but it is a good
starting point
@mvriel mvriel marked this pull request as ready for review September 28, 2021 19:39
Copy link
Member

@jaapio jaapio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look for bugs but just at the bigger picture. And I think I can follow what you are doing here.
Having a production for each state which is able to move the iterator forward makes a lot of sense to me. A minor issue would be that a production rule is able to step back. I would not allow this in the iterator at all. Each line must be peeked by a look ahead or consumed. If we would allow rules to move before they were triggered we could get some odd bugs.

src/Guides/RestructuredText/Parser/DocumentIterator.php Outdated Show resolved Hide resolved
src/Guides/RestructuredText/Parser/DocumentIterator.php Outdated Show resolved Hide resolved
src/Guides/RestructuredText/Parser/DocumentParser.php Outdated Show resolved Hide resolved
src/Guides/RestructuredText/Parser/DocumentIterator.php Outdated Show resolved Hide resolved
@wouterj
Copy link
Contributor

wouterj commented Sep 29, 2021

Interesting! I'll have a closer look at these changes this week.
This is getting closer to the state machine parser used by docutils, which should allow following the specs easier.

@jaapio jaapio deleted the feature/extract-directive-parsing-into-subparser branch September 29, 2021 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants