Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plugin: Adding a first version of block parser using a pegjs grammar #301

Merged
merged 6 commits into from
Mar 23, 2017

Conversation

youknowriad
Copy link
Contributor

@youknowriad youknowriad commented Mar 21, 2017

in this PR, I'm adding a v1 for a block parser using a pegjs grammar.

Parsing a block like this:

<-- wp:myBlockType myAttr:value -->
<p>HTML content</p>
<-- /wp:myBlockType -->

returns the following output

[
  { 
    blockType: 'myBlockType',
    attrs: { myAttr: 'value' },
    rawContent: '<p>HTML content</p>'
  }
]

This approach means we need another step for parsing. This step would be part of the block API to extract attributes from the rawContent depending on the blockType.

@youknowriad youknowriad added the Framework Issues related to broader framework topics, especially as it relates to javascript label Mar 21, 2017
@youknowriad youknowriad self-assigned this Mar 21, 2017
@youknowriad youknowriad requested review from mtias and aduth March 21, 2017 12:19
@youknowriad youknowriad force-pushed the add/grammar-parsing branch from e17ad5e to dad8f83 Compare March 21, 2017 12:20
@aduth
Copy link
Member

aduth commented Mar 21, 2017

I know this is pretty blocking, but there's some assumptions in the grammar that will change perhaps significantly depending on the API we arrive at for blocks (#104).

For example:

This approach means we need another step for parsing.

Ideally not, whether this is through treating state values explicitly (including text) or an API from the block's own rendering function to "retrieve" values.

@youknowriad
Copy link
Contributor Author

youknowriad commented Mar 21, 2017

an API from the block's own rendering function to "retrieve" values.

this is still another parsing step, because a block needs to define the Minimum required attributes to retrieve from the content (could be just a simple array), and fallback to html if not present.

My idea was to get this and #289 merged. That way, we can try the "retrieve" values approach.

@aduth
Copy link
Member

aduth commented Mar 21, 2017

because a block needs to define the Minimum required attributes to retrieve from the content (could be just a simple array)

Does it? Why can't we infer an empty/unset value at the time of rendering?

@youknowriad
Copy link
Contributor Author

Does it? Why can't we infer an empty/unset value at the time of rendering?

I was assuming some blocks have mandatory attributes. Maybe not after reflexion. But you agree with me that this step (the grammar) is required (or replaced by a dom parser) in either of these approaches

@aduth
Copy link
Member

aduth commented Mar 21, 2017

Yeah, I'd be inclined to simplify the grammar even further from the start to avoid even concepts of attributes in comments. Not sure the effort involved to do this, but if we can at least try to avoid relying on those assumptions too much in implementation, this could be a good base to start with 👍

nylen
nylen previously requested changes Mar 21, 2017
} }

WP_Block_End
= "<!--" __ "/wp" __ "-->"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how to do it but this needs to include the block name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok making this change if we reach a consensus. I feel that it hasn't been discussed yet.

I'm personally neutral, I don't think it adds much value, (I see the HTML as a serialisation that should not be updated frequently by hand, it's our job to make the best possible editor to avoid having to use the HTML editor) but I understand that some might find it more readable.

@aduth @mtias thoughts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"needs" is a strong word 😄

Honestly I care very little one way or the other.

Copy link
Member

@nylen nylen Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as necessary for several reasons:

  • A fallback so that we can reliably detect more failure conditions. What if someone edits the HTML(+ comment delimiters) and accidentally transposes part of a block into another block? We need to be able to throw a parse error here, and if our ending delimiters are just generic /wp, we can't really do that.
  • For complicated posts, WP post content is very much "tag soup" already, and eventually as people start using the block-based editor these comment delimiters will become the same way. Especially with very long blocks, it needs to be clear both to humans and computers reading the post content not just that a block is ending, but which block is ending along with the beginning of a new block.
  • If we ever want to be able to support nested blocks this becomes even more of a necessity. (Conversely, if we never support nested blocks, this doesn't hurt anything.)

As usual I am trying to argue in favor of the most robust approach possible from the very start, because we're building a product which will need to support a lot of complicated, varied, and unexpected use cases. I think that if we don't do this now, we will wish we had, but it'll be too late to change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing a little bit of devil's advocate: How is this any different than how the JavaScript runtime is smart enough to interpret the closing curly brace } of an if condition. Other languages might make this more explicit e.g. fi, endif but certainly doesn't seem indicative of a poorly constructed syntax.

Copy link
Contributor

@BE-Webdesign BE-Webdesign Mar 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't "need" to be there but definitely should be. It will make it easy for human eyeballs rather than computer eyeballs. Imagine trying to read html or xml without a matching closing tag, a parser could technically understand it but it would be a lot less friendly to work with.

I think the points nylen brought up are all very compelling as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the closing curly brace } of an if condition

You're right, this is a problem. We should consider writing the plugin code in VBScript to alleviate it.

Trolling aside, I can think of a couple of relevant differences between C-style languages and HTML:

  • HTML is designed to be deeply nested, whereas in code this is generally considered an antipattern, and readability of nesting is managed by strict indentation rules.
  • HTML has far more "block types" than JavaScript, and we are introducing the capability to define an arbitrary number of other tags in addition to the standard ones.

I also think of what we're building here as an extension of HTML. Therefore it makes sense to me to have fairly similar tag semantics to our "parent language".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the purposes of these is to remain as out of the way as possible. A few people have asked me to not even have closing tags ;)

That said, I'm ok with either way now, and I'd expect us to refine the approach as we test with complex, long, and broken posts. The key is to balance readability of the HTML with the presence of structural hints.

@youknowriad
Copy link
Contributor Author

In d427daf I updated the PR to add the block type to the closing comment. I also created another "example post" to avoid breaking the prototypes that rely on the old format.

@nylen
Copy link
Member

nylen commented Mar 22, 2017

A couple of questions about this, which may be easily answered due to my lack of knowledge of pegjs:

  • How do we expect plugins to extend this grammar with their own block structure? Will this parsing layer include the HTML parsing needed to extract information about block content?
  • Does this functionality need tests? What is the best way to achieve this?

@youknowriad
Copy link
Contributor Author

How do we expect plugins to extend this grammar with their own block structure? Will this parsing layer include the HTML parsing needed to extract information about block content?

This is the second part of the parse I'm talking about above. There are different approaches.

  • I'm leaning to add a parse function each block API needs to implement to extract the state from the HTML. The grammar can not be extended per block. (See per-block prototype for an example of this)

  • @aduth and other are proposing to have all the state data stored elsewhere (attributes or post-meta), so no need to parse the HTML which becomes only an output. (serialization of the state)

Does this functionality need tests? What is the best way to achieve this?

We can add unit tests ( parse(string) => JS and check the JS) but I'm not sure this is relevant since the code is generated.

@youknowriad youknowriad dismissed nylen’s stale review March 22, 2017 14:12

Closing comment updated

@youknowriad youknowriad force-pushed the add/grammar-parsing branch from 479d175 to ccc6a7b Compare March 22, 2017 14:39
@aduth
Copy link
Member

aduth commented Mar 23, 2017

The manual revisions to allow importing the CommonJS PEG output was bothering me since it seemed a pain to maintain. Tracked this back to #4039 which doesn't have much of a workaround.

Instead, I found a Webpack loader for PEG.js which seems much more convenient because it avoids a separate build for the grammar (import directly from grammar). Only complication is that it was skipping the Babelification step, so I had to update the grammar a bit to downgrade ES6 to ES5. There might be another way to fix this, but it was a reasonable compromise I thought.

In any case, I'm feeling good with this as a step forward. Feel free to merge in your morning 👍

@aduth
Copy link
Member

aduth commented Mar 23, 2017

Oh, lame, it appears the changes in a5c83e8 broke the build for the exact reason we wanted Webpack involved in Mocha tests in #289, namely since Mocha alone is incapable of understanding loaders.

Unless we can get that working, maybe best to revert back to the original approach for now.

@youknowriad
Copy link
Contributor Author

@aduth What if we avoid loading this file on the tests, by moving the "registration functions" into a separate file and leave all these "webpack" specific loaders to the index.js

@youknowriad youknowriad force-pushed the add/grammar-parsing branch from 9aabad0 to 7e045d4 Compare March 23, 2017 08:16
@youknowriad
Copy link
Contributor Author

Tests fixed, merging this for now. we can revisit later of we discover any limit to this parsing approach

@youknowriad youknowriad merged commit c6995cd into master Mar 23, 2017
@youknowriad youknowriad deleted the add/grammar-parsing branch March 23, 2017 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Framework Issues related to broader framework topics, especially as it relates to javascript
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants