Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Return elements pushed and popped rather than tags read. #6348

Closed
wants to merge 10 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Apr 3, 2024

Trac ticket: Core-61348

Summary

Creates virtual nodes when pushing to and popping from the stack of open elements. It's these nodes that are returned by next_tag(), while subclassed methods intercept tag information, all within the HTML Processor.

Splitting time!

  • Add get_current_depth() to return the depth of the currently-matched element in the stack of open elements. This needs to account for marker and other not-yet-implemented items in the stack.
  • Add expects_closer() or similar function to indicate if the currently-matched element needs or expects a closing element.

Questions

  • What do we do with void tags?
    • There are tradeoffs in how to represent tags without a closing tag (be they void elements or self-closing elements within foreign content or special atomic elements like SCRIPT). We can represent a unique open and close tag/event for them, or only expose the opening.
      • In exposing an open and a close there's a very straightforward and universal model of sequential open and close events for every tag.
      • It's weird to have a closing tag for an IMG.
    • I think that the value in having a close tag is outweighed by the awkwardness of it. For situations where we're doing things like tracking depth, we should rely on an internal method like get_depth() from the HTML API itself instead of exporting that HTML nuance onto the caller.

Related work

Examples

Convert to Markdown Output
Screenshot 2024-04-03 at 6 15 43 PM Screenshot 2024-04-03 at 6 17 06 PM
Normalize HTML Output
Screenshot 2024-04-03 at 6 15 43 PM Screenshot 2024-04-03 at 6 18 23 PM

cc: @sirreal

@dmsnell dmsnell requested review from ockham and adamziel April 3, 2024 08:48
@dmsnell dmsnell force-pushed the html-api/fake-balanced-html branch 2 times, most recently from 924f171 to 407a433 Compare April 3, 2024 08:54
Copy link

github-actions bot commented Apr 3, 2024

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the html-api/fake-balanced-html branch from d05ddd4 to 6457a0a Compare May 21, 2024 01:48
@dmsnell
Copy link
Member Author

dmsnell commented May 24, 2024

@sirreal it looks like we're failing tests after seeking, which I believe might be because when jumping back to the start of the document, we wipe out the context node.

@dmsnell dmsnell marked this pull request as ready for review June 2, 2024 17:07
Copy link

github-actions bot commented Jun 2, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, audrasjb, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@@ -424,28 +428,19 @@ public function next_tag( $query = null ) {
continue;
}

if ( ! parent::is_tag_closer() ) {
if ( ! parent::is_tag_closer() || $visit_closers ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also use $this?

Suggested change
if ( ! parent::is_tag_closer() || $visit_closers ) {
if ( ! $this->is_tag_closer() || $visit_closers ) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Can you concoct a test case to fail for it? If not, don't worry - we can just fix it.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a regression here where this HTML:

<?wp-bit hey?>

This is a PI_NODE_LOOKALIKE. Before, get_tag() here would be wp-bit - corresponding to the PI target. After this change get_tag() returns #comment.

Comment on lines +1390 to +1392
if ( isset( $this->current_element ) ) {
return $this->current_element->token->node_name;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what breaks the tag_name for PI lookalikes. The node_name is #comment, but it's a special comment type that has more handling in the Tag Processor:

$tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );
if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
return strtoupper( $tag_name );
}
if (
self::STATE_COMMENT === $this->parser_state &&
self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
) {
return $tag_name;
}

A potential fix is to only return here if we have a "non-special" node:

Suggested change
if ( isset( $this->current_element ) ) {
return $this->current_element->token->node_name;
}
if ( isset( $this->current_element ) && '#' !== $this->current_element->token->node_name[0] ) {
return $this->current_element->token->node_name;
}

@sirreal
Copy link
Member

sirreal commented Jun 3, 2024

The following HTML reaches a P tag at the wrong depth:

EDIT: This is pre-existing behavior.

redacted
<div></p>
$p = WP_HTML_Processor::create_fragment('<div></p>');
$p->next_token();
$p->next_token();
var_dump( $p->get_tag(), $p->get_current_depth(), $p->get_breadcrumbs() );
php shell code:1:
string(1) "P"
php shell code:1:
int(3)
php shell code:1:
array(3) {
  [0] =>
  string(4) "HTML"
  [1] =>
  string(4) "BODY"
  [2] =>
  string(3) "DIV"
}

pento pushed a commit that referenced this pull request Jun 3, 2024
HTML is a kind of short-hand for a DOM structure. This means that there are
many cases in HTML where an element's opening tag or closing tag is missing (or
both). This is because many of the parsing rules imply creating elements in the
DOM which may not exist in the text of the HTML.

The HTML Processor, being the higher-level counterpart to the Tag Processor, is
already aware of these nodes, but since it's inception has not paused on them
when scanning through a document. Instead, these are visible when pausing on a
child of such an element, but otherwise not seen.

In this patch the HTML Processor starts exposing those implicitly-created nodes,
including opening tags, and closing tags, that aren't foudn in the text content
of the HTML input document.

Previously, the sequence of matched tokens when scanning with 
`WP_HTML_Processor::next_token()` would depend on how the HTML document was written,
but with this patch, all semantically equal HTML documents will parse and scan in
the same exact manner, presenting an idealized or "perfect" view of the document
the same way as would occur when traversing a DOM in a browser.

Developed in #6348
Discussed in https://core.trac.wordpress.org/ticket/61348

Props audrasjb, dmsnell, gziolo, jonsurrell.
Fixes #61348.


git-svn-id: https://develop.svn.wordpress.org/trunk@58304 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to WordPress/WordPress that referenced this pull request Jun 3, 2024
HTML is a kind of short-hand for a DOM structure. This means that there are
many cases in HTML where an element's opening tag or closing tag is missing (or
both). This is because many of the parsing rules imply creating elements in the
DOM which may not exist in the text of the HTML.

The HTML Processor, being the higher-level counterpart to the Tag Processor, is
already aware of these nodes, but since it's inception has not paused on them
when scanning through a document. Instead, these are visible when pausing on a
child of such an element, but otherwise not seen.

In this patch the HTML Processor starts exposing those implicitly-created nodes,
including opening tags, and closing tags, that aren't foudn in the text content
of the HTML input document.

Previously, the sequence of matched tokens when scanning with 
`WP_HTML_Processor::next_token()` would depend on how the HTML document was written,
but with this patch, all semantically equal HTML documents will parse and scan in
the same exact manner, presenting an idealized or "perfect" view of the document
the same way as would occur when traversing a DOM in a browser.

Developed in WordPress/wordpress-develop#6348
Discussed in https://core.trac.wordpress.org/ticket/61348

Props audrasjb, dmsnell, gziolo, jonsurrell.
Fixes #61348.

Built from https://develop.svn.wordpress.org/trunk@58304


git-svn-id: http://core.svn.wordpress.org/trunk@57761 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Copy link
Member Author

dmsnell commented Jun 3, 2024

Merged in [58304]
163c3fd

@dmsnell dmsnell closed this Jun 3, 2024
@dmsnell dmsnell deleted the html-api/fake-balanced-html branch June 3, 2024 19:48
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Jun 3, 2024
HTML is a kind of short-hand for a DOM structure. This means that there are
many cases in HTML where an element's opening tag or closing tag is missing (or
both). This is because many of the parsing rules imply creating elements in the
DOM which may not exist in the text of the HTML.

The HTML Processor, being the higher-level counterpart to the Tag Processor, is
already aware of these nodes, but since it's inception has not paused on them
when scanning through a document. Instead, these are visible when pausing on a
child of such an element, but otherwise not seen.

In this patch the HTML Processor starts exposing those implicitly-created nodes,
including opening tags, and closing tags, that aren't foudn in the text content
of the HTML input document.

Previously, the sequence of matched tokens when scanning with 
`WP_HTML_Processor::next_token()` would depend on how the HTML document was written,
but with this patch, all semantically equal HTML documents will parse and scan in
the same exact manner, presenting an idealized or "perfect" view of the document
the same way as would occur when traversing a DOM in a browser.

Developed in WordPress/wordpress-develop#6348
Discussed in https://core.trac.wordpress.org/ticket/61348

Props audrasjb, dmsnell, gziolo, jonsurrell.
Fixes #61348.

Built from https://develop.svn.wordpress.org/trunk@58304


git-svn-id: https://core.svn.wordpress.org/trunk@57761 1a063a9b-81f0-0310-95a4-ce76da25c4cd
*
* @param Closure $handler The handler function.
*/
public function set_pop_handler( Closure $handler ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is specifically a Closure and isn't just more generally a callable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closures cannot be serialized or deserialized, meaning that there's no possible way to prep a database record with user input that sets something unexpected here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants