-
Notifications
You must be signed in to change notification settings - Fork 384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support content after HTML end tag #4297
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A result match may not be found when looking for html_end
, as shown by this failed Travis build.
Co-Authored-By: Pierre Gordon <16200219+pierlon@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fixing the destruction of the <body>
when there is a comment after the </html>
.
However, it is not fixing the problem when there is a comment after the </body>
but before the </body>
.
Also, the comments should maintain to their original relative location, which is important for source stack tracing.
In Twenty Twenty's </body>
<!-- a comment! -->
</html> Then the result is a |
Accidentally closed when merging #4299. |
I think that particular comment is not being moved by the normalization, but by PHP's |
Oh, no, I'm mistaken, it is indeed moved by the normalization. I'll fix it. |
I think I'll hav to recode the normalization to tokenize first and then iterate over tokens. While trying to make more robust tests for comment positioning I used this HTML: <!-- before <!doctype> --><!DOCTYPE html><!-- before <html> --><html><!-- before <head> --><head><!-- first within <head> --><meta charset="utf-8"><!-- last within <head> --></head><!-- before <body> --><body class="something" data-something="something"><!-- within <body> --></body><!-- after </body> --></html><!-- after </html> --> And lo and behold, I get this test result:
This means that, of course, my regular expressions are also triggered by tags that are found within comments. I don't think there's a clean and performant way of actually ignoring these comments with a singular regular expression. So I'll rewrite this to tokenize everything first and then rearrange. I don't need to tokenize every single tag, only the big pieces like first-level comments, body, head, etc... |
Maybe not all hope is lost. A large part of the problem is the |
@westonruter This is what the code currently supports and keeps intext in terms of comment ordering: <!DOCTYPE html><!-- before <html> --><html><!-- before <head> --><head><meta charset="utf-8"><!-- within <head> --></head><!-- before <body> --><body class="something" data-something="something"><!-- within <body> --></body><!-- after </body> --></html><!-- after </html> --> What doesn't work yet is:
For 1., it would be cumbersome to fix, as we'd have to store additional markup in some form and then re-establish order after parsing. For these two, do you want me to invest time in fixing them, or as these not worth it? |
@schlessera in the case of the doctype, it probably doesn't matter because we force it to be the HTML5 version anyway, thus there is no possibility for a validation error. So if a source stack open comment gets moved to be after the doctype and then is immediately followed by the source stack closing comment, then that is fine. I suppose the same could be said for the meta charset. As long as the source stack comments retain their relative position to each other (so a closing comment doesn't come before an opening one) then if the markup inside being annotated gone because we force it to be something, then I think it's okay. Aside: this brings up something important for the Optimizer work. We need to make sure that the Optimizer does not run during validation requests, since the reordering of elements in the |
|
@schlessera I tried modifying Twenty Twenty's </body>
</html>
<!--after html!--> But when accessing the AMP page, I see: <!--after html!--></body>
</html> That doesn't seem right? |
Maybe I messed it up during the conflict resolution. I'll have a look at it. |
So the test case for Can you confirm that you're testing the exact PR in its latest state, and that your |
Nevermind, I can confirm it happens on my site too. |
It's caused by this piece of code: https://github.com/ampproject/amp-wp/blob/1.4.4/includes/class-amp-theme-support.php#L2281-L2295 |
In that case I think what you have may be correct. The important thing is maintaining relative position to other markup. So if there is some action that is done after the closing HTML tag, then the markup including comments should get moved inside the body with their order preserved. |
@@ -76,7 +76,7 @@ final class Document extends DOMDocument | |||
* | |||
* @var string | |||
*/ | |||
const AMP_BIND_ATTR_PATTERN = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?:"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#'; | |||
const AMP_BIND_ATTR_PATTERN = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?>"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL regex atomic grouping!
includes/class-amp-theme-support.php
Outdated
// Move anything after </html>, such as Query Monitor output added at shutdown, to be moved before </body>. | ||
while ( $dom->documentElement->nextSibling ) { | ||
/* | ||
* Move any non-comment elements after </html>, such as Query Monitor output added at shutdown, to be moved | ||
* before </body>. | ||
*/ | ||
$next_sibling = $dom->documentElement->nextSibling; | ||
while ( $next_sibling ) { | ||
// Trailing elements after </html> will get wrapped in additional <html> elements. | ||
if ( 'html' === $dom->documentElement->nextSibling->nodeName ) { | ||
while ( $dom->documentElement->nextSibling->firstChild ) { | ||
$dom->body->appendChild( $dom->documentElement->nextSibling->firstChild ); | ||
if ( 'html' === $next_sibling->nodeName ) { | ||
while ( $next_sibling->firstChild ) { | ||
$dom->body->appendChild( $next_sibling->firstChild ); | ||
} | ||
$dom->removeChild( $dom->documentElement->nextSibling ); | ||
} else { | ||
$dom->body->appendChild( $dom->documentElement->nextSibling ); | ||
$dom->removeChild( $next_sibling ); | ||
} elseif ( ! $next_sibling instanceof DOMComment ) { | ||
$dom->body->appendChild( $next_sibling ); | ||
} | ||
$next_sibling = $next_sibling->nextSibling; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This turns out to not be quite right. If element are moved but not comment nodes, then the source stacks can get confused.
Consider a footer.php
ending with:
<?php wp_footer(); ?>
</body>
<?php do_action( 'after_closing_body' ); ?>
</html>
<?php do_action( 'after_closing_html' ); ?>
And a plugin like this:
<?php
/**
* Plugin Name: Try After Actions
*/
add_action( 'after_closing_body', function () {
echo '<script>after_closing_body</script>';
} );
add_action( 'after_closing_html', function () {
echo '<script>after_closing_html</script>';
} );
At the moment this results in validation errors looking like this:
Notice the lack of source identified for the first.
So what I think needs to be done is every node appearing after </body>
and </html>
should be moved to be appended to the body
, regardless of the node type. This will ensure that the source stack comments maintain their relative position with the elements they annotate.
Ultimately it should rather go into \AmpProject\Dom\Document::loadHTML()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Patch incoming…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$head = $this->getElementsByTagName(Tag::HEAD)->item(0); | ||
if (! $head) { | ||
$this->head = $this->createElement(Tag::HEAD); | ||
$this->insertBefore($this->head, $this->firstChild); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, like the body
handling below, appears to have been a bug where it was appending the head
to the document and not to the documentElement
.
Summary
The
normalize_document_structure()
method inDom\Document
didn't support content after the HTML end tag. In that case, it failed to remove the end tag, and the subsequent manipulations to normalize further resulted in an overwritten tag that lost all attributes.This PR changes the regular expressions and the reassembly behavior so that comments are properly kept intact and in the correct ordering across the HTML structure.
It also changes most regular expressions to use atomic groups where needed to avoid any backtracking and improve regex performance.
Fixes #4282
Checklist