Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support content after HTML end tag #4297

Merged
merged 13 commits into from
Mar 7, 2020

Conversation

schlessera
Copy link
Collaborator

@schlessera schlessera commented Feb 15, 2020

Summary

The normalize_document_structure() method in Dom\Document didn't support content after the HTML end tag. In that case, it failed to remove the end tag, and the subsequent manipulations to normalize further resulted in an overwritten tag that lost all attributes.

This PR changes the regular expressions and the reassembly behavior so that comments are properly kept intact and in the correct ordering across the HTML structure.

It also changes most regular expressions to use atomic groups where needed to avoid any backtracking and improve regex performance.

Fixes #4282

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

@googlebot googlebot added the cla: yes Signed the Google CLA label Feb 15, 2020
@schlessera schlessera added Sanitizers Bug Something isn't working labels Feb 15, 2020
Copy link
Contributor

@pierlon pierlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A result match may not be found when looking for html_end, as shown by this failed Travis build.

Co-Authored-By: Pierre Gordon <16200219+pierlon@users.noreply.github.com>
Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixing the destruction of the <body> when there is a comment after the </html>.

However, it is not fixing the problem when there is a comment after the </body> but before the </body>.

Also, the comments should maintain to their original relative location, which is important for source stack tracing.

@westonruter
Copy link
Member

However, it is not fixing the problem when there is a comment after the </body> but before the </body>.

In Twenty Twenty's footer.php, if I change the end to be:

	</body>
<!-- a comment! -->
</html>

Then the result is a body tag that looks like <body id="body">, where the class attribute is lost.

@westonruter
Copy link
Member

Accidentally closed when merging #4299.

@westonruter westonruter reopened this Feb 17, 2020
@westonruter westonruter added this to the v1.5 milestone Feb 17, 2020
@schlessera
Copy link
Collaborator Author

Also, the comments should maintain to their original relative location, which is important for source stack tracing.

I think that particular comment is not being moved by the normalization, but by PHP's DOMDocument. I'll look into it.

@schlessera
Copy link
Collaborator Author

Oh, no, I'm mistaken, it is indeed moved by the normalization. I'll fix it.

@schlessera
Copy link
Collaborator Author

I think I'll hav to recode the normalization to tokenize first and then iterate over tokens.

While trying to make more robust tests for comment positioning I used this HTML:

<!-- before <!doctype> --><!DOCTYPE html><!-- before <html> --><html><!-- before <head> --><head><!-- first within <head> --><meta charset="utf-8"><!-- last within <head> --></head><!-- before <body> --><body class="something" data-something="something"><!-- within <body> --></body><!-- after </body> --></html><!-- after </html> -->

And lo and behold, I get this test result:

-    1 => '<!-- before <!doctype>'
-    3 => ' --'
-    4 => '>'
-    5 => '<!DOCTYPE html>'
-    7 => '<!-- before <html>'
-    9 => ' --'
-    10 => '>'
-    11 => '<html>'
-    13 => '<!-- before <head>'
+    1 => '<!DOCTYPE html>'
+    3 => '<html>'
+    5 => '<head>'
+    7 => '<meta charset="utf-8">'
+    9 => '</head>'
+    11 => '<body>'
+    13 => '<!-- before <!doctype>'
     15 => ' --'
     16 => '>'
-    17 => '<head>'
-    19 => '<!-- first within <head>'
-    21 => ' --'
-    22 => '>'
-    23 => '<meta charset="utf-8">'
-    25 => '<!-- last within <head>'
+    17 => '<!-- before <html>'
+    19 => ' --'
+    21 => '<!-- before <head>'
+    23 => ' --'
+    25 => '<!-- first within <head>'
     27 => ' --'
     28 => '>'
-    29 => '</head>'
-    31 => '<!-- before <body>'
-    33 => ' --'
-    34 => '>'
-    35 => '<body class="something" data-something="something">'
+    29 => '<!-- last within <head>'
+    31 => ' --'
+    33 => '<!-- before <body>'
+    35 => ' --'
     37 => '<!-- within <body>'
     39 => ' --'
     40 => '>'
-    41 => '</body>'
-    43 => '<!-- after </body>'
-    45 => ' --'
-    46 => '>'
-    47 => '</html>'
-    49 => '<!-- after </html>'
-    51 => ' --'
-    52 => '>'
+    41 => '<!-- after </body>'
+    43 => ' --'
+    45 => '<!-- after </html>'
+    47 => ' --'
+    49 => '</body>'
+    51 => '</html>'
+    20 => '>'
+    24 => '>'
+    32 => '>'
+    36 => '>'
+    44 => '>'
+    48 => '>'

This means that, of course, my regular expressions are also triggered by tags that are found within comments. I don't think there's a clean and performant way of actually ignoring these comments with a singular regular expression. So I'll rewrite this to tokenize everything first and then rearrange. I don't need to tokenize every single tag, only the big pieces like first-level comments, body, head, etc...
That should make it robust enough to deal with the comments properly.

@schlessera
Copy link
Collaborator Author

Maybe not all hope is lost. A large part of the problem is the assertEqualMarkup which splits up the comments. I'll first try to salvage the existing code therefore.

@schlessera
Copy link
Collaborator Author

schlessera commented Feb 21, 2020

@westonruter This is what the code currently supports and keeps intext in terms of comment ordering:

<!DOCTYPE html><!-- before <html> --><html><!-- before <head> --><head><meta charset="utf-8"><!-- within <head> --></head><!-- before <body> --><body class="something" data-something="something"><!-- within <body> --></body><!-- after </body> --></html><!-- after </html> -->

What doesn't work yet is:

  1. a comment before the <!doctype>, as the DOMDocument moves it to be the very first tag automatically;
  2. a comment as first node of <head>, as we currently enforce our <meta charset="utf-8"> to be the very first node within <head>.

For 1., it would be cumbersome to fix, as we'd have to store additional markup in some form and then re-establish order after parsing.
For 2., it would mean making the code to deal with the charset a bit more complex to keep the charset intact but adapt the actual encoding stored in it, as opposed to removing whatever charset and re-adding a normalized one as we do now.

For these two, do you want me to invest time in fixing them, or as these not worth it?

@westonruter
Copy link
Member

@schlessera in the case of the doctype, it probably doesn't matter because we force it to be the HTML5 version anyway, thus there is no possibility for a validation error. So if a source stack open comment gets moved to be after the doctype and then is immediately followed by the source stack closing comment, then that is fine.

I suppose the same could be said for the meta charset.

As long as the source stack comments retain their relative position to each other (so a closing comment doesn't come before an opening one) then if the markup inside being annotated gone because we force it to be something, then I think it's okay.

Aside: this brings up something important for the Optimizer work. We need to make sure that the Optimizer does not run during validation requests, since the reordering of elements in the head will probably cause elements to lose the source stack context.

@westonruter
Copy link
Member

westonruter commented Mar 7, 2020

Reader Ready for conflict resolution. Take note of #4333 (comment)

@westonruter
Copy link
Member

@schlessera I tried modifying Twenty Twenty's footer.php to:

	</body>
</html>
<!--after html!-->

But when accessing the AMP page, I see:

	<!--after html!--></body>
</html>

That doesn't seem right?

@schlessera
Copy link
Collaborator Author

Maybe I messed it up during the conflict resolution. I'll have a look at it.

@schlessera
Copy link
Collaborator Author

So the test case for Dom\Document covers this case and passes. This means it would be caused somewhere else.

Can you confirm that you're testing the exact PR in its latest state, and that your vendor folders point to the right files?

@schlessera
Copy link
Collaborator Author

Nevermind, I can confirm it happens on my site too.

@schlessera
Copy link
Collaborator Author

@westonruter
Copy link
Member

In that case I think what you have may be correct. The important thing is maintaining relative position to other markup. So if there is some action that is done after the closing HTML tag, then the markup including comments should get moved inside the body with their order preserved.

@@ -76,7 +76,7 @@ final class Document extends DOMDocument
*
* @var string
*/
const AMP_BIND_ATTR_PATTERN = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?:"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#';
const AMP_BIND_ATTR_PATTERN = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?>"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL regex atomic grouping!

Comment on lines 2186 to 2201
// Move anything after </html>, such as Query Monitor output added at shutdown, to be moved before </body>.
while ( $dom->documentElement->nextSibling ) {
/*
* Move any non-comment elements after </html>, such as Query Monitor output added at shutdown, to be moved
* before </body>.
*/
$next_sibling = $dom->documentElement->nextSibling;
while ( $next_sibling ) {
// Trailing elements after </html> will get wrapped in additional <html> elements.
if ( 'html' === $dom->documentElement->nextSibling->nodeName ) {
while ( $dom->documentElement->nextSibling->firstChild ) {
$dom->body->appendChild( $dom->documentElement->nextSibling->firstChild );
if ( 'html' === $next_sibling->nodeName ) {
while ( $next_sibling->firstChild ) {
$dom->body->appendChild( $next_sibling->firstChild );
}
$dom->removeChild( $dom->documentElement->nextSibling );
} else {
$dom->body->appendChild( $dom->documentElement->nextSibling );
$dom->removeChild( $next_sibling );
} elseif ( ! $next_sibling instanceof DOMComment ) {
$dom->body->appendChild( $next_sibling );
}
$next_sibling = $next_sibling->nextSibling;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to not be quite right. If element are moved but not comment nodes, then the source stacks can get confused.

Consider a footer.php ending with:

		<?php wp_footer(); ?>

	</body>
	<?php do_action( 'after_closing_body' ); ?>
</html>
<?php do_action( 'after_closing_html' ); ?>

And a plugin like this:

<?php
/**
 * Plugin Name: Try After Actions
 */

add_action( 'after_closing_body', function () {
	echo '<script>after_closing_body</script>';
} );

add_action( 'after_closing_html', function () {
	echo '<script>after_closing_html</script>';
} );

At the moment this results in validation errors looking like this:

image

Notice the lack of source identified for the first.

So what I think needs to be done is every node appearing after </body> and </html> should be moved to be appended to the body, regardless of the node type. This will ensure that the source stack comments maintain their relative position with the elements they annotate.

Ultimately it should rather go into \AmpProject\Dom\Document::loadHTML()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patch incoming…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See e4f10d1 and 62fd552

$head = $this->getElementsByTagName(Tag::HEAD)->item(0);
if (! $head) {
$this->head = $this->createElement(Tag::HEAD);
$this->insertBefore($this->head, $this->firstChild);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, like the body handling below, appears to have been a bug where it was appending the head to the document and not to the documentElement.

@westonruter westonruter merged commit c2733af into develop Mar 7, 2020
@westonruter westonruter deleted the 4282-support-content-after-html-end-tag branch March 7, 2020 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working cla: yes Signed the Google CLA Sanitizers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failure to normalize documents with HTML comments after </body>
4 participants