Support content after HTML end tag #4297

schlessera · 2020-02-15T16:50:46Z

Summary

The normalize_document_structure() method in Dom\Document didn't support content after the HTML end tag. In that case, it failed to remove the end tag, and the subsequent manipulations to normalize further resulted in an overwritten tag that lost all attributes.

This PR changes the regular expressions and the reassembly behavior so that comments are properly kept intact and in the correct ordering across the HTML structure.

It also changes most regular expressions to use atomic groups where needed to avoid any backtracking and improve regex performance.
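
As a rough illustration of what an atomic group changes (the subject string and patterns below are made up for the demonstration and are not taken from this PR), compare a non-capturing group with an atomic one:

```php
<?php
// Hypothetical subject and patterns, chosen only to make backtracking visible.
$subject = 'aaab';

// Non-capturing group: `a+` first takes "aaa", then gives back one "a"
// so that the literal "ab" can still match.
$withBacktracking = preg_match('/(?:a+)ab/', $subject);

// Atomic group: once `a+` has matched, the engine may not give anything back,
// so the literal "ab" can never match after it.
$withoutBacktracking = preg_match('/(?>a+)ab/', $subject);

var_dump($withBacktracking, $withoutBacktracking);
```

The atomic variant fails fast instead of retrying shorter matches, which is the performance benefit. It is only safe where the discarded backtracking positions could never have led to a valid match.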

Fixes #4282

Checklist

My pull request is addressing an open issue (please create one otherwise).
My code is tested and passes existing tests.
My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

pierlon

A result match may not be found when looking for html_end, as shown by this failed Travis build.

src/Dom/Document.php

Co-Authored-By: Pierre Gordon <16200219+pierlon@users.noreply.github.com>

westonruter

This is fixing the destruction of the <body> when there is a comment after the </html>.

However, it is not fixing the problem when there is a comment after the </body> but before the </html>.

Also, the comments should maintain their original relative location, which is important for source stack tracing.

westonruter · 2020-02-15T20:41:50Z

However, it is not fixing the problem when there is a comment after the </body> but before the </html>.

In Twenty Twenty's footer.php, if I change the end to be:

	</body>
<!-- a comment! -->
</html>

Then the result is a body tag that looks like <body id="body">, where the class attribute is lost.

westonruter · 2020-02-17T19:18:52Z

Accidentally closed when merging #4299.

schlessera · 2020-02-21T07:44:46Z

Also, the comments should maintain their original relative location, which is important for source stack tracing.

I think that particular comment is not being moved by the normalization, but by PHP's DOMDocument. I'll look into it.

schlessera · 2020-02-21T07:47:29Z

Oh, no, I'm mistaken, it is indeed moved by the normalization. I'll fix it.

schlessera · 2020-02-21T08:04:41Z

I think I'll have to recode the normalization to tokenize first and then iterate over tokens.

While trying to make more robust tests for comment positioning I used this HTML:

<!-- before <!doctype> --><!DOCTYPE html><!-- before <html> --><html><!-- before <head> --><head><!-- first within <head> --><meta charset="utf-8"><!-- last within <head> --></head><!-- before <body> --><body class="something" data-something="something"><!-- within <body> --></body><!-- after </body> --></html><!-- after </html> -->

And lo and behold, I get this test result:

-    1 => '<!-- before <!doctype>'
-    3 => ' --'
-    4 => '>'
-    5 => '<!DOCTYPE html>'
-    7 => '<!-- before <html>'
-    9 => ' --'
-    10 => '>'
-    11 => '<html>'
-    13 => '<!-- before <head>'
+    1 => '<!DOCTYPE html>'
+    3 => '<html>'
+    5 => '<head>'
+    7 => '<meta charset="utf-8">'
+    9 => '</head>'
+    11 => '<body>'
+    13 => '<!-- before <!doctype>'
     15 => ' --'
     16 => '>'
-    17 => '<head>'
-    19 => '<!-- first within <head>'
-    21 => ' --'
-    22 => '>'
-    23 => '<meta charset="utf-8">'
-    25 => '<!-- last within <head>'
+    17 => '<!-- before <html>'
+    19 => ' --'
+    21 => '<!-- before <head>'
+    23 => ' --'
+    25 => '<!-- first within <head>'
     27 => ' --'
     28 => '>'
-    29 => '</head>'
-    31 => '<!-- before <body>'
-    33 => ' --'
-    34 => '>'
-    35 => '<body class="something" data-something="something">'
+    29 => '<!-- last within <head>'
+    31 => ' --'
+    33 => '<!-- before <body>'
+    35 => ' --'
     37 => '<!-- within <body>'
     39 => ' --'
     40 => '>'
-    41 => '</body>'
-    43 => '<!-- after </body>'
-    45 => ' --'
-    46 => '>'
-    47 => '</html>'
-    49 => '<!-- after </html>'
-    51 => ' --'
-    52 => '>'
+    41 => '<!-- after </body>'
+    43 => ' --'
+    45 => '<!-- after </html>'
+    47 => ' --'
+    49 => '</body>'
+    51 => '</html>'
+    20 => '>'
+    24 => '>'
+    32 => '>'
+    36 => '>'
+    44 => '>'
+    48 => '>'

This means that, of course, my regular expressions are also triggered by tags that are found within comments. I don't think there's a clean and performant way of actually ignoring these comments with a single regular expression. So I'll rewrite this to tokenize everything first and then rearrange. I don't need to tokenize every single tag, only the big pieces like first-level comments, body, head, etc.
That should make it robust enough to deal with the comments properly.
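
A minimal sketch of the tokenize-first idea, assuming a simplified input and a hypothetical split pattern (this is not the code that ended up in the PR, and the tag alternative would break on attribute values containing `>`):

```php
<?php
// Hypothetical input; note the tag-like text inside both comments.
$html = '<!-- before <html> --><html><body class="a"></body></html><!-- after </html> -->';

// A naive tag search is triggered by the "<html>" inside the first comment.
preg_match('/<html[^>]*>/i', $html, $naive, PREG_OFFSET_CAPTURE);
$naiveOffset = $naive[0][1];

// Listing the comment alternative first lets a whole comment be consumed as
// one token, so tag-like text inside it is never seen by the tag alternative.
$tokens = preg_split(
    '/(<!--.*?-->|<[^>]+>)/s',
    $html,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($tokens);
```

Once the markup is split this way, the structural pieces can be rearranged by iterating over the tokens, with each comment treated as an opaque unit.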

schlessera · 2020-02-21T08:44:39Z

Maybe not all hope is lost. A large part of the problem is assertEqualMarkup, which splits up the comments. So I'll first try to salvage the existing code.

schlessera · 2020-02-21T09:34:37Z

@westonruter This is what the code currently supports and keeps intact in terms of comment ordering:

<!DOCTYPE html><!-- before <html> --><html><!-- before <head> --><head><meta charset="utf-8"><!-- within <head> --></head><!-- before <body> --><body class="something" data-something="something"><!-- within <body> --></body><!-- after </body> --></html><!-- after </html> -->

What doesn't work yet is:

1. a comment before the <!doctype>, as the DOMDocument moves it to be the very first tag automatically;
2. a comment as first node of <head>, as we currently enforce our <meta charset="utf-8"> to be the very first node within <head>.

For 1., it would be cumbersome to fix, as we'd have to store additional markup in some form and then re-establish order after parsing.
For 2., it would mean making the code to deal with the charset a bit more complex to keep the charset intact but adapt the actual encoding stored in it, as opposed to removing whatever charset and re-adding a normalized one as we do now.

For these two, do you want me to invest time in fixing them, or are these not worth it?

westonruter · 2020-03-04T15:36:54Z

@schlessera in the case of the doctype, it probably doesn't matter because we force it to be the HTML5 version anyway, thus there is no possibility for a validation error. So if a source stack open comment gets moved to be after the doctype and then is immediately followed by the source stack closing comment, then that is fine.

I suppose the same could be said for the meta charset.

As long as the source stack comments retain their relative position to each other (so a closing comment doesn't come before an opening one), then if the markup being annotated is gone because we force it to be something, I think it's okay.

Aside: this brings up something important for the Optimizer work. We need to make sure that the Optimizer does not run during validation requests, since the reordering of elements in the head will probably cause elements to lose the source stack context.

westonruter · 2020-03-07T06:16:34Z

~~Reader~~ Ready for conflict resolution. Take note of #4333 (comment)

westonruter · 2020-03-07T08:04:50Z

@schlessera I tried modifying Twenty Twenty's footer.php to:

	</body>
</html>
<!--after html!-->

But when accessing the AMP page, I see:

	<!--after html!--></body>
</html>

That doesn't seem right?

schlessera · 2020-03-07T08:26:25Z

Maybe I messed it up during the conflict resolution. I'll have a look at it.

schlessera · 2020-03-07T09:11:49Z

So the test case for Dom\Document covers this case and passes. This means it would be caused somewhere else.

Can you confirm that you're testing the exact PR in its latest state, and that your vendor folders point to the right files?

schlessera · 2020-03-07T09:14:03Z

Never mind, I can confirm it happens on my site too.

schlessera · 2020-03-07T09:17:02Z

It's caused by this piece of code: https://github.com/ampproject/amp-wp/blob/1.4.4/includes/class-amp-theme-support.php#L2281-L2295

westonruter · 2020-03-07T14:26:38Z

In that case I think what you have may be correct. The important thing is maintaining relative position to other markup. So if there is some action that is done after the closing HTML tag, then the markup including comments should get moved inside the body with their order preserved.

westonruter · 2020-03-07T07:40:40Z

lib/common/src/Dom/Document.php

@@ -76,7 +76,7 @@ final class Document extends DOMDocument
     *
     * @var string
     */
-    const AMP_BIND_ATTR_PATTERN = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?:"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#';
+    const AMP_BIND_ATTR_PATTERN = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?>"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#';


TIL regex atomic grouping!
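
For reference, a small sketch of how the updated pattern behaves when applied to an attribute string. The pattern is copied from the diff above; the input string is a hypothetical example:

```php
<?php
// Pattern copied from the diff above (the version using the atomic group).
$pattern = '#^\s+(?P<name>\[?[a-zA-Z0-9_\-]+\]?)(?P<value>=(?>"[^"]*+"|\'[^\']*+\'|[^\'"\s]+))?#';

// Hypothetical attribute string, as it might appear right after a tag name.
$attributes = ' [text]="foo.bar" class="x"';

// The pattern is anchored, so it consumes only the first attribute.
if (preg_match($pattern, $attributes, $matches)) {
    echo $matches['name'], PHP_EOL;
    echo $matches['value'], PHP_EOL;
}
```

The named groups capture the attribute name including its square brackets, and the value including the leading equals sign and quotes.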

westonruter · 2020-03-07T19:48:43Z

includes/class-amp-theme-support.php

-		// Move anything after </html>, such as Query Monitor output added at shutdown, to be moved before </body>.
-		while ( $dom->documentElement->nextSibling ) {
+		/*
+		 * Move any non-comment elements after </html>, such as Query Monitor output added at shutdown, to be moved
+		 * before </body>.
+		 */
+		$next_sibling = $dom->documentElement->nextSibling;
+		while ( $next_sibling ) {
 			// Trailing elements after </html> will get wrapped in additional <html> elements.
-			if ( 'html' === $dom->documentElement->nextSibling->nodeName ) {
-				while ( $dom->documentElement->nextSibling->firstChild ) {
-					$dom->body->appendChild( $dom->documentElement->nextSibling->firstChild );
+			if ( 'html' === $next_sibling->nodeName ) {
+				while ( $next_sibling->firstChild ) {
+					$dom->body->appendChild( $next_sibling->firstChild );
 				}
-				$dom->removeChild( $dom->documentElement->nextSibling );
-			} else {
-				$dom->body->appendChild( $dom->documentElement->nextSibling );
+				$dom->removeChild( $next_sibling );
+			} elseif ( ! $next_sibling instanceof DOMComment ) {
+				$dom->body->appendChild( $next_sibling );
 			}
+			$next_sibling = $next_sibling->nextSibling;


This turns out to not be quite right. If elements are moved but comment nodes are not, then the source stacks can get confused.

Consider a footer.php ending with:

<?php wp_footer(); ?>
</body>
<?php do_action( 'after_closing_body' ); ?>
</html>
<?php do_action( 'after_closing_html' ); ?>

And a plugin like this:

<?php
/**
 * Plugin Name: Try After Actions
 */

add_action(
	'after_closing_body',
	function () {
		echo '<script>after_closing_body</script>';
	}
);

add_action(
	'after_closing_html',
	function () {
		echo '<script>after_closing_html</script>';
	}
);

At the moment this results in validation errors looking like this:

Notice the lack of source identified for the first.

So what I think needs to be done is every node appearing after </body> and </html> should be moved to be appended to the body, regardless of the node type. This will ensure that the source stack comments maintain their relative position with the elements they annotate.
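
A minimal sketch of that idea, assuming a hand-built document with hypothetical comment contents (the actual patches are e4f10d1 and 62fd552 and may differ). Only comment nodes are used as trailing content here to keep the example deterministic, but the loop itself does not check the node type:

```php
<?php
$dom  = new DOMDocument();
$html = $dom->appendChild($dom->createElement('html'));
$body = $html->appendChild($dom->createElement('body'));
$body->appendChild($dom->createElement('p', 'content'));

// Simulate output after </html>: two hypothetical source-stack comments.
$dom->appendChild($dom->createComment('source-stack open'));
$dom->appendChild($dom->createComment('source-stack close'));

// Move every node that follows the root element into the body, whatever its
// type. appendChild() detaches the node from the document first, so the
// nextSibling lookup advances on each iteration and the order is preserved.
while ($dom->documentElement->nextSibling) {
    $body->appendChild($dom->documentElement->nextSibling);
}

echo $dom->saveHTML();
```

Because nodes are appended in document order, an opening source-stack comment still precedes its closing counterpart after the move.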

Ultimately it should rather go into \AmpProject\Dom\Document::loadHTML()

Patch incoming…

See e4f10d1 and 62fd552

westonruter · 2020-03-07T21:45:52Z

lib/common/src/Dom/Document.php

-        $head = $this->getElementsByTagName(Tag::HEAD)->item(0);
-        if (! $head) {
-            $this->head = $this->createElement(Tag::HEAD);
-            $this->insertBefore($this->head, $this->firstChild);


This, like the body handling below, appears to have been a bug where it was appending the head to the document and not to the documentElement.
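
A sketch of the distinction, using a hand-built document and plain tag-name strings instead of the Tag constants (an assumption for illustration, not the code from the commit): the missing head is inserted into the documentElement rather than into the document itself.

```php
<?php
$dom  = new DOMDocument();
$html = $dom->appendChild($dom->createElement('html'));
$html->appendChild($dom->createElement('body'));

$head = $dom->getElementsByTagName('head')->item(0);
if (! $head) {
    $head = $dom->createElement('head');
    // Insert into <html>, not into the document node, so that <head>
    // becomes the first child of the root element.
    $dom->documentElement->insertBefore($head, $dom->documentElement->firstChild);
}

echo $dom->saveHTML();
```

Inserting at the document level instead would leave the head as a sibling of <html> rather than a child of it.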

Support content after html end tag

c0e2bf9

googlebot added the cla: yes Signed the Google CLA label Feb 15, 2020

schlessera requested review from westonruter, kienstra and pierlon February 15, 2020 16:50

schlessera added Sanitizers Bug Something isn't working labels Feb 15, 2020

pierlon suggested changes Feb 15, 2020

Update src/Dom/Document.php

85a448e

Co-Authored-By: Pierre Gordon <16200219+pierlon@users.noreply.github.com>

pierlon approved these changes Feb 15, 2020

schlessera mentioned this pull request Feb 15, 2020

Load block styles via amp_post_template_head action #4299

Merged

3 tasks

westonruter requested changes Feb 15, 2020

westonruter closed this in #4299 Feb 17, 2020

westonruter reopened this Feb 17, 2020

westonruter added this to the v1.5 milestone Feb 17, 2020

Preserve comment positions (without breaking structure)

ea1f630

Deal with whitespace around comments

bdb3b32

pierlon mentioned this pull request Feb 21, 2020

HTML outputted before the <!DOCTYPE> declaration produces invalid AMP page #4321

Closed

schlessera requested a review from westonruter February 22, 2020 05:43

pierlon mentioned this pull request Feb 25, 2020

Add AMP Optimizer library #4019

Merged

9 tasks

Merge branch 'develop' into 4282-support-content-after-html-end-tag

30a61c9

schlessera added 3 commits March 7, 2020 08:14

Merge branch 'develop' into 4282-support-content-after-html-end-tag

d0b8f82

Merge changes into moved Dom\Document class

455b094

Merge changes from #4193

e34c9fc

Skip moving comments outside of html tags

4dea305

westonruter reviewed Mar 7, 2020

westonruter added 2 commits March 7, 2020 11:54

Move any nodes appearing after </body> to be appended to the body

e4f10d1

Fix and harden normalizeDomStructure when document lacks <html>

62fd552

westonruter approved these changes Mar 7, 2020

westonruter added 2 commits March 7, 2020 13:22

Fix phpcs issues in common and optimizer libs; include tests dir in c…

1f9454f

…s/cbf scripts

Add missing namespace to common lib tests

a700be2

westonruter reviewed Mar 7, 2020

westonruter merged commit c2733af into develop Mar 7, 2020

westonruter deleted the 4282-support-content-after-html-end-tag branch March 7, 2020 22:48

westonruter mentioned this pull request Mar 7, 2020

Prevent removal of closing table and td tags in script[template="amp-mustache"] #4333

Merged

3 tasks

westonruter added the Changelogged label Mar 23, 2020
