Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep track of the currently-open elements while walking over DOM and use depth-first post-order traversal to improve processing #3243

Merged

Conversation

westonruter
Copy link
Member

@westonruter westonruter commented Sep 15, 2019

This PR builds upon #3206 to simplify the logic for the AMP_Tag_And_Attribute_Sanitizer. Instead of introducing a new context construct which is added to the stack/queue, this PR replaces the contexts with a true recursive walking over the DOM tree, and keeping track of the current number of open elements by type at any given point. The number of open elements of given names can then be used not only to determine the context for whether or not attribute validation is currently happening inside of a template, but also it can be used to speed up checks for whether a given element has a particular ancestor.

  • Walk over tree with depth-first search and post-order node traversal. This post-order traversal also fixes Unexpected validation error parent element when invalid elements are nested #1493 (thus superseding Improve validation of children of invalid elements #1511), as the original parent node will be reported in a validation error since the deepest node is processed first.
  • Instead of stack array with contexts, recursively walk through DOM tree and keep track of the number of open ancestor elements.
  • Use open ancestor template elements to determine whether to skip validating attribute values.
  • Use open ancestor elements as a shortcut for the logic in AMP_Tag_And_Attribute_Sanitizer::get_ancestor_with_matching_spec_name().
  • Add DOMElement and array types for arguments to AMP_Tag_And_Attribute_Sanitizer methods.
  • Remove obsolete checks for arguments being DOMElement.
  • Eliminate unnecessary AMP_Tag_And_Attribute_Sanitizer::delegated_sanitize_disallowed_attribute_values_in_node().
  • Use data provider for test_replace_node_with_children_validation_errors.

Fixes #3137. Closes #3206. Fixes #1493.

schlessera and others added 8 commits September 5, 2019 08:17
…keep-attributes-with-mustache-placeholders-intact

* 'develop' of github.com:ampproject/amp-wp: (113 commits)
  This converts the keyboard cut handler to equal a copy handler to avoid bugs
  Fix aria-label adding helper function
  Remove extra line I added in resolving merge conflict
  Fix alignment of arrow
  Fix custom CTA text for page attachment
  Fix cut handler by shortcut
  Cleanup of duplicated comment
  Add unit testing to `addVideoAriaLabel`
  Remove unused piece of code
  Remove Cloudflare AMP cache since deprecated
  Handle cut keyboard requests. (#3231)
  Page Attachment block (#3035)
  Keyboard & Right-Click Menu Copy + Paste (#3083)
  Animation Previews (#3104)
  Make internal methods inaccessible
  Omit the ecosystem link also when using a core theme
  Add skipped e2e test for the video block
  Add array_colum() pollyfill for PHP < 5.5
  Add asserts to make sure we are not enqueueing both versions of dashicons
  Remove useless variable
  ...
… through DOM tree

* Instead of stack array with contexts, recursively walk through DOM tree and keep track of the number of open ancestor elements.
* Use open ancestor template elements to determine whether to skip validating attribute values.
* Use open ancestor elements as a shortcut for the logic in AMP_Tag_And_Attribute_Sanitizer::get_ancestor_with_matching_spec_name().
* Add DOMElement and array types for arguments to AMP_Tag_And_Attribute_Sanitizer methods.
* Remove obsolete checks for arguments being DOMElement.
* Eliminate unnecessary AMP_Tag_And_Attribute_Sanitizer::delegated_sanitize_disallowed_attribute_values_in_node().
* Use data provider for test_replace_node_with_children_validation_errors.
@westonruter westonruter added this to the v1.3 milestone Sep 15, 2019
@googlebot googlebot added the cla: yes Signed the Google CLA label Sep 15, 2019
…_node reported

Fix gathering of required component scripts with move to DFS+LRN. Now that the children are visited before the parents, the required component scripts must by passed up from visiting the children and only passed along if the parent does not then get removed.
@westonruter westonruter changed the base branch from fix/3137-keep-attributes-with-mustache-placeholders-intact to develop September 16, 2019 05:00
@googlebot
Copy link

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: no Has not signed the Google CLA and removed cla: yes Signed the Google CLA labels Sep 16, 2019
@westonruter westonruter changed the title Replace AMP node contexts with tracking open elements while recursing through DOM tree Keep track of the currently-open elements while walking over DOM and use depth-first post-order traversal to improve processing Sep 16, 2019
@westonruter
Copy link
Member Author

There is currently an unrelated fatal error happening when run PHPUnit tests against WordPress trunk:

Fatal error: Cannot declare class WP_Block_Styles_Registry, because the name is already in use in /app/public/content/plugins/gutenberg/lib/class-wp-block-styles-registry.php on line 14

Also reported in Gutenberg: WordPress/gutenberg#17444.

The issue is resolved if I check out the current master branch of Gutenberg, so this appears to be a temporary problem that will be fixed with the next release of Gutenberg, so this may warrant merging without that job passing. The E2E failure also seems unrelated.

@westonruter
Copy link
Member Author

In regards to the the CLA complaint, it seems like a bug with the bot because all of the commits from @schlessera were OK in the other PR: #3206. So I'm happy to forcibly flip the bit to cla: yes.

@schlessera
Copy link
Collaborator

@googlebot I consent.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes Signed the Google CLA and removed cla: no Has not signed the Google CLA labels Sep 16, 2019
@schlessera
Copy link
Collaborator

Regarding the array typehints: I'm all for adding stricter typehinting and you've noticed already that I've been adding them. But using array is broken and shouldn't be used, because it makes it impossible to replace dumb arrays by smarter iterables constructs later on. PHP 7 solved this by adding the iterable type, but we can't use that yet.

@schlessera
Copy link
Collaborator

Regarding the general approach of using open_templates[] with recursion:
I thought that the entire point of introducing the $stack in this sanitizer was to get away from recursion. PHP does not support tail call optimization (TCO), so there is not proper way of using recursion without introducing a non-scalable limitation. I think that the limitation might be high enough here to not be an actual issue, but that is hard to guess and can never be guaranteed.
So the code is generally fine with that simplification, if we keep this caveat in mind.
However, the template-specific logic should be changed, as the fact that there is currently only one single type of template is not something we should rely on.

@westonruter
Copy link
Member Author

Regarding the array typehints: I'm all for adding stricter typehinting and you've noticed already that I've been adding them. But using array is broken and shouldn't be used, because it makes it impossible to replace dumb arrays by smarter iterables constructs later on. PHP 7 solved this by adding the iterable type, but we can't use that yet.

@schlessera And here I was hoping you would be so proud of me 😄 But that makes perfect sense. Excellent point.

@westonruter
Copy link
Member Author

However, the template-specific logic should be changed, as the fact that there is currently only one single type of template is not something we should rely on.

We can actually rely on that currently because we will know the moment that non-Mustache templates are allowed because the change will correspond with spec updates. So we'll see it when performing a spec update, and we can make the required changes at that time (if ever).

@westonruter
Copy link
Member Author

I thought that the entire point of introducing the $stack in this sanitizer was to get away from recursion. PHP does not support tail call optimization (TCO), so there is not proper way of using recursion without introducing a non-scalable limitation. I think that the limitation might be high enough here to not be an actual issue, but that is hard to guess and can never be guaranteed.

@schlessera Interesting point. The tag-and-attribute sanitizer was introduced a long time ago (before my time) in #600, specifically in delputnam@e7e51f0. So I don't know if this was in mind when opting for a non-recursive solution. I kinda think that it may not have been on the radar, since the sanitizer was not yet concerned with validating constraints based on the location of a node in a hierarchy.

The concern here would be if an HTML page has a DOM tree with extreme nesting of elements. In HTML4 it appears the max nesting depth was 100 elements: https://bugzilla.mozilla.org/show_bug.cgi?id=256180#c36

This limit was lifted in HTML5, but it seems unusual to have documents with such deeply-nested elements (though I know they do exist). I can't seem to find any data on the frequency.

Perhaps this should be merged and then we can look at refactoring recursion to re-introduce iteration with a stack?

@westonruter
Copy link
Member Author

The E2E failure also seems unrelated.

It just passed!

image

@schlessera
Copy link
Collaborator

schlessera commented Sep 16, 2019

Perhaps this should be merged and then we can look at refactoring recursion to re-introduce iteration with a stack?

I'm okay with merging this and reconsidering for v2 (or until we hit bug reports).

@westonruter
Copy link
Member Author

@schlessera I just tested and it appears that libxml has a maximum nesting depth of 256.

Warning: DOMDocument::loadHTML(): Excessive depth in document: 256 use XML_PARSE_HUGE option

image

https://3v4l.org/2hpqn

So as long as PHP doesn't complain about recursing 256+ times, then this shouldn't be a problem faced by a recursive sanitize_element method. I added a5a27c4 to test this assumption.

@schlessera
Copy link
Collaborator

So as long as PHP doesn't complain about recursing 256+ times

PHP should be fine about that, but the default recursion limit if Xdebug is active is 100 (it can be raised, though).

@swissspidy
Copy link
Collaborator

This seems to have broken code coverage collection on Travis CI, see #3337.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Signed the Google CLA
Projects
None yet
4 participants