-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KSES: Preserve some additional invalid HTML comment syntaxes. #6395
KSES: Preserve some additional invalid HTML comment syntaxes. #6395
Conversation
When `wp_kses_split` processes a document it attempts to leave HTML comments relatively alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many other kinds which appear as the result of various invalid HTML markup. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to allow two additional kinds of HTML comments: - HTML comments with the incorrect closer `--!>`. - Closing tags with an invalid tag name, e.g. `</%dolly>`. In an HTML parser these all become comments, and so leaving them in the document should be a benign operation, improving the reliability of detecting comments in Core. These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax.
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:
https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/ |
src/wp-includes/kses.php
Outdated
* it should be interpreted as an HTML comment. It extends until the | ||
* first `>` character after the initial opening `</`. | ||
*/ | ||
if ( 1 === preg_match( '~</[^a-zA-Z][^>]*>~', $content ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this bales out an returns $content
if there is a match anywhere in the string. Is this expected? Would this kick in if that's inside of an attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call: I think this needs anchors. I'll review it and add them accordingly
Other than my |
I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch only addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.
Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular. In a local branch I was playing with replacing For instance, WordPress currently strips away
There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required. |
Is there a list of all the possible invalid comment syntaxes? |
They are not listed separated like this but you can find them in the parsing errors in the specification. The Tag Processor draws its own classification which is retrievable with
It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders. |
@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at This file has ANSI terminal codes for clearer output. You can |
d18593b
to
cc39a88
Compare
src/wp-includes/kses.php
Outdated
( # Detect comments of various flavors before attempting to find tags. | ||
(?:<!--.*?(--!?>|$)) # Well-formed HTML comments like `<!-- ... -->` and also with invalid `--!>` closer. | ||
| | ||
</[^a-zA-Z][^>]*> # Closing tags with invalid tag names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stupid question: could we somehow reuse the same part of the regular expression </[^a-zA-Z][^>]*>
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not stupid, but while we can do that, it's valuable IMO to leave this separate, as the closing tag with an invalid tag name has an incredibly simple and straightforward syntax pattern: the tag name starts with anything that isn't a
- z
(ASCII case-insensitive) and extends until the first >
.
So by leaving this separate we can be even more careful about what we're examining. For example, opening tags with invalid tag names are not interpreted as invalid comments. The following pattern just isn't very robust in detecting normal tags let alone invalid ones, so as long as the correct approach is simpler being separate, I think it's worthwhile not joining them.
Thanks for omitting the others improvements, it really helps to understand that they are not related to allowing the some additional invalid comment syntaxes and are completely separate fixes. It also reduces the risk of a revert in the unlikely that case there's a problem with the additional fixes. Let's do those in a separate PR 👍 |
while ( $transformed !== $content ) { | ||
$transformed = wp_kses( $content, $allowed_html, $allowed_protocols ); | ||
$content = $transformed; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is copied from the existing comment behavior below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use an inline comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me know, love the changes to make it more concise.
@@ -1080,12 +1092,48 @@ function _wp_kses_split_callback( $matches ) { | |||
function wp_kses_split2( $content, $allowed_html, $allowed_protocols ) { | |||
$content = wp_kses_stripslashes( $content ); | |||
|
|||
// It matched a ">" character. | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should wp_kses_split2
also get an updated @since
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great idea. Thanks for catching it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe add an extra unit test case?
Found an issue with encoding HTML comments, I'll examine. Update: Was testing locally on another branch and got confused; all is fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
| | ||
</[^a-zA-Z][^>]*> # - Closing tags with invalid tag names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines are essentially the only change to the regex pattern (an addition).
* first `>` character after the initial opening `</`. | ||
* | ||
* Preserve these comments and do not treat them like tags. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the >
character ever not terminate the string? In other words, what would change if you removed the $ from the regexp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I know now. Then we couldn’t assume bkw that the content starts at the at index two and ends at index minus one
* | ||
* Preserve these comments and do not treat them like tags. | ||
*/ | ||
if ( 1 === preg_match( '~^</[^a-zA-Z][^>]*>$~', $content ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any need to check for any invalid comment closers? I’m on a bus and can't remember exactly, but there were cases like triple dash or cdata-lookalike closer. Or does none of that matter with these tag-closer-like comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few notes I mean mostly as double-checking questions - I think you got everything right here and we’re good to go. Thank you!
As a side note, I’d love to see a fuzzer trying all the different risky inputs with and without every kses PR
6c708fb
to
5949d4c
Compare
When `wp_kses_split` processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to recognize and preserve an additional kind of HTML comment: closing tags with an invalid tag name, e.g. `</%dolly>`. These invalid closing tags must be interpreted as comments by a browser. This bug fix aligns the implementation of `wp_kses_split()` more closely with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments, but it does propose a minor fix that hopefully won't break any existing code or projects. Developed in #6395 Discussed in https://core.trac.wordpress.org/ticket/61009 Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam. See #61009. git-svn-id: https://develop.svn.wordpress.org/trunk@58418 602fd350-edb4-49c9-b593-d223f7449a82
When `wp_kses_split` processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to recognize and preserve an additional kind of HTML comment: closing tags with an invalid tag name, e.g. `</%dolly>`. These invalid closing tags must be interpreted as comments by a browser. This bug fix aligns the implementation of `wp_kses_split()` more closely with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments, but it does propose a minor fix that hopefully won't break any existing code or projects. Developed in WordPress/wordpress-develop#6395 Discussed in https://core.trac.wordpress.org/ticket/61009 Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam. See #61009. Built from https://develop.svn.wordpress.org/trunk@58418 git-svn-id: http://core.svn.wordpress.org/trunk@57867 1a063a9b-81f0-0310-95a4-ce76da25c4cd
When `wp_kses_split` processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to recognize and preserve an additional kind of HTML comment: closing tags with an invalid tag name, e.g. `</%dolly>`. These invalid closing tags must be interpreted as comments by a browser. This bug fix aligns the implementation of `wp_kses_split()` more closely with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments, but it does propose a minor fix that hopefully won't break any existing code or projects. Developed in WordPress/wordpress-develop#6395 Discussed in https://core.trac.wordpress.org/ticket/61009 Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam. See #61009. Built from https://develop.svn.wordpress.org/trunk@58418 git-svn-id: https://core.svn.wordpress.org/trunk@57867 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: Core-61009
Status
The one test failure appears to be related to deprecating the outdated IE comment syntax. It's not a problem because the test enshrines a mis-parsing of special HTML comments.
This change fixes that mis-parsing and will prevent further corruption.
Description
When
wp_kses_split
processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:
<!-->
,<!-- --!>
, etc…</3>
,</%happy>
, etc…<![CDATA[…]]>
<?include "blarg">
This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:
--!>
, because this one was a simple and easy change.</%dolly>
, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.
On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split() more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.
Discussion
Concerning the removed
kses
test, as far as I can tell there is no support for conditional comments like these after IE9, which I believe we don't support anymore. All modern browsers will interpret the entire span of content as a single comment.Unfortunately there's a twist to the attack which follow best-practice recommendations for conditional comments, which is to add a short abruptly-closed comment after the opening.
The addition of
<!-->
ends the opening comment and re-reveals the<SCRIPT>
tag. However, this should not affect the security of WordPress because it will be looking for theSCRIPT
tags through other code paths after the removal of this restriction.