KSES: Preserve some additional invalid HTML comment syntaxes. #6395

dmsnell · 2024-04-15T14:32:47Z

Status

The one test failure appears to be related to deprecating the outdated IE comment syntax. It's not a problem because the test enshrines a mis-parsing of special HTML comments.

This change fixes that mis-parsing and will prevent further corruption.

Description

When wp_kses_split processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others.

In HTML there are many kinds of invalid markup which, according to the specification, are to be interpreted as an HTML comment. These include, but are not limited to:

HTML comments with invalid syntax, <!-->, <!-- --!>, etc…
HTML closing tags whose tag name is invalid </3>, </%happy>, etc…
Things that look like XML CDATA sections, <![CDATA[…]]>
Things that look like XML Processor Instruction nodes, <?include "blarg">

This patch makes a minor adjustment to the algorithm in wp_kses_split to allow two additional kinds of HTML comments:

HTML comments with the incorrect closer --!>, because this one was a simple and easy change.
Closing tags with an invalid tag name, e.g. </%dolly>, because these are required to open up explorations in Gutenberg on Bits, a new iteration of dynamic tokens for externally-sourced data, or "Shortcodes 2.0"

These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax because as long as Core removes them, there's no way to load content from the database and experiment on the full life cycle of potential Bits systems.

On its own, however, this represents a kind of bug fix for Core, making the implementation of wp_kses_split() more closely align with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments (because that is a much deeper issue and involves many more questions about existing expectations) but it does propose a couple of hopefully and expectedly minor fixes that hopefully won't break any existing code or projects.

Discussion

Concerning the removed kses test, as far as I can tell there is no support for conditional comments like these after IE9, which I believe we don't support anymore. All modern browsers will interpret the entire span of content as a single comment.

Unfortunately there's a twist to the attack which follow best-practice recommendations for conditional comments, which is to add a short abruptly-closed comment after the opening.

<!--[if gte IE 4]><!-->
<SCRIPT>alert('XSS');</SCRIPT>
<![endif]-->

The addition of <!--> ends the opening comment and re-reveals the <SCRIPT> tag. However, this should not affect the security of WordPress because it will be looking for the SCRIPT tags through other code paths after the removal of this restriction.

Warning: Should verify this in a clear way.

When `wp_kses_split` processes a document it attempts to leave HTML comments relatively alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many other kinds which appear as the result of various invalid HTML markup. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to allow two additional kinds of HTML comments: - HTML comments with the incorrect closer `--!>`. - Closing tags with an invalid tag name, e.g. `</%dolly>`. In an HTML parser these all become comments, and so leaving them in the document should be a benign operation, improving the reliability of detecting comments in Core. These invalid closing tags, which in a browser are interpreted as comments, are one proposal for a placeholder mechanism in the HTML API unlocking HTML templating, a new kind of shortcode, and more. Having these persist in Core is a requirement for exploring and utilizing the new syntax.

github-actions · 2024-04-15T14:33:04Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, zieladam, ellatrix, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions · 2024-04-15T14:49:22Z

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

adamziel · 2024-04-16T09:54:40Z

The test failure is related to IE conditional comments. However, dealing with those shouldn't matter anymore:

Conditional comments are conditional statements interpreted by Microsoft Internet Explorer versions 5 through 9 in HTML source code

https://www.sitepoint.com/microsoft-drop-ie10-conditional-comments/

adamziel · 2024-04-16T09:59:37Z

src/wp-includes/kses.php

+	 * it should be interpreted as an HTML comment. It extends until the
+	 * first `>` character after the initial opening `</`.
+	 */
+	if ( 1 === preg_match( '~</[^a-zA-Z][^>]*>~', $content ) ) {


So this bales out an returns $content if there is a match anywhere in the string. Is this expected? Would this kick in if that's inside of an attribute?

good call: I think this needs anchors. I'll review it and add them accordingly

src/wp-includes/kses.php

adamziel · 2024-04-16T10:13:52Z

Other than my preg_match question, this LGTM. I'd just remove the failing test case, conditional comments were only relevant up to IE9 which WordPress doesn't support anymore. And even if it did, the new output doesn't seem like an XSS vector. I'd still get inputs from others, cc @westonruter

westonruter · 2024-04-16T18:16:56Z

Confirmed that Chrome parses these as comments:

<!-- comment 1 -->
foo
<!-- comment 2 --!>
bar
</%comment3>
baz

How common are these other invalid comments?

I presume you'll be adding tests specifically for the changes here?

dmsnell · 2024-04-17T13:50:15Z

How common are these other invalid comments?

I'm guessing rare, but it might be a fun experiment to scan some sites and count them. This patch only addresses a single kind of invalid comment leaving most other constructions alone. This is because I'm trying to be tactical here and leave a small surface area of change.

I presume you'll be adding tests specifically for the changes here?

Yes absolutely. Before I do that though I want to make sure folks are on board with the change. If we use these "funky comments" for Bits, template placeholders, translation wrappers, they could become quite popular.

In a local branch I was playing with replacing wp_kses_split2 using the HTML API and it provides some great affordances, but as usual, far more questions to resolve.

For instance, WordPress currently strips away <:::>, interpreting it as an invalid tag. It's not though, it's legitimate plaintext despite being invalid markup. So now what do we do?

Do we attempt to maintain this legacy behavior that corrupts documents that otherwise are fine?
Do we preserve this legitimate HTML and allow its surprising form to thread through a stack of code expecting more familiar HTML?
Do we eagerly replace confusing HTML like this before passing it along into WordPress, taking a performance hit to provide regularity and mitigate further misinterpretation by later code on the render chain?

There are further nuances. For now I would like to preserve just this one case so that we can get testing and exploring the UX of potential "Bits" systems. In my opinion this code is no worse off than the existing code, so the fact that it doesn't solve more syntax irregularities seems like something that shouldn't be required.

westonruter · 2024-04-17T20:52:02Z

This patch only addresses a single kind of invalid comment leaving most other constructions alone.

Is there a list of all the possible invalid comment syntaxes?

dmsnell · 2024-04-18T15:47:50Z

Is there a list of all the possible invalid comment syntaxes?

They are not listed separated like this but you can find them in the parsing errors in the specification.

The Tag Processor draws its own classification which is retrievable with get_comment_type(), whereas it distinguishes:

Comments that are basically well-formed.
Comments that were abruptly closed.
Comments that appear as though someone wanted to write a CDATA section.
Comments that appear as though someone wanted to write an SGML processing instruction.
Comments that form from a variety of other invalid HTML constructions.

It separates out comments that appear as the result of being a closing tag with an invalid tag name. These it calls funky comments and these are the proposed syntax for Bits and other atomic shortcode like placeholders.

dmsnell · 2024-04-19T16:47:57Z

How common are these other invalid comments?

@westonruter I took a list of the top-ranked domains and grabbed the first 296,000 pages at / for the domain.

invalid-comments.txt

This file has ANSI terminal codes for clearer output. You can cat it and see colors…

ellatrix · 2024-06-12T15:47:40Z

src/wp-includes/kses.php

+	(                         # Detect comments of various flavors before attempting to find tags.
+		(?:<!--.*?(--!?>|$))  # Well-formed HTML comments like `<!-- ... -->` and also with invalid `--!>` closer.
+		|
+		</[^a-zA-Z][^>]*>     # Closing tags with invalid tag names.


Stupid question: could we somehow reuse the same part of the regular expression </[^a-zA-Z][^>]*> below?

Not stupid, but while we can do that, it's valuable IMO to leave this separate, as the closing tag with an invalid tag name has an incredibly simple and straightforward syntax pattern: the tag name starts with anything that isn't a - z (ASCII case-insensitive) and extends until the first >.

So by leaving this separate we can be even more careful about what we're examining. For example, opening tags with invalid tag names are not interpreted as invalid comments. The following pattern just isn't very robust in detecting normal tags let alone invalid ones, so as long as the correct approach is simpler being separate, I think it's worthwhile not joining them.

ellatrix · 2024-06-12T15:54:45Z

Thanks for omitting the others improvements, it really helps to understand that they are not related to allowing the some additional invalid comment syntaxes and are completely separate fixes. It also reduces the risk of a revert in the unlikely that case there's a problem with the additional fixes. Let's do those in a separate PR 👍

dmsnell · 2024-06-12T16:01:07Z

src/wp-includes/kses.php

+		while ( $transformed !== $content ) {
+			$transformed = wp_kses( $content, $allowed_html, $allowed_protocols );
+			$content     = $transformed;
+		}


this is copied from the existing comment behavior below

Could use an inline comment?

ellatrix

Looks good to me know, love the changes to make it more concise.

ellatrix · 2024-06-12T16:05:59Z

src/wp-includes/kses.php

@@ -1080,12 +1092,48 @@ function _wp_kses_split_callback( $matches ) {
 function wp_kses_split2( $content, $allowed_html, $allowed_protocols ) {
 	$content = wp_kses_stripslashes( $content );

-	// It matched a ">" character.
+	/*


Should wp_kses_split2 also get an updated @since?

Yes, great idea. Thanks for catching it!

ellatrix

Should we maybe add an extra unit test case?

dmsnell · 2024-06-12T16:21:17Z

Found an issue with encoding HTML comments, I'll examine.

Update: Was testing locally on another branch and got confused; all is fine.

westonruter

LGTM!

westonruter · 2024-06-12T16:34:17Z

src/wp-includes/kses.php

+		|
+		</[^a-zA-Z][^>]*>  #  - Closing tags with invalid tag names.


These two lines are essentially the only change to the regex pattern (an addition).

adamziel · 2024-06-12T19:15:07Z

src/wp-includes/kses.php

+	 * first `>` character after the initial opening `</`.
+	 *
+	 * Preserve these comments and do not treat them like tags.
+	 */


Would the > character ever not terminate the string? In other words, what would change if you removed the $ from the regexp?

Aha, I know now. Then we couldn’t assume bkw that the content starts at the at index two and ends at index minus one

adamziel · 2024-06-12T19:18:43Z

src/wp-includes/kses.php

+	 *
+	 * Preserve these comments and do not treat them like tags.
+	 */
+	if ( 1 === preg_match( '~^</[^a-zA-Z][^>]*>$~', $content ) ) {


Is there any need to check for any invalid comment closers? I’m on a bus and can't remember exactly, but there were cases like triple dash or cdata-lookalike closer. Or does none of that matter with these tag-closer-like comments?

adamziel

I left a few notes I mean mostly as double-checking questions - I think you got everything right here and we’re good to go. Thank you!

As a side note, I’d love to see a fuzzer trying all the different risky inputs with and without every kses PR

When `wp_kses_split` processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to recognize and preserve an additional kind of HTML comment: closing tags with an invalid tag name, e.g. `</%dolly>`. These invalid closing tags must be interpreted as comments by a browser. This bug fix aligns the implementation of `wp_kses_split()` more closely with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments, but it does propose a minor fix that hopefully won't break any existing code or projects. Developed in #6395 Discussed in https://core.trac.wordpress.org/ticket/61009 Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam. See #61009. git-svn-id: https://develop.svn.wordpress.org/trunk@58418 602fd350-edb4-49c9-b593-d223f7449a82

dmsnell · 2024-06-15T06:33:27Z

Merged in [58418]
c756dfe

When `wp_kses_split` processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to recognize and preserve an additional kind of HTML comment: closing tags with an invalid tag name, e.g. `</%dolly>`. These invalid closing tags must be interpreted as comments by a browser. This bug fix aligns the implementation of `wp_kses_split()` more closely with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments, but it does propose a minor fix that hopefully won't break any existing code or projects. Developed in WordPress/wordpress-develop#6395 Discussed in https://core.trac.wordpress.org/ticket/61009 Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam. See #61009. Built from https://develop.svn.wordpress.org/trunk@58418 git-svn-id: http://core.svn.wordpress.org/trunk@57867 1a063a9b-81f0-0310-95a4-ce76da25c4cd

When `wp_kses_split` processes a document it attempts to leave HTML comments alone. It makes minor adjustments, but leaves the comments in the document in its output. Unfortunately it only recognizes one kind of HTML comment and rejects many others. This patch makes a minor adjustment to the algorithm in `wp_kses_split` to recognize and preserve an additional kind of HTML comment: closing tags with an invalid tag name, e.g. `</%dolly>`. These invalid closing tags must be interpreted as comments by a browser. This bug fix aligns the implementation of `wp_kses_split()` more closely with its stated goal of leaving HTML comments as comments. It doesn't attempt to fully fix the mis-parsed comments, but it does propose a minor fix that hopefully won't break any existing code or projects. Developed in WordPress/wordpress-develop#6395 Discussed in https://core.trac.wordpress.org/ticket/61009 Props ellatrix, dmsnell, joemcgill, jorbin, westonruter, zieladam. See #61009. Built from https://develop.svn.wordpress.org/trunk@58418 git-svn-id: https://core.svn.wordpress.org/trunk@57867 1a063a9b-81f0-0310-95a4-ce76da25c4cd

Replace the invalid comment closer as well.

0601582

adamziel reviewed Apr 16, 2024

View reviewed changes

src/wp-includes/kses.php Outdated Show resolved Hide resolved

adamziel mentioned this pull request Apr 16, 2024

Auto-post a Playground preview link on every PR WordPress/gutenberg#60795

Open

Remove kses test verifying guard against IE conditional comments.

fd66dda

dmsnell mentioned this pull request Jun 7, 2024

Add prefix / suffix to Post Date block WordPress/gutenberg#61920

Open

dmsnell added 2 commits June 12, 2024 15:54

Ensure PCRE pattern matches full string match.

e0e36b0

Run comment content through kses

cc39a88

dmsnell force-pushed the html-api/allow-storing-funky-comments branch from d18593b to cc39a88 Compare June 12, 2024 13:55

dmsnell added 3 commits June 12, 2024 16:01

Bring invalid test back, update the expected output.

5eacbc3

Remove additional fix.

7c44482

Remove other fixes.

b60d0e0

ellatrix reviewed Jun 12, 2024

View reviewed changes

Remove other fix, update docs.

8a31702

PCRE pattern cleanup.

5dbbe50

dmsnell commented Jun 12, 2024

View reviewed changes

ellatrix approved these changes Jun 12, 2024

View reviewed changes

ellatrix reviewed Jun 12, 2024

View reviewed changes

Update @SInCE tags

dca7d00

westonruter approved these changes Jun 12, 2024

View reviewed changes

adamziel reviewed Jun 12, 2024

View reviewed changes

adamziel approved these changes Jun 12, 2024

View reviewed changes

Add a basic unit test to prevent regressions.

5949d4c

dmsnell force-pushed the html-api/allow-storing-funky-comments branch from 6c708fb to 5949d4c Compare June 12, 2024 20:19

Merge branch 'trunk' into html-api/allow-storing-funky-comments

5ae60b3

dmsnell closed this Jun 15, 2024

dmsnell deleted the html-api/allow-storing-funky-comments branch June 15, 2024 06:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KSES: Preserve some additional invalid HTML comment syntaxes. #6395

KSES: Preserve some additional invalid HTML comment syntaxes. #6395

dmsnell commented Apr 15, 2024 •

edited

Loading

github-actions bot commented Apr 15, 2024 •

edited

Loading

github-actions bot commented Apr 15, 2024

adamziel commented Apr 16, 2024

adamziel Apr 16, 2024

dmsnell May 27, 2024

adamziel commented Apr 16, 2024 •

edited

Loading

westonruter commented Apr 16, 2024

dmsnell commented Apr 17, 2024

westonruter commented Apr 17, 2024

dmsnell commented Apr 18, 2024

dmsnell commented Apr 19, 2024

ellatrix Jun 12, 2024

dmsnell Jun 12, 2024

ellatrix commented Jun 12, 2024

dmsnell Jun 12, 2024

ellatrix Jun 12, 2024

ellatrix left a comment •

edited

Loading

ellatrix Jun 12, 2024

dmsnell Jun 12, 2024

ellatrix left a comment

dmsnell commented Jun 12, 2024 •

edited

Loading

westonruter left a comment

westonruter Jun 12, 2024

adamziel Jun 12, 2024

adamziel Jun 12, 2024

adamziel Jun 12, 2024 •

edited

Loading

adamziel left a comment

dmsnell commented Jun 15, 2024 •

edited

Loading

KSES: Preserve some additional invalid HTML comment syntaxes. #6395

KSES: Preserve some additional invalid HTML comment syntaxes. #6395

Conversation

dmsnell commented Apr 15, 2024 • edited Loading

Status

Description

Discussion

github-actions bot commented Apr 15, 2024 • edited Loading

github-actions bot commented Apr 15, 2024

Test using WordPress Playground

Some things to be aware of

adamziel commented Apr 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel commented Apr 16, 2024 • edited Loading

westonruter commented Apr 16, 2024

dmsnell commented Apr 17, 2024

westonruter commented Apr 17, 2024

dmsnell commented Apr 18, 2024

dmsnell commented Apr 19, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ellatrix commented Jun 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ellatrix left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ellatrix left a comment

Choose a reason for hiding this comment

dmsnell commented Jun 12, 2024 • edited Loading

westonruter left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Jun 12, 2024 • edited Loading

Choose a reason for hiding this comment

adamziel left a comment

Choose a reason for hiding this comment

dmsnell commented Jun 15, 2024 • edited Loading

dmsnell commented Apr 15, 2024 •

edited

Loading

github-actions bot commented Apr 15, 2024 •

edited

Loading

adamziel commented Apr 16, 2024 •

edited

Loading

ellatrix left a comment •

edited

Loading

dmsnell commented Jun 12, 2024 •

edited

Loading

adamziel Jun 12, 2024 •

edited

Loading

dmsnell commented Jun 15, 2024 •

edited

Loading