Skip to content

Conversation

sirreal
Copy link
Member

@sirreal sirreal commented Oct 3, 2025

Attribute values set with the HTML API method set_attribute() are escaped with esc_attr(). That function avoids "double encoding" things that look like HTML character references.

The HTML API should encode whatever it receives, and apply "double encoding." The HTML API expects to receive plain string inputs and manage any necessary encoding itself. The fact that "double encoding" is disabled violates this expectation and makes it difficult correctly to set attribute values that contain sequences that appear to be HTML character references.

By contrast, set_modifiable_text() does not rely on esc_html() and uses htmlspecialchars() directly. It will encode HTML character references as expected.

The text & appears to be an encoded character reference:

$amp_text = '&';
$p = WP_HTML_Processor::create_fragment( '<p>x</p>' );
$p->next_tag();
$p->set_attribute( 'data-attr', $amp_text );
$p->next_token();
$p->set_modifiable_text( $amp_text );
echo $p->get_updated_html();

This prints the following HTML:

<p data-attr="&amp;">&amp;amp;</p>

Notice how the input text is treated differently between an attribute and text. The HTML encoding of the & character is the same in both contexts. The attribute has the value & instead of the expected &. The text node in the P element correctly renders & as expected.

Here's a demo of the difference in behavior between setting attributes and modifiable text.

Trac ticket: https://core.trac.wordpress.org/ticket/64054


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Copy link

github-actions bot commented Oct 3, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value );
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true )
? esc_url( $value )
: htmlspecialchars( $value, ENT_QUOTES | ENT_HTML5 );
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches how set_modifiable_text() escapes:

if ( self::STATE_TEXT_NODE === $this->parser_state ) {
$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
$this->text_starts_at,
$this->text_length,
htmlspecialchars( $plaintext_content, ENT_QUOTES | ENT_HTML5 )
);

This performs escaping like this:

  • < => &lt;
  • > => &gt;
  • & => &amp;
  • " => &quot;
  • ' => &apos;

This is more escaping than is strictly necessary because " is always used a few lines later, but it aligns with common practices in WordPress and the other HTML API escaping.

$updated_attribute = "{$name}=\"{$escaped_new_value}\"";

Copy link
Member

@dmsnell dmsnell Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t trust the PHP man page descriptions: here is what it says in a special called-out note:

Note that this function does not translate anything beyond what is listed above. For full entity translation, see htmlentities()

but it does more

$s = "z\x00\x95\xC2\x96b\xED\xa0\x80"; // letter, null, invalid byte, C1 control, letter, surrogate half

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 ) );
string(0) ""

// minimal subpart leaves us with four substitute characters
// (what looks like the second substitute character is being display by the browser, not the bytes forming U+FFFD, as it renders a C1 control)
var_dump( mb_scrub( $s, 'UTF-8' ) );
string(17) "z��b���"

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE ) );
string(11) "z��b�"

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_IGNORE ) );
string(5) "z�b"

var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED ) );
string(0) ""

given that this is yet-again another PHP function which doesn’t do what it says, and that we generally rely on the browser in the HTML API to handle invalid encodings, maybe we should avoid this here and update set_modifiable_text() to only replace those characters you mention. strtr() and str_replace() should both be fast, I would expect, and possibly faster than htmlspecialchars() anyway, and it doesn’t require deep dives into the PHP source code or creating complex test cases to make the behavior obvious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed both escapes to use strtr() in 1435b6e.

Copy link

github-actions bot commented Oct 3, 2025

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@sirreal
Copy link
Member Author

sirreal commented Oct 6, 2025

The "double encode" behavior of esc_attr() (which really comes from htmlspecialchars()) is very strange when considering named character references that appear with and without a trailing ;. They're both character references and will be decoded by the browser (it's slightly more complex) but the double-encoded behavior does not treat them the same. Here's a good demo with some explanation.

@sirreal sirreal requested review from westonruter and dmsnell October 6, 2025 10:17
@sirreal

This comment was marked as resolved.

*/
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value );
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true )
? esc_url( $value )
Copy link
Member

@westonruter westonruter Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The esc_url() function also does some stuff with entities, namely:

// Replace ampersands and single quotes only when displaying.
if ( 'display' === $_context ) {
$url = wp_kses_normalize_entities( $url );
$url = str_replace( '&amp;', '&#038;', $url );
$url = str_replace( "'", '&#039;', $url );
}

To avoid that, I think using esc_url_raw() is preferable:

Suggested change
? esc_url( $value )
? htmlspecialchars( esc_url_raw( $value ) )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is only a formatting change, I broke the ternary onto three lines.

Can we consider that suggestion in a separate ticket and PR?

Copy link
Member

@westonruter westonruter Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, but it it is related to the double-escaping issue, because esc_url() does wp_kses_normalize_entities() which includes converting all & to &amp; among other things. For example, these:

var_dump( esc_url( 'http://example.com/?foo&amp;bar' ) );
var_dump( esc_url( 'http://example.com/?foo&#038;bar' ) );
var_dump( esc_url( 'http://example.com/?foo&bar' ) );

All result in:

string(32) "http://example.com/?foo&#038;bar"
string(32) "http://example.com/?foo&#038;bar"
string(32) "http://example.com/?foo&#038;bar"

I don't really see why a URI attribute should be treated differently at the HTML API layer. So this could instead just be:

$escaped_new_value = strtr(
	$value,
	array(
		'<' => '&lt;',
		'>' => '&gt;',
		'&' => '&amp;',
		'"' => '&quot;',
		"'" => '&apos;',
	)
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right that the double-escaping issue also exists in esc_url(), that's unfortunate. esc_url() does a lot of things, it doesn't seem as easily replaced as esc_attr() or esc_html(). I suspect more discussion will be necessary regarding URL-related changes which is why I'm reluctant to include those changes here.

See r58472 and r58473.

Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sirreal why not use &#38; for ampersand to avoid issues in kses? are you not worried about this content then being corrupted on further passes through the sanitizers?

@sirreal
Copy link
Member Author

sirreal commented Oct 8, 2025

&amp; seemed like the most common, expected, and legible encoding. r60616 addressed the KSES character reference issues I was aware aware of. Do you think we should use the numeric form?

@dmsnell
Copy link
Member

dmsnell commented Oct 8, 2025

Do you think we should use the numeric form?

Not particularly; I just wanted your input on whether it would likely be more durable if we made this change here and now.

*
* For string attributes, the value is escaped using the `esc_attr` function.
* HTML escaping will be performed for string attribute values. The input value should
* not be unescaped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a confusing double-negative here.
can we rearrange this so it’s a positive indication of what to do, rather than a negative impression of what not to do?

 * This function handles all HTML escaping: send normal and unescaped string values.
 * The HTML API will convey input strings such that a browser will read the decoded
 * string values as verbatim copies to what is provided here.
 *
 * Example:
 *
 *     // Renders “Eggs & Milk” in a browser, encoded as “Eggs &amp; Milk”.
 *     $processor->set_attribute( 'title', 'Eggs & Milk' );
 *
 *     // Renders “Eggs &amp; Milk” in a browser, encoded as “Eggs &amp;amp; Milk”.
 *     $processor->set_attribute( 'title', 'Eggs &amp; Milk' );

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated the description in 0db1374, happy to make further adjustments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants