-
Notifications
You must be signed in to change notification settings - Fork 3.1k
HTML API: Double escape character refs when setting attribute values #10143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value ); | ||
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) | ||
? esc_url( $value ) | ||
: htmlspecialchars( $value, ENT_QUOTES | ENT_HTML5 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matches how set_modifiable_text()
escapes:
wordpress-develop/src/wp-includes/html-api/class-wp-html-tag-processor.php
Lines 3756 to 3761 in abc5dd8
if ( self::STATE_TEXT_NODE === $this->parser_state ) { | |
$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( | |
$this->text_starts_at, | |
$this->text_length, | |
htmlspecialchars( $plaintext_content, ENT_QUOTES | ENT_HTML5 ) | |
); |
This performs escaping like this:
<
=><
>
=>>
&
=>&
"
=>"
'
=>'
This is more escaping than is strictly necessary because "
is always used a few lines later, but it aligns with common practices in WordPress and the other HTML API escaping.
$updated_attribute = "{$name}=\"{$escaped_new_value}\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t trust the PHP man page descriptions: here is what it says in a special called-out note:
Note that this function does not translate anything beyond what is listed above. For full entity translation, see
htmlentities()
but it does more
$s = "z\x00\x95\xC2\x96b\xED\xa0\x80"; // letter, null, invalid byte, C1 control, letter, surrogate half
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 ) );
string(0) ""
// minimal subpart leaves us with four substitute characters
// (what looks like the second substitute character is being display by the browser, not the bytes forming U+FFFD, as it renders a C1 control)
var_dump( mb_scrub( $s, 'UTF-8' ) );
string(17) "z��b���"
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE ) );
string(11) "z��b�"
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_IGNORE ) );
string(5) "z�b"
var_dump( htmlspecialchars( $s, ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED ) );
string(0) ""
given that this is yet-again another PHP function which doesn’t do what it says, and that we generally rely on the browser in the HTML API to handle invalid encodings, maybe we should avoid this here and update set_modifiable_text()
to only replace those characters you mention. strtr()
and str_replace()
should both be fast, I would expect, and possibly faster than htmlspecialchars()
anyway, and it doesn’t require deep dives into the PHP source code or creating complex test cases to make the behavior obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed both escapes to use strtr()
in 1435b6e.
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
The HTML5 encoding uses ' instead of the numeric character reference.
The "double encode" behavior of |
This comment was marked as resolved.
This comment was marked as resolved.
*/ | ||
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value ); | ||
$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) | ||
? esc_url( $value ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The esc_url()
function also does some stuff with entities, namely:
wordpress-develop/src/wp-includes/formatting.php
Lines 4522 to 4527 in fe2b33f
// Replace ampersands and single quotes only when displaying. | |
if ( 'display' === $_context ) { | |
$url = wp_kses_normalize_entities( $url ); | |
$url = str_replace( '&', '&', $url ); | |
$url = str_replace( "'", ''', $url ); | |
} |
To avoid that, I think using esc_url_raw()
is preferable:
? esc_url( $value ) | |
? htmlspecialchars( esc_url_raw( $value ) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is only a formatting change, I broke the ternary onto three lines.
Can we consider that suggestion in a separate ticket and PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure, but it it is related to the double-escaping issue, because esc_url()
does wp_kses_normalize_entities()
which includes converting all &
to &
among other things. For example, these:
var_dump( esc_url( 'http://example.com/?foo&bar' ) );
var_dump( esc_url( 'http://example.com/?foo&bar' ) );
var_dump( esc_url( 'http://example.com/?foo&bar' ) );
All result in:
string(32) "http://example.com/?foo&bar"
string(32) "http://example.com/?foo&bar"
string(32) "http://example.com/?foo&bar"
I don't really see why a URI attribute should be treated differently at the HTML API layer. So this could instead just be:
$escaped_new_value = strtr(
$value,
array(
'<' => '<',
'>' => '>',
'&' => '&',
'"' => '"',
"'" => ''',
)
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right that the double-escaping issue also exists in esc_url()
, that's unfortunate. esc_url()
does a lot of things, it doesn't seem as easily replaced as esc_attr()
or esc_html()
. I suspect more discussion will be necessary regarding URL-related changes which is why I'm reluctant to include those changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sirreal why not use &
for ampersand to avoid issues in kses
? are you not worried about this content then being corrupted on further passes through the sanitizers?
|
Not particularly; I just wanted your input on whether it would likely be more durable if we made this change here and now. |
* | ||
* For string attributes, the value is escaped using the `esc_attr` function. | ||
* HTML escaping will be performed for string attribute values. The input value should | ||
* not be unescaped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a confusing double-negative here.
can we rearrange this so it’s a positive indication of what to do, rather than a negative impression of what not to do?
* This function handles all HTML escaping: send normal and unescaped string values.
* The HTML API will convey input strings such that a browser will read the decoded
* string values as verbatim copies to what is provided here.
*
* Example:
*
* // Renders “Eggs & Milk” in a browser, encoded as “Eggs & Milk”.
* $processor->set_attribute( 'title', 'Eggs & Milk' );
*
* // Renders “Eggs & Milk” in a browser, encoded as “Eggs &amp; Milk”.
* $processor->set_attribute( 'title', 'Eggs & Milk' );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated the description in 0db1374, happy to make further adjustments.
Trac ticket: https://core.trac.wordpress.org/ticket/64054
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.