Gutenberg for non-admins and HTML entities #50050

TrevorMills · 2023-04-24T21:58:09Z

Description

As a non-administrator ( someone without unfiltered_html capability ), when I save a prop to a React based component, some characters are getting converted to their encoded versions - & becomes &, > becomes >.

I've verified this on a vanilla WordPress installation - no other plugins and the default twentytwentythree theme.

Consider the following simple component, which eventually becomes a block:

import React from 'react';

export interface IMyComponentProps {
    title: string;
}

/**
 * MyComponent example
 * @block
 */
const MyComponent: React.FC<IMyComponentProps> = ({ title }) => {
    return <div className="my-component">
        <h2>{title}</h2>
    </div>;
};

export default MyComponent;

I work in a system that allows such a component to be turned into block automatically, and the block looks like this:

If a user without unfiltered_html capability saves such a post, what gets written to the DB is the following:

<!-- wp:calian-com-components/calian-my-component {"title":"This \u0026amp; That"} -->
<div class="my-component"><h2>This &amp; That</h2></div>
<!-- /wp:calian-com-components/calian-my-component -->

In the attributes for the block, the ampersand ends up getting encoded as \u0026amp;. That's a raw value that gets passed along as attributes, so when the page refreshes in the back end, that ends up showing up as &. The immediate fallout from that is that the "Attempt Block Recovery" shows up because of the discrepancy between the & that's written to the database and the &amp; that the block now wants to render:

And once you recover the block, that & shows up in the prop for the block:

I consider this a bug because it's changing the prop that gets passed to the React component from a & to &, causing the Attempt Block Recovery, and looking ugly for the non-admin users. React components that accept props and render out markup shouldn't have to worry about HTML decoding the props it receives.

Further, I wonder about the reasoning behind writing HTML entities to the database at all, given that recommended practice is to put all text from the DB through escaping filters before displaying them. This happens for Advanced Custom Fields when I try to save & in text fields. I'm sure that's a can of worms, and I am not asking for any change there.

Step-by-step reproduction instructions

Create a simple block that has a text field property controlling one of its attributes
As a non-admin ( someone without unfiltered_html capability ), enter "This & That" into that text field and save the page.
Reload that page, and observe that attribute has now become This & That

Screenshots, screen recording, code snippet

No response

Environment info

WordPress 6.2

Please confirm that you have searched existing issues in the repo.

Yes

Please confirm that you have tested with all plugins deactivated except Gutenberg.

Yes

The text was updated successfully, but these errors were encountered:

ktmn · 2024-01-26T03:43:41Z

I was making a new issue but then found this, so I'm dumping some info that may be relevant here:

Possible culprit functions:

$input = '&';
var_dump([
	'input' => $input,
	'normalized' => wp_kses_normalize_entities($input),
	'serialized' => serialize_block_attributes($input),
	'mangled' => serialize_block_attributes(wp_kses_normalize_entities($input)),
]);

Outputs:

array(4) {
  ["input"]=> string(1) "&"
  ["normalized"]=>  string(5) "&amp;"
  ["serialized"]=> string(8) ""\u0026""
  ["mangled"]=> string(12) ""\u0026amp;""
}

Here's my reproduction instructions:

As an admin user (current_user_can('unfiltered_html')) create a post with a block whose attributes contain the character &. A core/paragraph block with the Block name Hello & world does the trick:

<!-- wp:paragraph {"metadata":{"name":"Hello \u0026 world"}} -->
<p>Test</p>
<!-- /wp:paragraph -->

The & character is \u0026 in the code editor.

As an editor user (!current_user_can('unfiltered_html')) open the post, change the paragraph text "Test" to something else, save.
Open the post for the 3rd time with any user, look in the code editor:

<!-- wp:paragraph {"metadata":{"name":"Hello \u0026amp; world"}} -->
<p>Test 2</p>
<!-- /wp:paragraph -->

\u0026 is now \u0026amp;, the Block name input now reads Hello & world instead of Hello & world.

Filter to modify unfiltered_html cap:

add_filter('user_has_cap', static function($allcaps, $caps, $args, $user) {
	if(in_array('editor', $user->roles, true)) {
		$allcaps['unfiltered_html'] = false;
	}
	return $allcaps;
}, 10, 4);

chrisvanpatten · 2024-05-07T14:00:23Z

Related to #20490. Continues to be an issue and has been now for several years.

TrevorMills · 2024-07-18T16:15:16Z

For anyone who stumbles onto this thread, here's how I ended up solving this for our client. The meat of it is this function, which aims to decode characters & and < if that's what the author geniunely meant to write while still stripping out script HTML ( meaning they still don't have the unfiltered_html capability ).

/**
 * This function is a recursive function that's part of our effort to allow some entities through, while preventing
 * others.
 *
 * We really want ACF to be able to save This & That as "This & That" into the DB, not "This &amp; That".
 *
 * Similarly, we really want ACF to be able to save Two < Three as "Two < Three", not "Two &lt; Three".
 *
 * But we don't want "&lt;script>console.log('foo');&lt;/script>" to be saved as "<script>console.log('foo')</script>"
 * because that opens up script injection.
 *
 * So, it's not enough to blindly decode all entities.
 *
 * This function is really only concerned with isolated &amp; &lt; and  &gt;, but not their numeric equivalents.  We
 * don't care about the numeric equivalents of &#38; &#60; and &#62; because if someone typed those in, that's them
 * really trying to get an encoded entity in there.  Everything else seems to be handled okay by wp_kses.
 *
 * Think of this function as being something to look for & or < or > that the author actually meant to include in
 * the text they are saving.  We want those to get into the DB as they are, not as encoded entities.
 *
 * @param mixed $props
 *
 * @return mixed
 */
function decode_isolated_entities_deep( $props ) {
	if ( is_array( $props ) ) {
		$props = array_map( 'decode_isolated_entities_deep', $props );
	} elseif ( is_object( $props ) ) {
		$props = (object) array_map( 'decode_isolated_entities_deep', (array) $props );
	} elseif ( is_string( $props ) ) {
		// First look for any occurrence of &amp; that is not following immediately by non-spaces and a semi-colon
		$props = preg_replace( '/&amp;(?!([#0-9a-zA-Z]+);)/', '&', $props );
		// Decode any &lt; that is not immediately followed by script or /script
		$props = preg_replace( '#&lt;(?!/?script)#', '<', $props );
		// Decode any &gt; that is not immediately preceded by script
		$props = preg_replace( '/(?!script)&gt;/', '>', $props );
	}
	return $props;
}

I ended up using this function for two use cases.

The first use case was wanting to write those intentional characters to the database so This & That isn't written to the DB as This & That. To achieve that, I added these filters:

if ( function_exists( 'current_user_can' ) && ! current_user_can( 'unfiltered_html' ) ) {
	// We're making a decision to NOT write encoded entities to the database for ACF field values.  I think this is best
	// decision because it means that if we need to find a post that has "This & That" for a particular field, then we
	// can just search for that, and not have to think about encoding the search parameter.  Beyond that though, this
	// ensures that editors don't see &amp; in an ACF text field if they save "This & That".
	//
	// This is only relevant for non-admins ( those without the unfiltered_html capability ).
	add_filter( 'acf/update_value', 'decode_isolated_entities_deep', 100 );
	add_filter( 'title_save_pre', 'decode_isolated_entities_deep', 100 );
	add_filter( 'content_save_pre', 'decode_isolated_entities_deep', 100 );
	add_filter( 'excerpt_save_pre', 'decode_isolated_entities_deep', 100 );
}

The second use case was when encoding the block attributes that get written as an HTML comment to be persist. This is important in case any sort of front-end hydration happens based on those attributes.

To solve this, I introduced the following function:

/**
 * In tracking down the &amp; issue within WordPress, this is where I landed.  This filter runs after all of the
 * pre_kses filters.  We're keeping those intact because we want to keep the job of stripping out invalid HTML out of
 * attributes.  Primary case is we don't want editors able to put <script> tags into a dangerouslyAllowHtml block, though
 * things like <a> and <strong> are still fine.
 *
 * The issue really just lives in the block attributes.  By the time we get to here, wp_kses has converted a single ampersand
 * into \u0026amp;, which decodes to &amp;, which will display exactly like that when passed into a prop.  A couple of other
 * characters ( namely < and > ) are similarly encoded.  This traces back to the serialize_block_attributes() function, which
 * special-cases these characters.
 *
 * So, how this filter works is we pass block attributes through decode_isolated_entities_deep() before passing then through
 * serialize_block_attributes() again to encode how WP wants them.  This will convert \u0026amp; to simply \u0026, which is what
 * we want.
 *
 * @param string         $content           Content to filter through KSES.
 * @param array[]|string $allowed_html      An array of allowed HTML elements and attributes,
 *                                          or a context name such as 'post'. See wp_kses_allowed_html()
 *                                          for the list of accepted context names.
 * @param string[]       $allowed_protocols Array of allowed URL protocols.
 *
 * @return string
 */
public function pre_kses_block_attributes( $content, $allowed_html, $allowed_protocols ) {
	if ( 'post' === $allowed_html ) {
		$content = preg_replace_callback( '/(<!-- wp:[^ ]+ )(.+?)( -->)/', function( $matches ) {
			$attributes = decode_isolated_entities_deep( json_decode( $matches[2], true ) );
			$encoded = serialize_block_attributes( $attributes );
			return "{$matches[1]}{$encoded}{$matches[3]}";
		}, $content );
	}
	return $content;
}

if ( function_exists( 'current_user_can' ) && ! current_user_can( 'unfiltered_html' ) ) {
	add_filter( 'pre_kses', 'pre_kses_block_attributes', 100, 3 );
}

Mamaduka added the Needs Technical Feedback Needs testing from a developer perspective. label Apr 25, 2023

jordesign added [Type] Bug An existing feature does not function as intended Needs Testing Needs further testing to be confirmed. labels Jul 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gutenberg for non-admins and HTML entities #50050

Gutenberg for non-admins and HTML entities #50050

TrevorMills commented Apr 24, 2023 •

edited

Loading

ktmn commented Jan 26, 2024

chrisvanpatten commented May 7, 2024

TrevorMills commented Jul 18, 2024 •

edited

Loading

Gutenberg for non-admins and HTML entities #50050

Gutenberg for non-admins and HTML entities #50050

Comments

TrevorMills commented Apr 24, 2023 • edited Loading

Description

Step-by-step reproduction instructions

Screenshots, screen recording, code snippet

Environment info

Please confirm that you have searched existing issues in the repo.

Please confirm that you have tested with all plugins deactivated except Gutenberg.

ktmn commented Jan 26, 2024

chrisvanpatten commented May 7, 2024

TrevorMills commented Jul 18, 2024 • edited Loading

TrevorMills commented Apr 24, 2023 •

edited

Loading

TrevorMills commented Jul 18, 2024 •

edited

Loading