Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gutenberg for non-admins and HTML entities #50050

Open
TrevorMills opened this issue Apr 24, 2023 · 3 comments
Open

Gutenberg for non-admins and HTML entities #50050

TrevorMills opened this issue Apr 24, 2023 · 3 comments
Labels
Needs Technical Feedback Needs testing from a developer perspective. Needs Testing Needs further testing to be confirmed. [Type] Bug An existing feature does not function as intended

Comments

@TrevorMills
Copy link

TrevorMills commented Apr 24, 2023

Description

As a non-administrator ( someone without unfiltered_html capability ), when I save a prop to a React based component, some characters are getting converted to their encoded versions - & becomes &, > becomes >.

I've verified this on a vanilla WordPress installation - no other plugins and the default twentytwentythree theme.

Consider the following simple component, which eventually becomes a block:

import React from 'react';

export interface IMyComponentProps {
    title: string;
}

/**
 * MyComponent example
 * @block
 */
const MyComponent: React.FC<IMyComponentProps> = ({ title }) => {
    return <div className="my-component">
        <h2>{title}</h2>
    </div>;
};

export default MyComponent;

I work in a system that allows such a component to be turned into block automatically, and the block looks like this:

image

If a user without unfiltered_html capability saves such a post, what gets written to the DB is the following:

<!-- wp:calian-com-components/calian-my-component {"title":"This \u0026amp; That"} -->
<div class="my-component"><h2>This &amp; That</h2></div>
<!-- /wp:calian-com-components/calian-my-component -->

In the attributes for the block, the ampersand ends up getting encoded as \u0026amp;. That's a raw value that gets passed along as attributes, so when the page refreshes in the back end, that ends up showing up as &amp;. The immediate fallout from that is that the "Attempt Block Recovery" shows up because of the discrepancy between the &amp; that's written to the database and the &amp;amp; that the block now wants to render:

image

And once you recover the block, that &amp; shows up in the prop for the block:

image

I consider this a bug because it's changing the prop that gets passed to the React component from a & to &amp;, causing the Attempt Block Recovery, and looking ugly for the non-admin users. React components that accept props and render out markup shouldn't have to worry about HTML decoding the props it receives.

Further, I wonder about the reasoning behind writing HTML entities to the database at all, given that recommended practice is to put all text from the DB through escaping filters before displaying them. This happens for Advanced Custom Fields when I try to save & in text fields. I'm sure that's a can of worms, and I am not asking for any change there.

Step-by-step reproduction instructions

  1. Create a simple block that has a text field property controlling one of its attributes
  2. As a non-admin ( someone without unfiltered_html capability ), enter "This & That" into that text field and save the page.
  3. Reload that page, and observe that attribute has now become This &amp; That

Screenshots, screen recording, code snippet

No response

Environment info

  • WordPress 6.2

Please confirm that you have searched existing issues in the repo.

Yes

Please confirm that you have tested with all plugins deactivated except Gutenberg.

Yes

@Mamaduka Mamaduka added the Needs Technical Feedback Needs testing from a developer perspective. label Apr 25, 2023
@jordesign jordesign added [Type] Bug An existing feature does not function as intended Needs Testing Needs further testing to be confirmed. labels Jul 20, 2023
@ktmn
Copy link

ktmn commented Jan 26, 2024

I was making a new issue but then found this, so I'm dumping some info that may be relevant here:

Possible culprit functions:

$input = '&';
var_dump([
	'input' => $input,
	'normalized' => wp_kses_normalize_entities($input),
	'serialized' => serialize_block_attributes($input),
	'mangled' => serialize_block_attributes(wp_kses_normalize_entities($input)),
]);

Outputs:

array(4) {
  ["input"]=> string(1) "&"
  ["normalized"]=>  string(5) "&amp;"
  ["serialized"]=> string(8) ""\u0026""
  ["mangled"]=> string(12) ""\u0026amp;""
}

Here's my reproduction instructions:

  1. As an admin user (current_user_can('unfiltered_html')) create a post with a block whose attributes contain the character &. A core/paragraph block with the Block name Hello & world does the trick:
<!-- wp:paragraph {"metadata":{"name":"Hello \u0026 world"}} -->
<p>Test</p>
<!-- /wp:paragraph -->

The & character is \u0026 in the code editor.

  1. As an editor user (!current_user_can('unfiltered_html')) open the post, change the paragraph text "Test" to something else, save.
  2. Open the post for the 3rd time with any user, look in the code editor:
<!-- wp:paragraph {"metadata":{"name":"Hello \u0026amp; world"}} -->
<p>Test 2</p>
<!-- /wp:paragraph -->

\u0026 is now \u0026amp;, the Block name input now reads Hello &amp; world instead of Hello & world.

Filter to modify unfiltered_html cap:

add_filter('user_has_cap', static function($allcaps, $caps, $args, $user) {
	if(in_array('editor', $user->roles, true)) {
		$allcaps['unfiltered_html'] = false;
	}
	return $allcaps;
}, 10, 4);

@chrisvanpatten
Copy link
Contributor

Related to #20490. Continues to be an issue and has been now for several years.

@TrevorMills
Copy link
Author

TrevorMills commented Jul 18, 2024

For anyone who stumbles onto this thread, here's how I ended up solving this for our client. The meat of it is this function, which aims to decode characters & and < if that's what the author geniunely meant to write while still stripping out script HTML ( meaning they still don't have the unfiltered_html capability ).

/**
 * This function is a recursive function that's part of our effort to allow some entities through, while preventing
 * others.
 *
 * We really want ACF to be able to save This & That as "This & That" into the DB, not "This &amp; That".
 *
 * Similarly, we really want ACF to be able to save Two < Three as "Two < Three", not "Two &lt; Three".
 *
 * But we don't want "&lt;script>console.log('foo');&lt;/script>" to be saved as "<script>console.log('foo')</script>"
 * because that opens up script injection.
 *
 * So, it's not enough to blindly decode all entities.
 *
 * This function is really only concerned with isolated &amp; &lt; and  &gt;, but not their numeric equivalents.  We
 * don't care about the numeric equivalents of &#38; &#60; and &#62; because if someone typed those in, that's them
 * really trying to get an encoded entity in there.  Everything else seems to be handled okay by wp_kses.
 *
 * Think of this function as being something to look for & or < or > that the author actually meant to include in
 * the text they are saving.  We want those to get into the DB as they are, not as encoded entities.
 *
 * @param mixed $props
 *
 * @return mixed
 */
function decode_isolated_entities_deep( $props ) {
	if ( is_array( $props ) ) {
		$props = array_map( 'decode_isolated_entities_deep', $props );
	} elseif ( is_object( $props ) ) {
		$props = (object) array_map( 'decode_isolated_entities_deep', (array) $props );
	} elseif ( is_string( $props ) ) {
		// First look for any occurrence of &amp; that is not following immediately by non-spaces and a semi-colon
		$props = preg_replace( '/&amp;(?!([#0-9a-zA-Z]+);)/', '&', $props );
		// Decode any &lt; that is not immediately followed by script or /script
		$props = preg_replace( '#&lt;(?!/?script)#', '<', $props );
		// Decode any &gt; that is not immediately preceded by script
		$props = preg_replace( '/(?!script)&gt;/', '>', $props );
	}
	return $props;
}

I ended up using this function for two use cases.

The first use case was wanting to write those intentional characters to the database so This & That isn't written to the DB as This &amp; That. To achieve that, I added these filters:

if ( function_exists( 'current_user_can' ) && ! current_user_can( 'unfiltered_html' ) ) {
	// We're making a decision to NOT write encoded entities to the database for ACF field values.  I think this is best
	// decision because it means that if we need to find a post that has "This & That" for a particular field, then we
	// can just search for that, and not have to think about encoding the search parameter.  Beyond that though, this
	// ensures that editors don't see &amp; in an ACF text field if they save "This & That".
	//
	// This is only relevant for non-admins ( those without the unfiltered_html capability ).
	add_filter( 'acf/update_value', 'decode_isolated_entities_deep', 100 );
	add_filter( 'title_save_pre', 'decode_isolated_entities_deep', 100 );
	add_filter( 'content_save_pre', 'decode_isolated_entities_deep', 100 );
	add_filter( 'excerpt_save_pre', 'decode_isolated_entities_deep', 100 );
}

The second use case was when encoding the block attributes that get written as an HTML comment to be persist. This is important in case any sort of front-end hydration happens based on those attributes.

To solve this, I introduced the following function:

/**
 * In tracking down the &amp; issue within WordPress, this is where I landed.  This filter runs after all of the
 * pre_kses filters.  We're keeping those intact because we want to keep the job of stripping out invalid HTML out of
 * attributes.  Primary case is we don't want editors able to put <script> tags into a dangerouslyAllowHtml block, though
 * things like <a> and <strong> are still fine.
 *
 * The issue really just lives in the block attributes.  By the time we get to here, wp_kses has converted a single ampersand
 * into \u0026amp;, which decodes to &amp;, which will display exactly like that when passed into a prop.  A couple of other
 * characters ( namely < and > ) are similarly encoded.  This traces back to the serialize_block_attributes() function, which
 * special-cases these characters.
 *
 * So, how this filter works is we pass block attributes through decode_isolated_entities_deep() before passing then through
 * serialize_block_attributes() again to encode how WP wants them.  This will convert \u0026amp; to simply \u0026, which is what
 * we want.
 *
 * @param string         $content           Content to filter through KSES.
 * @param array[]|string $allowed_html      An array of allowed HTML elements and attributes,
 *                                          or a context name such as 'post'. See wp_kses_allowed_html()
 *                                          for the list of accepted context names.
 * @param string[]       $allowed_protocols Array of allowed URL protocols.
 *
 * @return string
 */
public function pre_kses_block_attributes( $content, $allowed_html, $allowed_protocols ) {
	if ( 'post' === $allowed_html ) {
		$content = preg_replace_callback( '/(<!-- wp:[^ ]+ )(.+?)( -->)/', function( $matches ) {
			$attributes = decode_isolated_entities_deep( json_decode( $matches[2], true ) );
			$encoded = serialize_block_attributes( $attributes );
			return "{$matches[1]}{$encoded}{$matches[3]}";
		}, $content );
	}
	return $content;
}

if ( function_exists( 'current_user_can' ) && ! current_user_can( 'unfiltered_html' ) ) {
	add_filter( 'pre_kses', 'pre_kses_block_attributes', 100, 3 );
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Technical Feedback Needs testing from a developer perspective. Needs Testing Needs further testing to be confirmed. [Type] Bug An existing feature does not function as intended
Projects
None yet
Development

No branches or pull requests

5 participants