Skip to content

Commit

Permalink
Avoid lossy html entities encoding by setting charset (#24645)
Browse files Browse the repository at this point in the history
This change specifies the content type and charset of the html passed into `DomDocument` as `utf-8`.

Replaces the `mb_convert_encoding` call which encodes `UTF-8` as `HTML-ENTITIES` before handing off to `DomDocument`.

This change avoids the need to later revert the encoding back to `UTF-8` afterwards using `mb_convert_encoding`. This secondary `mb_convert_encoding` call was converting not only the `UTF-8` characters that were converted earlier but also any pre-existing entity encoded html stored inside block content.

This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way [Jetpack is injecting](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L434) html into the [`data-image-description` attributes](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L485). 

There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation.

Co-authored-by: Bernie Reiter <ockham@raz.or.at>
  • Loading branch information
2 people authored and jorgefilipecosta committed Aug 19, 2020
1 parent 2b6f92c commit 4e692d1
Show file tree
Hide file tree
Showing 2 changed files with 17 additions and 7 deletions.
10 changes: 7 additions & 3 deletions lib/block-supports/index.php
Original file line number Diff line number Diff line change
Expand Up @@ -52,17 +52,21 @@ function gutenberg_apply_block_supports( $block_content, $block ) {
return $block_content;
}

// We need to wrap the block in order to handle UTF-8 properly.
$wrapper_left = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>';
$wrapper_right = '</body></html>';

$dom = new DOMDocument( '1.0', 'utf-8' );

// Suppress warnings from this method from polluting the front-end.
// @codingStandardsIgnoreStart
if ( ! @$dom->loadHTML( mb_convert_encoding( $block_content, 'HTML-ENTITIES', 'UTF-8' ), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_COMPACT ) ) {
if ( ! @$dom->loadHTML( $wrapper_left . $block_content . $wrapper_right , LIBXML_HTML_NODEFDTD | LIBXML_COMPACT ) ) {
// @codingStandardsIgnoreEnd
return $block_content;
}

$xpath = new DOMXPath( $dom );
$block_root = $xpath->query( '/*' )[0];
$block_root = $xpath->query( '/html/body/*' )[0];

if ( empty( $block_root ) ) {
return $block_content;
Expand All @@ -86,7 +90,7 @@ function gutenberg_apply_block_supports( $block_content, $block ) {
$block_root->setAttribute( 'style', esc_attr( implode( '; ', $new_styles ) . ';' ) );
}

return mb_convert_encoding( $dom->saveHtml(), 'UTF-8', 'HTML-ENTITIES' );
return str_replace( array( $wrapper_left, $wrapper_right ), '', $dom->saveHtml() );
}
add_filter( 'render_block', 'gutenberg_apply_block_supports', 10, 2 );

Expand Down
14 changes: 10 additions & 4 deletions phpunit/class-block-supported-styles-test.php
Original file line number Diff line number Diff line change
Expand Up @@ -72,10 +72,10 @@ private function get_attribute_from_block( $attribute, $block ) {
* @param string $block String of rendered block to check.
*/
private function get_content_from_block( $block ) {
$start_index = strpos( $block, '>' ) + 1;
$start_index = strpos( $block, '>' ) + 1; // First occurrence of '>'.
$split_arr = substr( $block, $start_index );
$end_index = strpos( $split_arr, '<' );
return substr( $split_arr, 0, $end_index );
$end_index = strrpos( $split_arr, '<' ); // Last occurrence of '<'.
return substr( $split_arr, 0, $end_index ); // String between first '>' and last '<'.
}

/**
Expand Down Expand Up @@ -117,7 +117,13 @@ private function assert_content_and_styles_and_classes_match( $block, $expected_
*
* @var string
*/
const BLOCK_CONTENT = 'Some non-Latin chärs to make sure DOM öperations don\'t mess them up: こんにちは';
const BLOCK_CONTENT = '
<p data-image-description="&lt;p&gt;Test!&lt;/p&gt;">Test</p>
<p>äöü</p>
<p>ß</p>
<p>系の家庭に</p>
<p>Example &lt;p&gt;Test!&lt;/p&gt;</p>
';

/**
* Example block markup string to test with.
Expand Down

0 comments on commit 4e692d1

Please sign in to comment.