Commit
Avoid lossy html entities encoding by setting charset (#24645)
This change specifies the content type and charset of the HTML passed into `DOMDocument` as `utf-8`. It replaces the `mb_convert_encoding` call that encoded `UTF-8` as `HTML-ENTITIES` before handing the markup off to `DOMDocument`, which removes the need to revert the encoding back to `UTF-8` afterwards with a second `mb_convert_encoding` call.

That second `mb_convert_encoding` call was converting back not only the `UTF-8` characters that had been entity-encoded earlier but also any pre-existing entity-encoded HTML stored inside block content.

This issue was originally raised in Automattic/wp-calypso#44897. I wasn't sure of the root cause at the time and originally thought it might be caused by the way [Jetpack is injecting](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L434) HTML into the [`data-image-description` attributes](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L485). The problem is more general than that: any encoded HTML entities that exist inside block content can be decoded, which breaks HTML validation.

Co-authored-by: Bernie Reiter <ockham@raz.or.at>
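As a rough illustration of the approach described in this commit message, the sketch below declares the charset in a wrapper document so that `DOMDocument` parses the fragment as UTF-8 without any entity round trip. It is not the code from the commit: the helper name `parse_fragment_as_utf8`, the wrapper markup, and the sample fragment are assumptions made for this example.

```php
<?php
// Hypothetical helper (an assumption for illustration, not the commit's code):
// parse an HTML fragment as UTF-8 by declaring the charset in a wrapper
// document, instead of converting the input to HTML-ENTITIES first.
function parse_fragment_as_utf8( string $fragment ): DOMDocument {
	$dom = new DOMDocument();

	// Collect libxml warnings internally so loose markup does not emit notices.
	$previous = libxml_use_internal_errors( true );

	$dom->loadHTML(
		'<!DOCTYPE html><html><head>'
		. '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
		. '</head><body>' . $fragment . '</body></html>'
	);

	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	return $dom;
}

// Sample fragment (made up): a non-ASCII character in the text plus
// pre-existing entity-encoded HTML inside an attribute.
$fragment = '<p data-image-description="&lt;p&gt;caption&lt;/p&gt;">Café</p>';
$dom      = parse_fragment_as_utf8( $fragment );
$p        = $dom->getElementsByTagName( 'p' )->item( 0 );

// The text node is read back as UTF-8, with no entity round trip needed.
echo $p->textContent, "\n";

// The attribute value is available in its decoded form at the DOM level.
echo $p->getAttribute( 'data-image-description' ), "\n";
```

Because the input string is never rewritten before parsing, there is no later step that has to guess which entities were introduced by the conversion and which were already present in the block content.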