Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support WordPress installs with non-UTF-8 charsets #855

Closed
westonruter opened this issue Jan 11, 2018 · 19 comments · Fixed by #3758 or #4296
Closed

Support WordPress installs with non-UTF-8 charsets #855

westonruter opened this issue Jan 11, 2018 · 19 comments · Fixed by #3758 or #4296
Assignees
Milestone

Comments

@westonruter
Copy link
Member

westonruter commented Jan 11, 2018

The AMP spec requires <meta charset="utf-8"> and this is naturally problematic for sites that still use a Latin1 charset. We need to do a few things:

  • For existing paired mode theme system, we'll need to use mb_convert_encoding() from the get_bloginfo('charset') to utf-8 when reading from the DB. We'll also need to forcibly add header( 'Content-Type: text/html; charset=utf-8' ) to override what WordPress is sending by default.
  • With canonical AMP, we'll be output buffering the response and so we could convert the encoding at the time of output. We'll still need to be wary of \AMP_DOM_Utils::get_dom_from_content() in how it is currently assuming UTF-8 content for the input.

https://github.com/Automattic/amp-wp/blob/e60cb152a50e2ffbf5504d41aee859ad1d0e0baa/includes/class-amp-theme-support.php#L192

@westonruter
Copy link
Member Author

I tried to fix this as part of #888 but I failed. My abandoned patch with lots of debug code: https://gist.github.com/westonruter/04d479e809409e1f12a5944701f6f24f

The problem came down to wptexturize() actually. When it inserts &#8220; and &#8221; for and respectively, these get added converted to UTF-8 encoded chars, and then mb_convert_encoding() is failing to map the curly quotes between encodings, it seems:

<?php
$opening_quote = '&#8220;';
$closing_quote = '&#8221;';
$output = html_entity_decode( $opening_quote . 'Hello World' . $closing_quote, ENT_QUOTES, 'UTF-8' );
$output = mb_convert_encoding( $output, 'ISO-8859-1', 'UTF-8' );
echo $output;

Output is:

00000000: 3f48 656c 6c6f 2057 6f72 6c64 3f         ?Hello World?

The left and right double quotes are both getting converted into ? due to an error apparently.

I wasn't able to prevent DOMDocument from converting the entities when loadHTML'ing either.

@schlessera
Copy link
Collaborator

schlessera commented Dec 18, 2019

I think this should be fixed now via #3758 and the automated conversion happening inside of Amp\AmpWP\Dom\Document.

@csossi
Copy link

csossi commented Feb 13, 2020

Please add QA instructions, @schlessera

@schlessera
Copy link
Collaborator

@csossi I fail to come up with an easy way to test this on the staging server, it would mean changing the default encoding of the server to test it.

@westonruter Do you have a suggestion on how we can test this in QA? Or do you think the automated tests in place are enough in this case?

@westonruter
Copy link
Member Author

@schlessera Yeah, I don't think there's a way to test this on staging. The only way to test would be to create a new WordPress install with a non-UTF-8 charset, populate content with non-ASCII characters, and then verify they pass through as UTF-8 without mojibake. So this would need to be developer QA, most likely.

@csossi
Copy link

csossi commented Feb 13, 2020

Moving to "Done"

@csossi csossi added the QA passed Has passed QA and is done label Feb 13, 2020
@westonruter
Copy link
Member Author

Was it QA'd though? I'd like @kienstra perhaps to check this in a dev environment.

@westonruter westonruter removed the QA passed Has passed QA and is done label Feb 13, 2020
@csossi
Copy link

csossi commented Feb 13, 2020

Ah, ok - moving back - @kienstra - can you take a look?

@kienstra
Copy link
Contributor

Sure, I'll test this after the sync

@kienstra
Copy link
Contributor

Testing

Hi @westonruter,

Sorry for the delay here.

It looks like a site with a charset of latin1 (ISO-8859-1) has HTML entities and other characters like ← that don't look the same in AMP.

Still, I don't know how meaningful this is, or how realistic these steps to reproduce are. Especially step 7 below.

There might not be many people who set the blog_charset option to ISO-8859-1. It looks like that can't be done in the UI, at least with the current WP.

This issue doesn't occur if get_bloginfo( 'charset' ) returns 'utf-8' and wp-config.php has define( 'DB_CHARSET', 'latin1' ).

It only occurs if the option was somehow changed to be ISO-8859-1.

Steps To Reproduce

  1. wp core download, or whatever you use to create a new WP instance
  2. Change wp-config-sample.php to wp-config.php, and enter:
define( 'DB_CHARSET', 'latin1' );
  1. Make any other changes to the config needed to connect to the DB
  2. wp db create
  3. wp core install --prompt # or however you install WP
  4. Log in to /wp-admin
  5. wp option update blog_charset ISO-8859-1 # latin1 charset
  6. wp plugin install amp --activate
  7. Create a new post with a Custom HTML block
  8. Enter some HTML entities like this in the block:
Some A Characters:
&#194; &#195; &#196; 
Paragraph:
&#182;
Reserved (non HTML entity):
®
  1. Save, and go to the front-end:

characters-appear-as-expected

  1. Select Transitional mode, and go to the AMP URL

  2. Expected: the characters look the same

  3. Actual: they're distorted:

diff

@westonruter
Copy link
Member Author

Thank you very much. This appears to not be working.

@westonruter westonruter reopened this Feb 14, 2020
@kienstra kienstra removed their assignment Feb 14, 2020
@schlessera
Copy link
Collaborator

Thanks for the review, @kienstra ! I'll use your steps to recreate on my end and investigate.

@schlessera
Copy link
Collaborator

@kienstra You mention in your instruction wp plugin install amp --activate... Does that mean you've tested this without the PR to test included, maybe?

@schlessera
Copy link
Collaborator

The problem seems to persist on develop branch, though...

@schlessera
Copy link
Collaborator

Tracking the debugging process here.

First of all, my database now shows the following:
Image 2020-02-15 at 3 06 39 PM
Image 2020-02-15 at 3 06 51 PM
Image 2020-02-15 at 3 07 07 PM

The non-Amp version shows the correct blog charset:
Image 2020-02-15 at 3 08 07 PM

Within the debugger, I can see that the adapt_encoding() method turns this post content:

Image 2020-02-15 at 3 14 24 PM

into this post content:

Image 2020-02-15 at 3 14 45 PM

The encoding properties in the generated DOMDocument all show "UTF-8":
Image 2020-02-15 at 3 18 07 PM

The attached meta tag is correct:
Image 2020-02-15 at 3 20 47 PM

So, the logic seems to correctly identify the charset, correctly converts it into UTF-8 and correctly parses it into the DOMDocument as UTF-8.

The post content within the DOM is correct as well:
Image 2020-02-15 at 3 22 59 PM

Given the above, and considering the fact that the tests pass, I assume for now that all of the logic in the DOMDocument abstraction works for converting the encoding automatically, but that WordPress Core itself sees its own charset still set to latin-1 and tries to correct that (and thus breaks the content which is now UTF-8).

I'll look into what WordPress does now.

@schlessera
Copy link
Collaborator

Another hint that WordPress is the one messing it up:
Image 2020-02-15 at 3 27 07 PM

@schlessera
Copy link
Collaborator

I think I found the culprit. We're not adapting the HTTP response header charset that is being sent out. So the browser converts from latin-1 to UTF-8 to correct:
Image 2020-02-15 at 3 29 52 PM

@schlessera
Copy link
Collaborator

Alright, enforcing the UTF-8 charset fixed it, I'll submit a PR.
Image 2020-02-15 at 4 05 48 PM

@schlessera
Copy link
Collaborator

Image 2020-02-15 at 4 09 02 PM
Image 2020-02-15 at 4 09 17 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants