Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add greater than symbol to attribute escaping #9963

Merged
merged 7 commits into from
Mar 18, 2019

Conversation

johngodley
Copy link
Contributor

Although > is a valid attribute character, it is used by wptexturize in WordPress to split HTML tokens. Without escaping wptexturize will incorrectly tokenize a string, causing problems.

For example, add test > thing as the alt tag for an image in Gutenberg, and then view the post:

<img src="kirby.jpg" alt="test > thing" class="wp-image-10"/>

Results in this HTML rendered in WordPress:

<img src="kirby.jpg" alt="test > thing&#8221; class=&#8221;wp-image-10&#8243;/>

Encoding in Gutenberg prevents this problem while being transparent to the Gutenberg visual UI. It does have the effect that if you add a > into a block's attribute in HTML mode it will be replaced by &gt;, and this matches the same behaviour of & being replaced with &amp;.

The problem likely exists for other block types where an attribute is output directly based on user input. For example, custom class names.

Fixes #9915
Fixes #8779

Note this won't fix any existing posts that have a > in an attribute. Re-saving these in Gutenberg should resolve the problem.

How has this been tested?

Additional unit test added for the > symbol, and existing unit tests modified.

It can also be manually verified that adding test > thing to an image's alt tag is displayed as-is in the Gutenberg UI, and results in test &gt; thing appearing in the HTML.

<img src="kirby.jpg" alt="test &gt; thing" class="wp-image-10"/>

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.

@johngodley johngodley added the [Package] HTML entities /packages/html-entities label Sep 17, 2018
@Soean Soean added the [Type] Bug An existing feature does not function as intended label Sep 17, 2018
Copy link
Member

@jorgefilipecosta jorgefilipecosta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR fixed the problem in my tests, and now it looks like we are doing the same the classic editor does 👍
I think we can merge this fix for now, but I wonder if this is not a bug that should be fixed in wptexturize.

@@ -295,13 +306,17 @@ export function escapeLessThan( value ) {
* "[...] the text cannot contain an ambiguous ampersand [...] must not contain
* any literal U+0022 QUOTATION MARK characters (")"
*
* Additionally we escape the greater than symbol, as this is used by wptexturize to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as @wordpress/element is concerned, it should have no knowledge of a wptexturize existing. Further, there's nothing about this module which implies any relationship to being used in the context of a post (i.e. it can be used to produce HTML which would be injected in an admin screen). These modules are meant to be independently useful without consideration being strongly bound to any one specific context. Holding ourselves honest to this guideline will prevent biasing implementation toward our specific needs.

Now, with that said, it may still be our best option moving forward, though I'm also wondering a few things:

  • Can this be fixed from wptexturize?
  • Can this be fixed by running a string replace on the result generated by a serialize call? Or do we need to be granular about the replacement occurring within an attribute value?
  • And if so, should we consider extension points to our handling of various parts of the serialization? An attribute value visitor, for example?

Considering WordPress' needs as foreign can help surface where these sorts of extension points might be broadly useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these modules are meant to be independently useful

I wasn't aware of that, but it's good to know. This did seem like a surely-this-is-too-easy kind of fix and I was expecting it to raise a question or two!

My first thought was to look at wptexturize. It's a monster of an old function and seems complicated to fix without triggering consequences somewhere. Saying that, I'm unfamiliar with the process, and don't know if Gutenberg is going to be tied to a particular base WP version etc. A fix there is certainly ideal, but tackling it in Gutenberg seems more immediately practical.

I then looked at fixing it inside the blocks themselves, but this places the burden on the block creator.

I looked at the serialize code. I did consider doing a string replace, but fixing a regex bug with another regex seemed like double trouble! It should be possible though, and we don't need to be granular about it - any unencoded > in an attribute will break the post. This could leave the core Gutenberg code to produce clean HTML, and we then have a 'texturizing' layer.

I also looked further into the serialization code and wondered if we could do it at a higher level - somehow traverse the block output and add the extra encoding to attributes. It got complicated, and I then came across escapeAttribute, which as solutions go seemed ideally magical.

By extension points do you mean hooks? That would work, and we could target different environments. If wptexturize is fixed in a future WordPress, the > encoding could be removed etc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By extension points do you mean hooks? That would work, and we could target different environments. If wptexturize is fixed in a future WordPress, the > encoding could be removed etc

In my mind I had pictured something like an AST / other tree visitor, which could be passed as an option to serialize, or in the creation of a custom serializer (e.g. expose serialize factory method to create / override default behavior).

Prior art:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WordPress/gutenberg-core I know there is a core ticket for this https://core.trac.wordpress.org/ticket/45387 It doesn't seem like it has progressed much. Shall we land this fix as a preventive measure while the core fix is ongoing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These modules are meant to be independently useful without consideration being strongly bound to any one specific context.

Using > in an attribute is valid HTML. Using &gt; in an attribute is also valid HTML. Important to this discussion: the two have the same meaning.

If we're aiming to be as general as possible, we should pick &gt; since it's useable in more contexts.

Copy link
Member

@aduth aduth Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using > in an attribute is valid HTML. Using &gt; in an attribute is also valid HTML. Important to this discussion: the two have the same meaning.

I don't disagree, but:

  1. The expressed intention of this function was to apply the minimal transformations necessarily to transform an attribute value to be conformant as per the specification of what constitutes a valid attribute value. Exceptions muddy expectations, both for the consumer and for the future maintainer.
  2. The bug still remains in WordPress that valid HTML becomes mangled upon save (whether from Gutenberg or elsewhere), but because there are few who feel comfortable touching the relevant texturize code, the impetus has been placed on integrators to implement workarounds, as we've done here.
  3. None of this matters, as the pull request has been approved on the pragmatic consideration that this be solved.

Copy link
Contributor

@mdawaffe mdawaffe Mar 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Thank you for your patience and for humoring me :)

I wanted to use this package elsewhere and found I could not because of the details of its implementation. In attempting to understand why, I found this ticket. My comment was meant to be a springboard into that discussion elsewhere, but I instead derailed the conversation. My apologies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdawaffe Not at all derailed, certainly not. I think I'd stressed far too much the idea of being "generic" in my original comment, and it was a good prompt to reevaluate what I'd meant in terms of the stated purpose of a module / function.

@mtias mtias added this to the 4.6 milestone Nov 22, 2018
@johngodley johngodley force-pushed the fix/texturize-encoding branch from d675a2b to 3d339ca Compare November 23, 2018 09:15
@johngodley
Copy link
Contributor Author

Thanks for the nudge @nosolosw !

If the core ticket gets any movement then that may ultimately be the preferred route.

If not then based on Andrew's feedback it seems like a more comprehensive solution is preferred in the long run.

However, if this PR makes sense for the immediate future then I've rebased and fixed it up so it's working again. I can also create another ticket with the desired plans.

@@ -64,7 +75,7 @@ export function escapeLessThan( value ) {
* @return {string} Escaped attribute value.
*/
export function escapeAttribute( value ) {
return escapeQuotationMark( escapeAmpersand( value ) );
return escapeGreaterThan( escapeQuotationMark( escapeAmpersand( value ) ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a comment explaining why it diverges from HTML spec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I think it got lost in the rebase. Added!

@mtias mtias modified the milestones: 4.6, 4.7 Nov 29, 2018
@youknowriad youknowriad requested a review from aduth November 30, 2018 08:12
aduth
aduth previously requested changes Nov 30, 2018
packages/escape-html/src/index.js Show resolved Hide resolved
*
* @return {string} Escaped string.
*/
export function escapeGreaterThan( value ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best if we avoided exposing this on the public interface of the module, to avoid committing ourselves to future maintenance if it's arguably a workaround for a WordPress quirk.

Noting that it's exported for use by testing, alternatives could include one of either:

  • Moving it to a separate file from which it is the default export, but is imported and then not subsequently exported by index.js
  • Naming it as an unstable API __unstableEscapeGreaterThan

*
* @return {string} Escaped string.
*/
export function __unstableEscapeGreaterThan( value ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we even exposing it? If you're exposing it for tests, I'd recommend making a separate file and not export it in index.js.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: #9963 (comment)

@mcsf
Copy link
Contributor

mcsf commented Dec 9, 2018

Punting, as we seemingly haven't reached consensus yet.

@gziolo
Copy link
Member

gziolo commented Jan 31, 2019

Punting, as we seemingly haven't reached consensus yet.

What are next steps? Should we land it or close it? What about the issues that are opened and this PR tries to resolve?

@johngodley
Copy link
Contributor Author

johngodley commented Jan 31, 2019

What are next steps? Should we land it or close it?

I have some changes to address the points about exposing unstable code. Once I've got my e2e tests working properly I'll add the changes and if it's still useful then hopefully it will push the PR forward.

@johngodley johngodley force-pushed the fix/texturize-encoding branch from 350e10a to 06f9027 Compare February 1, 2019 09:05
@johngodley
Copy link
Contributor Author

Ok, I’ve updated this so the unstable code is also in a separate file. I’ve left the tests in the same place as they seem to make more sense there. I’ve also included changes for the package version and history (pending a real date)

@gziolo gziolo removed this from the 5.2 (Gutenberg) milestone Mar 4, 2019
@youknowriad youknowriad removed this from the 5.3 (Gutenberg) milestone Mar 18, 2019
@westonruter
Copy link
Member

Has the code here been reverted in any way? When adding &gt; or &lt; into an attribute value, the serializer forces it to be > and < respectively.

<!-- wp:paragraph -->
<p><abbr title="1 > 2">False</abbr> and <abbr title="1 < 2">True</abbr></p>
<!-- /wp:paragraph -->

This gets rendered by WordPress as:

<p><abbr title="1 > 2&#8243;>False</abbr> and <abbr title="1 < 2">True</abbr></p>

The “False and” text is erroneously consumed be the attribute:

image

In this case, it seems to be a problem with wptexturize(), as it is converting the attribute closing " into &#8243;.

But if the block serializer output > as &gt; then this symptom could be avoided.

@aduth
Copy link
Member

aduth commented May 14, 2019

@westonruter Nothing more recent has occurred here to my knowledge. Could you create an issue which includes version information and sample code? I might wonder how your abbr elements are generated could have an impact on the serialization.

@westonruter
Copy link
Member

@aduth cheers. Added here: #15636.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Package] HTML entities /packages/html-entities [Type] Bug An existing feature does not function as intended
Projects
None yet