Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block API: Consider encoding-normalized text as equivalent #11771

Merged
merged 2 commits into from
Nov 15, 2018

Conversation

aduth
Copy link
Member

@aduth aduth commented Nov 12, 2018

Fixes #9906

This pull request seeks to improve the block validation step to allow more leniency for effectively equivalent text encoded in varying forms.

The changes here were authored in such a way where there may be a slight performance benefit over master, both in a reduction of bundle size (an approximate 18% reduction gzipped on the blocks module) and in optimizing for an early return of equality if normalization (whitespace or encoding) is not necessary to determine equivalence of text sequences.

Implementation notes:

In the process of implementing further text normalization here, it was discovered that the underlying simple-html-tokenizer performs its own entities substitution when encountering text tokens in an HTML string. For the purposes of validation, this was considered to be redundant and was thus swapped with a stub entity parser in the included changes. Note that this is the change which enables the significant drop in bundle size. Note also as an aside that there's desire to consolidate to a single parser between the blocks parse and validator parse, so the use of simple-html-tokenizer may or may not persist far into the future.

Testing instructions:

Verify that block invalidation is not triggered by encoding variations.

For example, inserting the following HTML as the contents of a post (in Text Mode, Classic Editor Text tab, or directly in the database) should not be presented as an invalid block when next viewing the Visual Mode of the editor:

<!-- wp:paragraph -->
<p>This works. &#128517;</p>
<!-- /wp:paragraph -->

cc @MarkRH

@aduth aduth added [Feature] Block API API that allows to express the block paradigm. Framework Issues related to broader framework topics, especially as it relates to javascript [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f labels Nov 12, 2018
@aduth aduth requested a review from pento November 12, 2018 17:54
@aduth
Copy link
Member Author

aduth commented Nov 12, 2018

The current test failures are legitimate. In disabling simple-html-tokenizer's built-in entity normalization, it surfaces more differences, notably in attributes values, which aren't covered by the normalizations now taking place in isEquivalentTextTokens for text.

I'll need to think more on how best to address this, because the normalizations aren't strictly the same for text and attributes. Further, we have a few specific handlers on attribute value equivalence (e.g. class, style).

One option may be to switch back to relying on simple-html-tokenizer's EntityParser, but still substituting the implementation to apply our own decodeEntities. This would still fix the issue, and retain the bundle size reduction, but perhaps at some cost of performance; this is arguable though, as it only impacts encoded HTML.

@aduth
Copy link
Member Author

aduth commented Nov 13, 2018

One option may be to switch back to relying on simple-html-tokenizer's EntityParser, but still substituting the implementation to apply our own decodeEntities.

This is what I decided to do in the latest commit.

@youknowriad youknowriad modified the milestones: 4.4, 4.5 Nov 15, 2018
Copy link
Contributor

@youknowriad youknowriad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@youknowriad youknowriad modified the milestones: 4.5, 4.4 Nov 15, 2018
@youknowriad youknowriad merged commit 1237243 into master Nov 15, 2018
@youknowriad youknowriad deleted the fix/9906-emoji-validation branch November 15, 2018 16:25
vindl added a commit to Automattic/wp-calypso that referenced this pull request Nov 20, 2018

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
Related to this change in Gutenberg:
WordPress/gutenberg#11771
@aduth aduth added the [Feature] Block Validation/Deprecation Handling block validation to determine accuracy and deprecation label Jan 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] Block Validation/Deprecation Handling block validation to determine accuracy and deprecation [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f Framework Issues related to broader framework topics, especially as it relates to javascript
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gutenberg 3.8 - 4.1.1 Paragraph Block Thinks Modified Externally if an Emoji Is Used.
3 participants