Add paste schema (fix various issues, simplify) #5966

ellatrix · 2018-04-03T22:56:05Z

Description

This branch simplifies whitelisting and reduces the number of filters. The idea is that there is one schema describing all possible paths that a pasted node tree can have. If a node deviates from what the schema allows, it should be unwrapped or deleted.

Why add a schema with an infinite number of paths? The whitelisting we had before was too simplistic: it lacked context. For example, a list (OL) can only contain list items (LI), a figure caption can only be part of a figure, etc. By generalising, the schema also takes over a number of filters we had before, e.g. stripping text nodes out of tables where they don't belong and stripping comments, and it incorporates attribute stripping as well.

Instead of looping three times over the whole pasted node tree, we now loop only twice. The first pass filters and adjusts broken HTML, from the bottom of the tree to the top; the second removes invalid HTML according to the schema, from the top of the tree to the bottom.
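To make the second, top-down pass concrete, here is a minimal sketch of schema-driven cleanup. It is not the code from this branch: the schema contents, the `cleanNodes` name, and the plain `{ name, children }` objects standing in for DOM nodes are all assumptions chosen so the example runs without a DOM.

```javascript
// Hypothetical schema: keys are allowed node names, `children` says what each may contain.
const schema = {
	'#text': {},
	ol: { children: { li: { children: { '#text': {} } } } },
};

// Top-down pass over plain { name, children } objects standing in for DOM nodes.
function cleanNodes( nodes, allowed ) {
	return nodes.reduce( ( result, node ) => {
		const rule = allowed[ node.name ];
		const children = node.children || [];

		if ( rule ) {
			// Valid at this position: keep it, then validate its children against its own rule.
			const cleaned = cleanNodes( children, rule.children || {} );
			return [ ...result, { ...node, children: cleaned } ];
		}

		// Invalid at this position: unwrap, so the children are re-checked against the
		// parent's rule. A childless node (stray text, a comment) simply disappears.
		return [ ...result, ...cleanNodes( children, allowed ) ];
	}, [] );
}

const pasted = [
	{ name: 'ol', children: [
		{ name: '#text', value: 'stray' },
		{ name: 'li', children: [ { name: 'span', children: [ { name: '#text', value: 'item' } ] } ] },
	] },
	{ name: 'figcaption', children: [ { name: '#text', value: 'loose caption' } ] },
];

const cleanedTree = cleanNodes( pasted, schema );
```

In this sketch the stray text inside the list is dropped, the span inside the list item is unwrapped, and the loose figcaption is unwrapped so that only its text remains at the top level.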

Another big difference from master is that the schema is now part of the transforms that blocks register, not of the raw handler module itself. This is quite a nice separation and it also enables block authors to register more schemas to be taken into account when converting old content.

Also new is the selector property on transforms. In a lot of cases it can be used instead of the isMatch( node ) function.

I also got rid of the isInvalidInline filter, which was looping through all nodes inside a loop. 🙈

Additionally, I tried to simplify and optimize some of the rest of the code, and added some doc blocks.

How Has This Been Tested?

Paste tests should pass.

Fixes the case where a loose figcaption is kept and put in a classic block. The expected result is for the figcaption element to be unwrapped. Try pasting from https://www.nytimes.com/2018/03/28/magazine/poem-the-nod.html.

Checklist:

My code is tested.
My code follows the WordPress code style.
My code has proper inline documentation.

ellatrix · 2018-04-11T15:58:23Z

Rebased.

mcsf

This is big and there's quite a bit to review, so I'll be making other passes. :) That said, it's looking quite nice; I'm all for simplifying raw-handling.

Concerning your testing notes, copying the Times' poem only works if I copy the poem itself (with title), but if I copy a larger chunk of the page, the poem is missing:

Copied	Pasted

mcsf · 2018-04-13T11:18:11Z

utils/dom.js

@@ -441,3 +441,41 @@ export function remove( node ) {
 export function insertAfter( newNode, referenceNode ) {
 	referenceNode.parentNode.insertBefore( newNode, referenceNode.nextSibling );
 }
+
+/**
+ * Unwrap the givin node. This means any child nodes are moved to the parent.


s/givin/given

mcsf · 2018-04-13T11:40:56Z

blocks/api/raw-handling/utils.js

-		whitelist[ tag ].attributes &&
-		whitelist[ tag ].attributes.indexOf( attribute ) !== -1
-	);
+export function getContentSchema( { iframe } = { iframe: true } ) {


This function signature is tricky to me, because of the following table:

getContentSchema(): iframe equals true (expectable)

getContentSchema( { foo: 'bar', iframe: val } ): iframe equals val (expectable)

getContentSchema( { foo: 'bar' } ): iframe equals false (...huh?)

A fix would be to write the signature as getContentSchema( { iframe = true } ) {, though that requires passing an object (or false) on every call. Addressing both issues, getContentSchema( { iframe = true } = {} ) is also possible, though more cryptic. :)
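The three signatures discussed above can be compared side by side. This is only an illustrative sketch: the function names `original`, `perProperty` and `combined` are made up, and each function just returns the destructured value. Strictly speaking, the third row of the table yields `undefined` rather than `false`, which is still falsy.

```javascript
// Signature from the diff: the default applies only when no argument is passed at all.
function original( { iframe } = { iframe: true } ) {
	return iframe;
}

// Per-property default, but the argument itself becomes mandatory.
function perProperty( { iframe = true } ) {
	return iframe;
}

// Both defaults: the argument may be omitted, and so may the property.
function combined( { iframe = true } = {} ) {
	return iframe;
}

console.log( original() ); // true
console.log( original( { foo: 'bar' } ) ); // undefined
console.log( perProperty( { foo: 'bar' } ) ); // true
console.log( combined() ); // true
console.log( combined( { foo: 'bar' } ) ); // true
console.log( combined( { iframe: false } ) ); // false
```

Calling `perProperty()` with no argument throws a TypeError, because `undefined` cannot be destructured.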

mcsf · 2018-04-13T11:52:37Z

blocks/api/raw-handling/phrasing-content-reducer.js

+
+	if ( node.nodeName === 'SPAN' ) {
+		const fontWeight = node.style.fontWeight;
+		const fontStyle = node.style.fontStyle;


Minor: const { fontWeight, fontStyle } = node.style;

I'll adjust. Just copied the old code.

ellatrix · 2018-04-13T13:09:24Z

Good catch about copying the whole page. The issue is that the poem is wrapped in a figure, so we're dropping anything we don't expect in there. I'll adjust so we require embedded content in a figure.

ellatrix · 2018-04-13T13:44:48Z

All issues addressed.

ellatrix · 2018-04-13T13:46:46Z

I'll add a test for 315819d.

greatislander · 2018-04-13T18:20:26Z

@iseulde As per Slack discussion with @aduth, I'm wondering if this refactor supports the following use case:

In Classic Editor, I have a custom div with a class (<div class="special-block">) which can contain various elements. I've created a custom block and a transform which will convert any instances of these divs to InnerBlocks. What happens with Gutenberg 2.6 is that given this input:

<div class="special-block">
<p>Some text.</p>
</div>

When I select "Convert to Blocks" in Gutenberg, the wrapper div is stripped:

<!-- wp:paragraph -->
<p>Some text.</p>
<!-- /wp:paragraph -->

@aduth mentioned that this is probably intentional behaviour within the raw handler, so as to remove excess markup from Word, Google Docs, etc. I'm wondering if there is (or could be) a way of whitelisting certain custom elements (e.g., don't strip <div class="special-element"> if a transform is defined which looks for it). Thanks in advance, happy to provide further information if needed.

ellatrix · 2018-04-13T19:54:18Z

@greatislander That's a great question! We don't currently support that in master, nor did I add support in this branch, but I agree it's something that we should think about. I think this PR will help make that easier to do. I can imagine that we end up moving every schema item to the transforms themselves.

Would you mind creating a separate issue for this? Once this is merged I'll look into it.

ellatrix · 2018-04-13T19:58:24Z

Actually, since this is unlikely to merge this week, I might end up playing with splitting the schema over the weekend...

greatislander · 2018-04-13T22:10:00Z

Sure thing!

ellatrix · 2018-04-16T13:58:59Z

So the idea is that instead of having "matchers" like ( node ) => /H\d/.test( node.nodeName ) we'd let the blocks add all the pieces of schema. Working on this now... One difficult point is figuring out which transform to use if there could be multiple schemas used... I was thinking we could register them as selectors like div.special-block: { /* schema */ }.

greatislander · 2018-04-16T18:00:09Z

@iseulde That sounds excellent. One related issue I opened previously was #6020. I don't know how much that is within the scope of this project, but thought I'd mention it.

greatislander · 2018-04-16T18:00:54Z

Specific use case for #6020: the ability to migrate the contents of a shortcode as InnerBlocks when the shortcode contains multiple elements. So, a shortcode transform that handles InnerBlocks.

jasmussen · 2018-04-17T06:46:05Z

Very probably unrelated to this branch, so apologies, but I saw mention of shortcodes and pasting, and I have a question about that. Right now anything written in square brackets is detected as a shortcode. For example, the following breaks into three blocks, one of them a shortcode block:

This is some text, and [this is some text in brackets].

This is not the end of the world as brackets aren't heavily used typographically, but it's also not ideal. Is there a way we can enhance the shortcode detection regex to look for shortcodes of a specific pattern and extract only those? Even if we didn't detect all shortcodes, that would probably be okay since shortcodes work inside normal paragraphs too.

ellatrix · 2018-04-17T08:17:01Z

@jasmussen I'll look into this after this PR. This one doesn't really touch shortcodes at all, but it's all good to keep in mind while redoing this, thanks. Pushing to this branch later today with the ability to extend, just need to clean up the mess. :)

ellatrix · 2018-04-17T08:26:30Z

An issue I'm having with moving all the schemas to the blocks is that the tests now have an implicit dependency on the blocks...

greatislander · 2018-04-17T21:20:33Z

@iseulde Works for me! This is fantastic. Thanks for taking my suggestion and running with it 👍

ellatrix · 2018-04-18T18:17:26Z

I think this is ready for another review.

ellatrix · 2018-04-19T15:57:34Z

@aduth Added a Markdown integration test in dceb20e. Note that this runs through the serialiser, so the result is "beautiful". But as you can see the list block can parse the list items correctly. Hope this helps, having an extra integration test can't hurt anyway. :)

ellatrix · 2018-04-19T16:07:48Z

Rebased.

mcsf

This is good work, and the added tests, including integration tests, are welcome. This is a partial round of feedback; I'll circle back soon.

mcsf · 2018-04-19T17:22:24Z

blocks/api/raw-handling/embedded-content-reducer.js

+	}
+
+	return schema.figure.children.hasOwnProperty( tag );
+}


At least upon first review, the relationship between figure/figcaption and the idea of embedded content isn't clear when reading the @see reference. I might change my mind as I keep reviewing.

Edit: tests like https://github.com/WordPress/gutenberg/pull/5966/files#diff-5be16f01c683c067fee5722d0851f860R17 help understand, but maybe we can make the goal or premises of this clearer.

I'll add more information here.

mcsf · 2018-04-19T17:31:18Z

blocks/api/raw-handling/index.js

+		.map( ( transform ) => ( {
+			isMatch: ( node ) => transform.selector && node.matches( transform.selector ),
+			...transform,
+		} ) );


I appreciate the FP/immutable thinking, but I'm wondering if we can be a bit more efficient here, e.g.

.map( ( transform ) => {
	if ( transform.isMatch ) {
		return transform;
	}
	return {
		...transform,
		isMatch: ( node ) => …,
	};
} )
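A runnable variant of this suggestion might look as follows. It is a sketch under assumptions: the helper name `withIsMatch` is invented, and the fallback body is taken from the diff above, with `Boolean()` added so that the result is always `true` or `false`.

```javascript
// Give every raw transform an isMatch function, reusing the original object
// when the block author already supplied one.
function withIsMatch( transforms ) {
	return transforms.map( ( transform ) => {
		if ( transform.isMatch ) {
			return transform;
		}

		return {
			...transform,
			// Fall back to the selector; a transform with neither never matches.
			isMatch: ( node ) =>
				Boolean( transform.selector ) && node.matches( transform.selector ),
		};
	} );
}

// Usage with a stand-in for a DOM element that only matches 'h2'.
const fakeHeading = { matches: ( selector ) => selector === 'h2' };
const custom = { blockName: 'example/custom', isMatch: () => true };
const prepared = withIsMatch( [
	{ blockName: 'example/heading', selector: 'h2' },
	custom,
	{ blockName: 'example/no-selector' },
] );
```

Transforms that already define `isMatch` are returned untouched, so no extra object or closure is created for them.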

mcsf · 2018-04-19T17:37:35Z

blocks/api/raw-handling/index.js

+		if ( ! canUserUseUnfilteredHTML ) {
+			filters.unshift( ( node ) =>
+				node.nodeName === 'iframe' && unwrap( node )
+			);


Should this be a filter on its own, with a name, etc.?

Sure, it can be. :)

mcsf · 2018-04-19T17:43:23Z

blocks/api/raw-handling/index.js

-				);
+		return Array.from( doc.body.children ).map( ( node ) => {
+			const { transform, blockName } =
+				find( rawTransformations, ( { isMatch } ) => isMatch( node ) );


How is the const assignment destructured if find returns undefined?

Related: later, what are we creating in createBlock( blockName, … ) if we didn't find a transformation?

This should never happen as there will not be any top level tags left that are not defined in schemas. That said, better to be prudent, and not error? Maybe we can add a doing it wrong message instead.

Adjusted so that if someone forgets to add a selector:

A block registered a raw transformation schema for H2 but did not match it. Make sure there is a selector or isMatch property that can match the schema.
Sanitized HTML: <h2>Issue Overview</h2>
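The concern above is that destructuring `undefined` throws a TypeError. A minimal sketch of a guarded lookup follows, assuming a hypothetical `findRawTransform` helper and using `Array.prototype.find` in place of the `find` helper from the diff; the warning text is abbreviated and is not the exact message shown above.

```javascript
// Look up the first matching raw transform, warning instead of throwing
// when nothing matches the node.
function findRawTransform( rawTransforms, node ) {
	const match = rawTransforms.find( ( { isMatch } ) => isMatch( node ) );

	if ( ! match ) {
		// Destructuring `undefined` here would throw, so bail out early.
		console.warn(
			'A raw transformation schema allowed ' + node.nodeName +
			' but no transform matched it.'
		);
		return null;
	}

	const { transform, blockName } = match;
	return { transform, blockName };
}

// Usage with stand-ins for DOM nodes.
const rawTransforms = [
	{
		blockName: 'example/paragraph',
		isMatch: ( node ) => node.nodeName === 'P',
		transform: ( node ) => node,
	},
];
const found = findRawTransform( rawTransforms, { nodeName: 'P' } );
const missing = findRawTransform( rawTransforms, { nodeName: 'H2' } );
```

Here `found.blockName` is 'example/paragraph' and `missing` is `null`, so the caller can decide what to do with unmatched nodes.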

mcsf · 2018-04-19T17:49:14Z

blocks/api/raw-handling/is-inline-content.js

+
+/**
+ * An array of tag groups used by isInlineForTag function.
+ * If tagName and nodeName are present in the same group, the node should be treated as inline.


The references to tagName and nodeName don't make sense here. :)

I copied this code from utils, didn't write this myself. No excuse though. I'll see what I can improve.

mcsf · 2018-04-19T17:56:09Z

blocks/api/raw-handling/is-inline-content.js

+	return nodes.every( ( node ) =>
+		isInline( node, tagName ) && deepCheck( Array.from( node.children ), tagName )
+	);
+}


Totally minor and tangent: this can be abstracted as a general tree traversal predicate — rough example:

deepCheck( nodes, predicate ) {
	return nodes.every( ( node ) =>
		predicate( node ) && deepCheck( Array.from( node.children ), predicate )
	);
}

I only bring this up because other pieces of the project use some form of tree-like manipulation (e.g. buildTermsTree) and having some place to keep these more generic functions could be nice in the long term.

Not sure I understand from the code example but will have a look.

I wouldn't care about this in this PR, or even in the next. :)
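For reference, the generic predicate traversal suggested in this thread could be written as a standalone function. This is a sketch only; the node objects and the `isInlineNode` predicate are hypothetical stand-ins for DOM elements and for the isInline check in the diff.

```javascript
// True only if the predicate holds for every node and all of its descendants.
function deepCheck( nodes, predicate ) {
	return nodes.every( ( node ) =>
		predicate( node ) && deepCheck( Array.from( node.children ), predicate )
	);
}

// Hypothetical predicate: treat only these node names as inline.
const inlineNames = [ 'STRONG', 'EM', 'A' ];
const isInlineNode = ( node ) => inlineNames.includes( node.nodeName );

const allInline = [
	{ nodeName: 'STRONG', children: [ { nodeName: 'EM', children: [] } ] },
];
const containsBlock = [
	{ nodeName: 'STRONG', children: [ { nodeName: 'DIV', children: [] } ] },
];

console.log( deepCheck( allInline, isInlineNode ) ); // true
console.log( deepCheck( containsBlock, isInlineNode ) ); // false
```

Because `every` short-circuits, the traversal stops at the first node that fails the predicate.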

mcsf · 2018-04-19T18:01:23Z

blocks/api/raw-handling/phrasing-content-reducer.js

+
+	if ( node.nodeName === 'I' ) {
+		node = replaceTag( node, 'em', doc );
+	}


Minor, but the two previous if-statements could be else if.

aduth · 2018-04-19T17:40:54Z

blocks/api/raw-handling/embedded-content-reducer.js

+function isEmbedded( node, schema ) {
+	const tag = node.nodeName.toLowerCase();
+
+	if ( ! schema.figure || tag === 'figcaption' || isPhrasingContent( node ) ) {


Glancing at the name of this function and the referenced concept of "Embedded content", I am confused why there is any mention of figure and figcaption here.

In Gutenberg, we wrap all embedded content in figures. Paste also outputs this format for consistency.

aduth · 2018-04-19T17:44:12Z

blocks/api/raw-handling/embedded-content-reducer.js

@@ -41,7 +64,13 @@ export default function( node ) {
 		wrapper = wrapper.parentElement;
 	}

+	const figure = doc.createElement( 'figure' );


Same note: What does a figure have to do with anything here?

aduth · 2018-04-19T17:45:43Z

blocks/api/raw-handling/index.js

+ *
+ * @return {string} HTML only containing phrasing content.
+ */
+function filterInlineHTML( HTML ) {


The argument is not technically camel-cased†. I would expect lower-case html.

† Well, apparently this is up for debate: https://en.wikipedia.org/wiki/Camel_case But I think of initial capital as PascalCase.

HTML? I never really write HTML lowercased. Happy to change though. There are a number of other places too where it is written like this.

aduth · 2018-04-19T17:49:44Z

blocks/api/raw-handling/index.js

-		// Allows us to ask for this information when we get a report.
-		window.console.log( 'Processed inline HTML:\n\n', HTML );
+	if ( mode === 'INLINE' ) {
+		return filterInlineHTML( HTML );


Minor: Why do we return here so late? Much wasted effort in converting shortcodes if we don't need to.

Good point. Remnant from previous code structure I guess.

aduth · 2018-04-19T17:52:00Z

blocks/api/raw-handling/index.js

-			createUnwrapper( isInvalidInline ),
-		] );
+		if ( ! canUserUseUnfilteredHTML ) {
+			filters.unshift( ( node ) =>


Does order matter here? Array#unshift is much slower than Array#push, if we can just push instead:

https://jsperf.com/array-push-vs-unshift-vs-direct-assignment/2

Yes. Should happen before embeddedContentReducer. I'll leave a comment.

aduth · 2018-04-19T18:17:36Z

blocks/api/raw-handling/utils.js

-	const nodeName = node.nodeName.toLowerCase();
-	return inlineWhitelist.hasOwnProperty( nodeName ) || isInlineForTag( nodeName, tagName );
-}
+export function getPhrasingContentSchema() {


How often do we call this? Does it need to be a function, or can we construct this (or at least parts of it, like the base schema) as a constant?

Good point! I wrapped it in a function so that it cannot be changed by block authors for the whole application. I guess we can calculate once and return the constant.

aduth · 2018-04-19T18:20:36Z

blocks/api/raw-handling/utils.js

+		sub: {},
+		sup: {},
+		br: {},
+		[ TEXT_NODE ]: {},


A bit odd to be mixing node names and numeric constants.

The alternative here is #text which is any text node's nodeName. I might actually prefer this.

aduth · 2018-04-19T18:26:09Z

blocks/api/raw-handling/utils.js

-
-	// If it's plain text, there should only be one node left.
-	return doc.body.childNodes.length === 1 && doc.body.firstChild.nodeType === TEXT_NODE;
+	return ! /<(?!br)/i.test( HTML );


isPlain( '<brazenly-defying-odds-with-my-custom-element />' );

Oh no. :) I'll adjust and add a test case.
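The counterexample works because the negative lookahead `(?!br)` also excludes any tag whose name merely starts with "br". One possible adjustment, offered as a sketch rather than as the fix that landed, is to require a space, slash or closing bracket right after `br`. The names `isPlainLoose` and `isPlainStrict` are made up for the comparison.

```javascript
// Regex from the diff: any tag name starting with "br" is mistaken for a line break.
function isPlainLoose( html ) {
	return ! /<(?!br)/i.test( html );
}

// Stricter sketch: only <br>, <br/> and <br ...> are tolerated in plain text.
function isPlainStrict( html ) {
	return ! /<(?!br[ />])/i.test( html );
}

const customElement = '<brazenly-defying-odds-with-my-custom-element />';

console.log( isPlainLoose( customElement ) ); // true (the bug)
console.log( isPlainStrict( customElement ) ); // false
console.log( isPlainStrict( 'line one<br>line two' ) ); // true
console.log( isPlainStrict( 'line one<br />line two' ) ); // true
console.log( isPlainStrict( '<p>paragraph</p>' ) ); // false
```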

aduth · 2018-04-19T18:29:43Z

blocks/api/raw-handling/utils.js

+				} );
+
+				// Strip invalid classes.
+				const newClasses = classes.filter( ( name ) => node.classList.contains( name ) );


Should we check to see whether there are even any classes assigned before running through this filter / removing the attribute?

https://developer.mozilla.org/en-US/docs/Web/API/DOMTokenList/length

aduth · 2018-04-19T18:30:03Z

blocks/api/raw-handling/utils.js

+				}
+
+				if ( node.hasChildNodes() ) {
+					// Contine if the node is supposed to have children.


Typo: "Contine" -> "Continue"

ellatrix · 2018-04-20T19:24:39Z

Addressed all actionable feedback and rebased.

ellatrix · 2018-04-25T18:01:49Z

Rebased with the latest block changes.

Comment, test, clean up
Do not allow figures without embedded content
Adjust getContentSchema signature
Destructure where possible
Fix typo
Add test for 315819d
Move schemas to blocks
Simplify
Restore iframe filter
Add Markdown integration test
Address feedback
Separate Markdown converter
Remove unneeded nodeType checks

mcsf

This is impressive work. We need to solve for IE before merging. After this PR, some docs are in order to explain the schema subsystem, how getPhrasingContentSchema is useful for third-party blocks, etc.

Once IE is handled, 🚢

mcsf · 2018-05-02T16:32:29Z

core-blocks/list/index.js

+				schema: {
+					ol: listContentSchema.ol,
+					ul: listContentSchema.ul,
+				},


Should just be schema: listContentSchema, no?

Hm, no, it would contain phrasing content too. List content can contain phrasing content and lists.

mcsf · 2018-05-02T16:32:41Z

core-blocks/list/index.js

+
+// Recursion is needed.
+// Possible: ul > li > ul.
+// Impossible: ul > ul.


mcsf · 2018-05-02T16:41:09Z

blocks/api/raw-handling/index.js

+		.map( ( transform ) => {
+			return transform.isMatch ? transform : {
+				...transform,
+				isMatch: ( node ) => transform.selector && node.matches( transform.selector ),


It seems like Element#matches isn't supported on IE. Per Slack, resorting to a lib to polyfill may be in order.

Will adjust.

Maybe in the block library we can just keep using nodeName for the simple checks and avoid pulling in the polyfill and confusing block authors.
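As an alternative to a polyfill library, a small wrapper can fall back to the prefixed methods that older browsers expose (`msMatchesSelector` in IE, `webkitMatchesSelector` in older WebKit). This is a sketch, not what this PR does: the helper name `matchesSelector` is an assumption, and the PR itself resolves the issue with a polyfill commit.

```javascript
// Call Element#matches where available, otherwise a vendor-prefixed equivalent.
function matchesSelector( node, selector ) {
	const nativeMatches =
		node.matches || node.msMatchesSelector || node.webkitMatchesSelector;

	if ( ! nativeMatches ) {
		// Nothing usable on this node (e.g. a text node): treat as no match.
		return false;
	}

	return nativeMatches.call( node, selector );
}

// Usage with stand-ins for elements in a modern browser and in IE.
const modernNode = { matches: ( selector ) => selector === 'h2' };
const legacyNode = { msMatchesSelector: ( selector ) => selector === 'h2' };

console.log( matchesSelector( modernNode, 'h2' ) ); // true
console.log( matchesSelector( legacyNode, 'h2' ) ); // true
console.log( matchesSelector( legacyNode, 'p' ) ); // false
```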

ellatrix · 2018-05-02T21:41:08Z

Thank you @mcsf! Will create some docs for raw transforms in a separate PR.

mtias · 2018-05-03T14:26:29Z

Great work here, @iseulde 🎈

ellatrix added [Status] In Progress Tracking issues with work in progress [Feature] Paste labels Apr 3, 2018

ellatrix force-pushed the try/paste-whitelist branch 3 times, most recently from 0dcddaf to 07664af Compare April 4, 2018 14:54

ellatrix changed the title ~~Redo paste whitelisting~~ Add paste schema (fix various issues, simplify) Apr 4, 2018

ellatrix force-pushed the try/paste-whitelist branch from 0eab0e8 to 961e82b Compare April 4, 2018 16:24

ellatrix requested a review from mcsf April 4, 2018 16:26

ellatrix force-pushed the try/paste-whitelist branch from 961e82b to 1b3134a Compare April 4, 2018 18:11

ellatrix removed the [Status] In Progress Tracking issues with work in progress label Apr 4, 2018

ellatrix requested a review from a team April 4, 2018 21:30

ellatrix force-pushed the try/paste-whitelist branch from 1b3134a to 0c7575e Compare April 11, 2018 15:51

ellatrix added this to the 2.7 milestone Apr 12, 2018

mcsf reviewed Apr 13, 2018

ellatrix requested review from mcsf and a team April 13, 2018 13:45

ellatrix force-pushed the try/paste-whitelist branch from b8d9713 to 17176a6 Compare April 17, 2018 20:31

mcsf added this to the 2.8 milestone Apr 18, 2018

ellatrix removed the [Status] In Progress Tracking issues with work in progress label Apr 18, 2018

ellatrix force-pushed the try/paste-whitelist branch from dceb20e to c5a23d8 Compare April 19, 2018 16:07

mcsf reviewed Apr 19, 2018

aduth reviewed Apr 19, 2018

ellatrix force-pushed the try/paste-whitelist branch 3 times, most recently from 47611e1 to c63f77d Compare April 20, 2018 19:23

ellatrix force-pushed the try/paste-whitelist branch from 630afc1 to c485fbd Compare April 21, 2018 10:14

This was referenced Apr 23, 2018

Normalize characters with combining marks to correctly composed characters #433

Closed

No Paste as Text button on Paragraph block #6132

Closed

ellatrix force-pushed the try/paste-whitelist branch 2 times, most recently from 71a1e3d to e93177e Compare April 25, 2018 18:00

ellatrix force-pushed the try/paste-whitelist branch from e93177e to 038f086 Compare May 2, 2018 13:22

mcsf reviewed May 2, 2018

Polyfill Element#matches

5c1c5dd

ellatrix merged commit 37a03da into master May 2, 2018

ellatrix deleted the try/paste-whitelist branch May 2, 2018 22:52

aduth mentioned this pull request May 3, 2018

Transform into the correct embed block based on URL patterns #5315

Merged

ellatrix mentioned this pull request May 3, 2018

Restore priority on embed block for raw transforming #6572

Merged
