Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add paste schema (fix various issues, simplify) #5966

Merged
merged 2 commits into from
May 2, 2018
Merged

Conversation

ellatrix
Copy link
Member

@ellatrix ellatrix commented Apr 3, 2018

Description

This branch simplifies whitelisting and reduces the amount of filters. The idea is that there is one schema with all possible paths that a pasted node tree can have. If something is different from what is possible, the node should unwrap or be deleted.

Why adding a schema with an infinite amount of paths? The whitelisting we had before was just too simplistic. It lacks context. For example, a list (OL) can only contain list items (LI), a figure caption can only be part of a figure etc. The schema also takes care of a number of filters we had before by the generalisation, e.g. stripping text nodes out of tables where they don't belong, stripping comments, and finally it also incorporates attribute stripping.

Instead of looping three times over the whole pasted node tree, we are now only looping twice. One filters and adjusts broken HTML, from the bottom of the tree to the top, and the other one removes invalid HTML according to the schema, from the top to the bottom of the tree.

Another big difference from master is that the schema is now part of the transforms that blocks register, not of the raw handler module itself. This is quite a nice separation and it also enables block authors to register more schemas to be taken into account when converting old content.

Also new is the selector property on transforms. In a lot of cases in can be used in favour of the isMatch( node ) function.

I also got rid of the inInvalidInline filter which was looping through all nodes inside a loop. 🙈

Additionally I tried to simplify and optimize some of the rest of the code, and add some doc blocks.

How Has This Been Tested?

Paste tests should pass.

Fixes case where a loose figcaption is kept and put in a classic block. The expected result is the figcaption element to unwrap. Try pasting from https://www.nytimes.com/2018/03/28/magazine/poem-the-nod.html.

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code has proper inline documentation.

@ellatrix ellatrix added [Status] In Progress Tracking issues with work in progress [Feature] Paste labels Apr 3, 2018
@ellatrix ellatrix force-pushed the try/paste-whitelist branch 3 times, most recently from 0dcddaf to 07664af Compare April 4, 2018 14:54
@ellatrix ellatrix changed the title Redo paste whitelisting Add paste schema (fix various issues, simplify) Apr 4, 2018
@ellatrix ellatrix force-pushed the try/paste-whitelist branch from 0eab0e8 to 961e82b Compare April 4, 2018 16:24
@ellatrix ellatrix requested a review from mcsf April 4, 2018 16:26
@ellatrix ellatrix force-pushed the try/paste-whitelist branch from 961e82b to 1b3134a Compare April 4, 2018 18:11
@ellatrix ellatrix removed the [Status] In Progress Tracking issues with work in progress label Apr 4, 2018
@ellatrix ellatrix requested a review from a team April 4, 2018 21:30
@ellatrix ellatrix force-pushed the try/paste-whitelist branch from 1b3134a to 0c7575e Compare April 11, 2018 15:51
@ellatrix
Copy link
Member Author

Rebased.

@ellatrix ellatrix added this to the 2.7 milestone Apr 12, 2018
Copy link
Contributor

@mcsf mcsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is big and there's quite a bit to review, so I'll be making other passes. :) That said, it's looking quite nice; I'm all for simplifying raw-handling.

Concerning your testing notes, copying the Times' poem only works if I copy the poem itself (with title), but if I copy a larger chunk of the page, the poem is missing:

Copied Pasted
screen shot 2018-04-13 at 13 02 08 screen shot 2018-04-13 at 13 02 14
screen shot 2018-04-13 at 13 00 51 screen shot 2018-04-13 at 13 00 33

utils/dom.js Outdated
@@ -441,3 +441,41 @@ export function remove( node ) {
export function insertAfter( newNode, referenceNode ) {
referenceNode.parentNode.insertBefore( newNode, referenceNode.nextSibling );
}

/**
* Unwrap the givin node. This means any child nodes are moved to the parent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/givin/given

whitelist[ tag ].attributes &&
whitelist[ tag ].attributes.indexOf( attribute ) !== -1
);
export function getContentSchema( { iframe } = { iframe: true } ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function signature is tricky to me, because of the following table:

  • getContentSchema(): iframe equals true (expectable)
  • getContentSchema( { foo: 'bar', iframe: val } ): iframe equals val (expectable)
  • getContentSchema( { foo: 'bar' } ): iframe equals false (...huh?)

A fix would be to write the signature as getContentSchema( { iframe = true } ) {, though that requires passing an object (or false) on every call. Addressing both issues, getContentSchema( { iframe = true } = {} ) is also possible, though more cryptic. :)


if ( node.nodeName === 'SPAN' ) {
const fontWeight = node.style.fontWeight;
const fontStyle = node.style.fontStyle;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: const { fontWeight, fontStyle } = node.style;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll adjust. Just copied the old code.

@ellatrix
Copy link
Member Author

Good catch about copying the whole page. The issue is that the poem is wrapped in a figure, so we're dropping anything we don't expect in there. I'll adjust so we require embedded content in a figure.

@ellatrix
Copy link
Member Author

All issues addressed.

@ellatrix ellatrix requested review from mcsf and a team April 13, 2018 13:45
@ellatrix
Copy link
Member Author

I'll add a test for 315819d.

@greatislander
Copy link
Contributor

greatislander commented Apr 13, 2018

@iseulde As per Slack discussion with @aduth, I'm wondering if this refactor supports the following use case:

In Classic Editor, I have a custom div with a class (<div class="special-block">) which can contain various elements. I've created a custom block and a transform which will convert any instances of these divs to InnerBlocks. What happens with Gutenberg 2.6 is that given this input:

<div class="special-block">
<p>Some text.</p>
</div>

When I select "Convert to Blocks" in Gutenberg, the wrapper div is stripped:

<!-- wp:paragraph -->
<p>Some text.</p>
<!-- /wp:paragraph -->

@aduth mentioned that this is probably intentional behaviour within the raw handler, so as to remove excess markup from Word, Google Docs, etc. I'm wondering if there is (or could be) a way of whitelisting certain custom elements (e.g., don't strip <div class="special-element"> if a transform is defined which looks for it). Thanks in advance, happy to provide further information if needed.

@ellatrix
Copy link
Member Author

@greatislander That's a great question! We don't currently support that in master, nor did I add support in this branch, but I agree it's something that we should think about. I think this PR will help making that easier to do. I can imagine that we end up moving every schema item to the transforms themselves.

Would you mind creating a separate issue for this? Once this is merged I'll look into it.

@ellatrix
Copy link
Member Author

Actually, since this is unlikely to merge this week, I might end up playing with splitting the schema over the weekend...

@greatislander
Copy link
Contributor

Sure thing!

@ellatrix
Copy link
Member Author

So the idea is that instead of having "matchers" like ( node ) => /H\d/.test( node.nodeName ) we'd let the blocks add all the pieces of schema. Working on this now... One difficult point is figuring out which transform to use if there could be multiple schemas used... I was thinking we could register them as selectors like div.special-block: { /* schema */ }.

@greatislander
Copy link
Contributor

@iseulde That sounds excellent. One related issue I opened previously was #6020. I don't know how much that is within the scope of this project, but thought I'd mention it.

@greatislander
Copy link
Contributor

Specific use case for #6020: the ability to migrate the contents of a shortcode as InnerBlocks when the shortcode contains multiple elements. So, a shortcode transform that handles InnerBlocks.

@jasmussen
Copy link
Contributor

Very probably unrelated to this branch, so apologies, but I saw mention of shortcodes and pasting, and I have a question about that. Right now anything written in square brackets is detected as a shortcode. For example, the following breaks into three blocks, one of them a shortcode block:

This is some text, and [this is some text in brackets]. 

screen shot 2018-04-17 at 08 43 17

This is not the end of the world as brackets aren't heavily used typographicalluy, but it's also not ideal. Is there a way we can enhance the shortcode detection regex to look for shortcodes of a specific pattern and extract only those? Even if we didn't detect all shortcodes, that would probably be okay since shortcodes work inside normal paragraphs too.

@ellatrix
Copy link
Member Author

ellatrix commented Apr 17, 2018

@jasmussen I'll look into this after this PR. This one doesn't really touch shortcodes at all, but it's all good to keep in mind while redoing this, thanks. Pushing to this branch later today with the ability to extend, just need to clean up the mess. :)

@ellatrix
Copy link
Member Author

An issue I'm having with moving all the schemas to the blocks is that the tests now have an implicit dependency on the blocks...

@ellatrix ellatrix force-pushed the try/paste-whitelist branch from b8d9713 to 17176a6 Compare April 17, 2018 20:31
@greatislander
Copy link
Contributor

@iseulde Works for me! This is fantastic. Thanks for taking my suggestion and running with it 👍

@mcsf mcsf added this to the 2.8 milestone Apr 18, 2018
@ellatrix
Copy link
Member Author

I think this is ready for another review.

@ellatrix ellatrix removed the [Status] In Progress Tracking issues with work in progress label Apr 18, 2018
@ellatrix
Copy link
Member Author

@aduth Added a Markdown integration test in dceb20e. Note that this runs through the serialiser, so the result is "beautiful". But as you can see the list block can parse the list items correctly. Hope this helps, having an extra integration text can't hurt anyway. :)

@ellatrix ellatrix force-pushed the try/paste-whitelist branch from dceb20e to c5a23d8 Compare April 19, 2018 16:07
@ellatrix
Copy link
Member Author

Rebased.

Copy link
Contributor

@mcsf mcsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good work, and the added tests, including integration tests, are welcome. This is a partial round of feedback; I'll circle back soon.

}

return schema.figure.children.hasOwnProperty( tag );
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least upon first review, the relationship between figure/figcaption and the idea of embedded content isn't clear when reading the @see reference. I might change my mind as I keep reviewing.

Edit: tests like https://github.com/WordPress/gutenberg/pull/5966/files#diff-5be16f01c683c067fee5722d0851f860R17 help understand, but maybe we can make the goal or premises of this clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more information here.

.map( ( transform ) => ( {
isMatch: ( node ) => transform.selector && node.matches( transform.selector ),
...transform,
} ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the FP/immutable thinking, but I'm wondering if we can be a bit more efficient here, e.g.

.map( ( transform ) => {
  if ( transform.isMatch ) {
    return transform;
  }
  return {
    ...transform,
    isMatch: ( node ) => ,
  }
} )

if ( ! canUserUseUnfilteredHTML ) {
filters.unshift( ( node ) =>
node.nodeName === 'iframe' && unwrap( node )
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a filter on its own, with a name, etc.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it can be. :)

);
return Array.from( doc.body.children ).map( ( node ) => {
const { transform, blockName } =
find( rawTransformations, ( { isMatch } ) => isMatch( node ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the const assignment destructured if find returns undefined?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related: later, what are we creating in createBlock( blockName, … ) if we didn't find a transformation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never happen as there will not be any top level tags left that are not defined in schemas. That said, better to be prudent, and not error? Maybe we can add a doing it wrong message instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted so that if someone forgets to add a selector:

A block registered a raw transformation schema for H2 but did not match it. Make sure there is a selector or isMatch property that can match the schema.
Sanitized HTML: <h2>Issue Overview</h2>


/**
* An array of tag groups used by isInlineForTag function.
* If tagName and nodeName are present in the same group, the node should be treated as inline.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The references to tagName and nodeName don't make sense here. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this code from utils, didn't write this myself. No excuse though. I'll see what I can improve.

return nodes.every( ( node ) =>
isInline( node, tagName ) && deepCheck( Array.from( node.children ), tagName )
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally minor and tangent: this can be abstracted as a general tree traversal predicate — rough example:

deepCheck( nodes, predicate ) {
  return nodes.every( ( node ) =>
    predicate( node ) && deepCheck( Array.from( node.children ), predicate )
  );
}

I only bring this up because other pieces of the project use some form of tree-like manipulation (e.g. buildTermsTree) and having some place to keep these more generic functions could be nice in the long term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand from the code example but will have a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't care about this in this PR, or even in the next. :)


if ( node.nodeName === 'I' ) {
node = replaceTag( node, 'em', doc );
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but the two previous if-statements could be else if.

function isEmbedded( node, schema ) {
const tag = node.nodeName.toLowerCase();

if ( ! schema.figure || tag === 'figcaption' || isPhrasingContent( node ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glancing at the name of this function and the referenced concept of "Embedded content", I am confused why there is any mention of figure and figcaption here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Gutenberg, we wrap all embedded content in figures. Paste also outputs this format for consistency.

@@ -41,7 +64,13 @@ export default function( node ) {
wrapper = wrapper.parentElement;
}

const figure = doc.createElement( 'figure' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note: What does a figure have to do with anything here?

*
* @return {string} HTML only containing phrasing content.
*/
function filterInlineHTML( HTML ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument is not technically camel-cased†. I would expect lower-case html.

† Well, apparently this is up for debate: https://en.wikipedia.org/wiki/Camel_case But I think of initial capital as PascalCase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTML? I never really write HTML lowercased. Happy to change though. There are a number of other places to where it is written like this.

// Allows us to ask for this information when we get a report.
window.console.log( 'Processed inline HTML:\n\n', HTML );
if ( mode === 'INLINE' ) {
return filterInlineHTML( HTML );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Why do we return here so late? Much wasted effort in converting shortcodes if we don't need to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Remnant from previous code structure I guess.

createUnwrapper( isInvalidInline ),
] );
if ( ! canUserUseUnfilteredHTML ) {
filters.unshift( ( node ) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does order matter here? Array#unshift is much slower than Array#push, if we can just push instead:

https://jsperf.com/array-push-vs-unshift-vs-direct-assignment/2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Should happen before embeddedContentReducer. I'll leave a comment.

const nodeName = node.nodeName.toLowerCase();
return inlineWhitelist.hasOwnProperty( nodeName ) || isInlineForTag( nodeName, tagName );
}
export function getPhrasingContentSchema() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often do we call this? Does it need to be a function, or can we construct this (or at least parts of it, like the base schema) as a constant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I wrapped it in a function so that it cannot be changed by block authors for the whole application. I guess we can calculate once and return the constant.

sub: {},
sup: {},
br: {},
[ TEXT_NODE ]: {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit odd to be mixing node names and numeric constants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative here is #text which is any text node's nodeName. I might actually prefer this.


// If it's plain text, there should only be one node left.
return doc.body.childNodes.length === 1 && doc.body.firstChild.nodeType === TEXT_NODE;
return ! /<(?!br)/i.test( HTML );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isPlain( '<brazenly-defying-odds-with-my-custom-element />' );

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no. :) I'll adjust and add a test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😃

} );

// Strip invalid classes.
const newClasses = classes.filter( ( name ) => node.classList.contains( name ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check to see whether there are even any classes assigned before running through this filter / removing the attribute?

https://developer.mozilla.org/en-US/docs/Web/API/DOMTokenList/length

}

if ( node.hasChildNodes() ) {
// Contine if the node is supposed to have children.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "Contine" -> "Continue"

@ellatrix ellatrix force-pushed the try/paste-whitelist branch 3 times, most recently from 47611e1 to c63f77d Compare April 20, 2018 19:23
@ellatrix
Copy link
Member Author

Addressed all actionable feedback and rebased.

@ellatrix
Copy link
Member Author

Rebased with the latest block changes.

Comment, test, clean up

Do not allow figures without embedded content

Adjust getContentSchema signature

Destructure where possible

Fix typo

Add test for 315819d

Move schemas to blocks

Simplify

Restore iframe filter

Add Markdown integration test

Address feedback

Separate Markdown converter

Remove unneeded nodeType checks
@ellatrix ellatrix force-pushed the try/paste-whitelist branch from e93177e to 038f086 Compare May 2, 2018 13:22
Copy link
Contributor

@mcsf mcsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is impressive work. We need to solve for IE before merging. After this PR, some docs are in order to explain the schema subsystem, how getPhrasingContentSchema is useful for third-party blocks, etc.

Once IE is handled, :shipit: 🚢

schema: {
ol: listContentSchema.ol,
ul: listContentSchema.ul,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should just be schema: listContentSchema, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, no, it would contain phrasing content too. List content can contain phrasing content and lists.


// Recursion is needed.
// Possible: ul > li > ul.
// Impossible: ul > ul.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

.map( ( transform ) => {
return transform.isMatch ? transform : {
...transform,
isMatch: ( node ) => transform.selector && node.matches( transform.selector ),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like Element#matches isn't supported on IE. Per Slack, resorting to a lib to polyfill may be in order.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will adjust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the block library we can just keep using nodeName for the simple checks and avoid pulling in the polyfill and confuse block authors.

@ellatrix
Copy link
Member Author

ellatrix commented May 2, 2018

Thank you @mcsf! Will create some docs for raw transforms in a separate PR.

@ellatrix ellatrix merged commit 37a03da into master May 2, 2018
@ellatrix ellatrix deleted the try/paste-whitelist branch May 2, 2018 22:52
@mtias
Copy link
Member

mtias commented May 3, 2018

Great work here, @iseulde 🎈

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants