-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add paste schema (fix various issues, simplify) #5966
Conversation
0dcddaf
to
07664af
Compare
0eab0e8
to
961e82b
Compare
961e82b
to
1b3134a
Compare
1b3134a
to
0c7575e
Compare
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is big and there's quite a bit to review, so I'll be making other passes. :) That said, it's looking quite nice; I'm all for simplifying raw-handling
.
Concerning your testing notes, copying the Times' poem only works if I copy the poem itself (with title), but if I copy a larger chunk of the page, the poem is missing:
Copied | Pasted |
---|---|
utils/dom.js
Outdated
@@ -441,3 +441,41 @@ export function remove( node ) { | |||
export function insertAfter( newNode, referenceNode ) { | |||
referenceNode.parentNode.insertBefore( newNode, referenceNode.nextSibling ); | |||
} | |||
|
|||
/** | |||
* Unwrap the givin node. This means any child nodes are moved to the parent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/givin/given
blocks/api/raw-handling/utils.js
Outdated
whitelist[ tag ].attributes && | ||
whitelist[ tag ].attributes.indexOf( attribute ) !== -1 | ||
); | ||
export function getContentSchema( { iframe } = { iframe: true } ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function signature is tricky to me, because of the following table:
getContentSchema()
:iframe
equalstrue
(expectable)getContentSchema( { foo: 'bar', iframe: val } )
:iframe
equalsval
(expectable)getContentSchema( { foo: 'bar' } )
:iframe
equalsfalse
(...huh?)
A fix would be to write the signature as getContentSchema( { iframe = true } ) {
, though that requires passing an object (or false
) on every call. Addressing both issues, getContentSchema( { iframe = true } = {} )
is also possible, though more cryptic. :)
|
||
if ( node.nodeName === 'SPAN' ) { | ||
const fontWeight = node.style.fontWeight; | ||
const fontStyle = node.style.fontStyle; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: const { fontWeight, fontStyle } = node.style;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll adjust. Just copied the old code.
Good catch about copying the whole page. The issue is that the poem is wrapped in a figure, so we're dropping anything we don't expect in there. I'll adjust so we require embedded content in a figure. |
All issues addressed. |
I'll add a test for 315819d. |
@iseulde As per Slack discussion with @aduth, I'm wondering if this refactor supports the following use case: In Classic Editor, I have a custom div with a class (
When I select "Convert to Blocks" in Gutenberg, the wrapper div is stripped:
@aduth mentioned that this is probably intentional behaviour within the raw handler, so as to remove excess markup from Word, Google Docs, etc. I'm wondering if there is (or could be) a way of whitelisting certain custom elements (e.g., don't strip |
@greatislander That's a great question! We don't currently support that in master, nor did I add support in this branch, but I agree it's something that we should think about. I think this PR will help making that easier to do. I can imagine that we end up moving every schema item to the transforms themselves. Would you mind creating a separate issue for this? Once this is merged I'll look into it. |
Actually, since this is unlikely to merge this week, I might end up playing with splitting the schema over the weekend... |
Sure thing! |
So the idea is that instead of having "matchers" like |
Specific use case for #6020: the ability to migrate the contents of a shortcode as InnerBlocks when the shortcode contains multiple elements. So, a shortcode transform that handles InnerBlocks. |
Very probably unrelated to this branch, so apologies, but I saw mention of shortcodes and pasting, and I have a question about that. Right now anything written in square brackets is detected as a shortcode. For example, the following breaks into three blocks, one of them a shortcode block:
This is not the end of the world as brackets aren't heavily used typographicalluy, but it's also not ideal. Is there a way we can enhance the shortcode detection regex to look for shortcodes of a specific pattern and extract only those? Even if we didn't detect all shortcodes, that would probably be okay since shortcodes work inside normal paragraphs too. |
@jasmussen I'll look into this after this PR. This one doesn't really touch shortcodes at all, but it's all good to keep in mind while redoing this, thanks. Pushing to this branch later today with the ability to extend, just need to clean up the mess. :) |
An issue I'm having with moving all the schemas to the blocks is that the tests now have an implicit dependency on the blocks... |
b8d9713
to
17176a6
Compare
@iseulde Works for me! This is fantastic. Thanks for taking my suggestion and running with it 👍 |
I think this is ready for another review. |
@aduth Added a Markdown integration test in dceb20e. Note that this runs through the serialiser, so the result is "beautiful". But as you can see the list block can parse the list items correctly. Hope this helps, having an extra integration text can't hurt anyway. :) |
dceb20e
to
c5a23d8
Compare
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good work, and the added tests, including integration tests, are welcome. This is a partial round of feedback; I'll circle back soon.
} | ||
|
||
return schema.figure.children.hasOwnProperty( tag ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least upon first review, the relationship between figure
/figcaption
and the idea of embedded content isn't clear when reading the @see
reference. I might change my mind as I keep reviewing.
Edit: tests like https://github.com/WordPress/gutenberg/pull/5966/files#diff-5be16f01c683c067fee5722d0851f860R17 help understand, but maybe we can make the goal or premises of this clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add more information here.
blocks/api/raw-handling/index.js
Outdated
.map( ( transform ) => ( { | ||
isMatch: ( node ) => transform.selector && node.matches( transform.selector ), | ||
...transform, | ||
} ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the FP/immutable thinking, but I'm wondering if we can be a bit more efficient here, e.g.
.map( ( transform ) => {
if ( transform.isMatch ) {
return transform;
}
return {
...transform,
isMatch: ( node ) => …,
}
} )
blocks/api/raw-handling/index.js
Outdated
if ( ! canUserUseUnfilteredHTML ) { | ||
filters.unshift( ( node ) => | ||
node.nodeName === 'iframe' && unwrap( node ) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a filter on its own, with a name, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it can be. :)
blocks/api/raw-handling/index.js
Outdated
); | ||
return Array.from( doc.body.children ).map( ( node ) => { | ||
const { transform, blockName } = | ||
find( rawTransformations, ( { isMatch } ) => isMatch( node ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the const
assignment destructured if find
returns undefined
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related: later, what are we creating in createBlock( blockName, … )
if we didn't find a transformation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never happen as there will not be any top level tags left that are not defined in schemas. That said, better to be prudent, and not error? Maybe we can add a doing it wrong message instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted so that if someone forgets to add a selector:
A block registered a raw transformation schema for
H2
but did not match it. Make sure there is aselector
orisMatch
property that can match the schema.
Sanitized HTML:<h2>Issue Overview</h2>
|
||
/** | ||
* An array of tag groups used by isInlineForTag function. | ||
* If tagName and nodeName are present in the same group, the node should be treated as inline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The references to tagName
and nodeName
don't make sense here. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this code from utils
, didn't write this myself. No excuse though. I'll see what I can improve.
return nodes.every( ( node ) => | ||
isInline( node, tagName ) && deepCheck( Array.from( node.children ), tagName ) | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally minor and tangent: this can be abstracted as a general tree traversal predicate — rough example:
deepCheck( nodes, predicate ) {
return nodes.every( ( node ) =>
predicate( node ) && deepCheck( Array.from( node.children ), predicate )
);
}
I only bring this up because other pieces of the project use some form of tree-like manipulation (e.g. buildTermsTree
) and having some place to keep these more generic functions could be nice in the long term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand from the code example but will have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't care about this in this PR, or even in the next. :)
|
||
if ( node.nodeName === 'I' ) { | ||
node = replaceTag( node, 'em', doc ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, but the two previous if-statements could be else if
.
function isEmbedded( node, schema ) { | ||
const tag = node.nodeName.toLowerCase(); | ||
|
||
if ( ! schema.figure || tag === 'figcaption' || isPhrasingContent( node ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glancing at the name of this function and the referenced concept of "Embedded content", I am confused why there is any mention of figure
and figcaption
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Gutenberg, we wrap all embedded content in figures. Paste also outputs this format for consistency.
@@ -41,7 +64,13 @@ export default function( node ) { | |||
wrapper = wrapper.parentElement; | |||
} | |||
|
|||
const figure = doc.createElement( 'figure' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note: What does a figure have to do with anything here?
* | ||
* @return {string} HTML only containing phrasing content. | ||
*/ | ||
function filterInlineHTML( HTML ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument is not technically camel-cased†. I would expect lower-case html
.
† Well, apparently this is up for debate: https://en.wikipedia.org/wiki/Camel_case But I think of initial capital as PascalCase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTML
? I never really write HTML lowercased. Happy to change though. There are a number of other places to where it is written like this.
blocks/api/raw-handling/index.js
Outdated
// Allows us to ask for this information when we get a report. | ||
window.console.log( 'Processed inline HTML:\n\n', HTML ); | ||
if ( mode === 'INLINE' ) { | ||
return filterInlineHTML( HTML ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Why do we return here so late? Much wasted effort in converting shortcodes if we don't need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Remnant from previous code structure I guess.
blocks/api/raw-handling/index.js
Outdated
createUnwrapper( isInvalidInline ), | ||
] ); | ||
if ( ! canUserUseUnfilteredHTML ) { | ||
filters.unshift( ( node ) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does order matter here? Array#unshift
is much slower than Array#push
, if we can just push instead:
https://jsperf.com/array-push-vs-unshift-vs-direct-assignment/2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Should happen before embeddedContentReducer
. I'll leave a comment.
blocks/api/raw-handling/utils.js
Outdated
const nodeName = node.nodeName.toLowerCase(); | ||
return inlineWhitelist.hasOwnProperty( nodeName ) || isInlineForTag( nodeName, tagName ); | ||
} | ||
export function getPhrasingContentSchema() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often do we call this? Does it need to be a function, or can we construct this (or at least parts of it, like the base schema) as a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I wrapped it in a function so that it cannot be changed by block authors for the whole application. I guess we can calculate once and return the constant.
blocks/api/raw-handling/utils.js
Outdated
sub: {}, | ||
sup: {}, | ||
br: {}, | ||
[ TEXT_NODE ]: {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit odd to be mixing node names and numeric constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative here is #text
which is any text node's nodeName
. I might actually prefer this.
blocks/api/raw-handling/utils.js
Outdated
|
||
// If it's plain text, there should only be one node left. | ||
return doc.body.childNodes.length === 1 && doc.body.firstChild.nodeType === TEXT_NODE; | ||
return ! /<(?!br)/i.test( HTML ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isPlain( '<brazenly-defying-odds-with-my-custom-element />' );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no. :) I'll adjust and add a test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😃
blocks/api/raw-handling/utils.js
Outdated
} ); | ||
|
||
// Strip invalid classes. | ||
const newClasses = classes.filter( ( name ) => node.classList.contains( name ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check to see whether there are even any classes assigned before running through this filter / removing the attribute?
https://developer.mozilla.org/en-US/docs/Web/API/DOMTokenList/length
blocks/api/raw-handling/utils.js
Outdated
} | ||
|
||
if ( node.hasChildNodes() ) { | ||
// Contine if the node is supposed to have children. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: "Contine" -> "Continue"
47611e1
to
c63f77d
Compare
Addressed all actionable feedback and rebased. |
630afc1
to
c485fbd
Compare
71a1e3d
to
e93177e
Compare
Rebased with the latest block changes. |
Comment, test, clean up Do not allow figures without embedded content Adjust getContentSchema signature Destructure where possible Fix typo Add test for 315819d Move schemas to blocks Simplify Restore iframe filter Add Markdown integration test Address feedback Separate Markdown converter Remove unneeded nodeType checks
e93177e
to
038f086
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is impressive work. We need to solve for IE before merging. After this PR, some docs are in order to explain the schema subsystem, how getPhrasingContentSchema
is useful for third-party blocks, etc.
Once IE is handled, 🚢
schema: { | ||
ol: listContentSchema.ol, | ||
ul: listContentSchema.ul, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should just be schema: listContentSchema
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, no, it would contain phrasing content too. List content can contain phrasing content and lists.
|
||
// Recursion is needed. | ||
// Possible: ul > li > ul. | ||
// Impossible: ul > ul. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
.map( ( transform ) => { | ||
return transform.isMatch ? transform : { | ||
...transform, | ||
isMatch: ( node ) => transform.selector && node.matches( transform.selector ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like Element#matches
isn't supported on IE. Per Slack, resorting to a lib to polyfill may be in order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will adjust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in the block library we can just keep using nodeName
for the simple checks and avoid pulling in the polyfill and confuse block authors.
Thank you @mcsf! Will create some docs for raw transforms in a separate PR. |
Great work here, @iseulde 🎈 |
Description
This branch simplifies whitelisting and reduces the amount of filters. The idea is that there is one schema with all possible paths that a pasted node tree can have. If something is different from what is possible, the node should unwrap or be deleted.
Why adding a schema with an infinite amount of paths? The whitelisting we had before was just too simplistic. It lacks context. For example, a list (
OL
) can only contain list items (LI
), a figure caption can only be part of a figure etc. The schema also takes care of a number of filters we had before by the generalisation, e.g. stripping text nodes out of tables where they don't belong, stripping comments, and finally it also incorporates attribute stripping.Instead of looping three times over the whole pasted node tree, we are now only looping twice. One filters and adjusts broken HTML, from the bottom of the tree to the top, and the other one removes invalid HTML according to the schema, from the top to the bottom of the tree.
Another big difference from master is that the schema is now part of the transforms that blocks register, not of the raw handler module itself. This is quite a nice separation and it also enables block authors to register more schemas to be taken into account when converting old content.
Also new is the
selector
property on transforms. In a lot of cases in can be used in favour of theisMatch( node )
function.I also got rid of the
inInvalidInline
filter which was looping through all nodes inside a loop. 🙈Additionally I tried to simplify and optimize some of the rest of the code, and add some doc blocks.
How Has This Been Tested?
Paste tests should pass.
Fixes case where a loose
figcaption
is kept and put in a classic block. The expected result is thefigcaption
element to unwrap. Try pasting from https://www.nytimes.com/2018/03/28/magazine/poem-the-nod.html.Checklist: