-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Classic + Custom HTML blocks: Convert to Blocks removes valid inline formatting #6102
Comments
@danielbachhuber I did a full test with all valid HTML5 inline semantic elements and here are the results. I put the following into a Classic block: <abbr>abbr</abbr> <b>b</b> <br>br <bdi>bdi</bdi> <bdo dir="rtl">bdo</bdo> <cite>cite</cite> <code>code</code> <data value="value">data</data> <dfn>dfn</dfn> <em>em</em> <i>i</i> <kbd>kbd</kbd> <mark>mark</mark> <q>q</q> <ruby>ruby <rb>rb</rb> <rp>rp</rp> <rt>rt</rt> <rtc>rtc</rtc> <rp>rp</rp></ruby> <s>s</s> <samp>samp</samp> <small>small</small> <span style="color:red">span</span> <strong>strong</strong> <sub>sub</sub> <sup>sup</sup> <time datetime="2018">time</time> <u>u</u> <var>var</var> <wbr>wbr I then converted it to a Paragraph block. Everything should have been the same except for the added <p><abbr>abbr</abbr> <strong>b</strong> <br/>br bdi bdo cite <code>code</code> data dfn <em>em</em> <em>i</em> kbd mark q ruby rb rp rt rtc rp s samp small span <strong>strong</strong> <sub>sub</sub> <sup>sup</sup> time u var wbr</p> And yes, The current behavior seems to have improved from when I first made the issue, but a lot of tags are still stripped out. Personally, I think this should be resolved before the Try Callout goes out. |
#6878 and #7604 are both related. The first is relevant because there's still some amount of the HTML5 spec we need to accommodate for: #6878 (comment) The second because, in the scenario where we're stripping out HTML, we need to improve the user experience. I don't think this needs to be resolved prior to the Try Callout though, for two reasons:
|
@danielbachhuber Okay, fair enough. 👍 |
@ZebulanStanphill @danielbachhuber F.Y.I. const phrasingContentSchema = {
strong: {},
em: {},
del: {},
ins: {},
a: { attributes: [ 'href', 'target', 'rel' ] },
code: {},
abbr: { attributes: [ 'title' ] },
sub: {},
sup: {},
br: {},
ruby: {
children: {
rt: {
children: {
'#text': {}
}
},
'#text': {}
}
},
'#text': {},
}; Now it works and classic block conversion keeps expected HTML. Fundamentally, In my opinion, JS filter hook for |
I got a similar behaviour with the Unfortunately in this case it gets attached to a youtube link before them and are not "visible" to the user after the converting. Test:
HTML before: (links exist) <p>https://www.youtube.com/watch?v=2DkaLmUHYOw&rel=0</p>
<address><a href="http://www.wope.net" target="_blank" mce_href="http://www.wope.net">www.wope.net</a></address>
<address style="text-align: justify;" mce_style="text-align: justify;"><a href="http://www.mediaresourcegroup.de" target="_blank" mce_href="http://www.mediaresourcegroup.de">www.mediaresourcegroup.de</a><p></p>
<p>Foto/Text/Video: Markus Wilmsmann</p>
</address> HTML after: (links are added to youtube link) <!-- wp:core-embed/youtube {"url":"https://www.youtube.com/watch?v=2DkaLmUHYOw\u0026rel=0www.wope.netwww.mediaresourcegroup.de"} -->
<figure class="wp-block-embed-youtube wp-block-embed"><div class="wp-block-embed__wrapper">
https://www.youtube.com/watch?v=2DkaLmUHYOw&rel=0www.wope.netwww.mediaresourcegroup.de
</div></figure>
<!-- /wp:core-embed/youtube -->
<!-- wp:paragraph -->
<p>Foto/Text/Video: Markus Wilmsmann</p>
<!-- /wp:paragraph --> |
As an everyday user, preserving inline markup during "convert to blocks" is critical to making the transition to Gutenberg. I manage a college website with around 350 active pages. At least 80 - 90% of them use inline HTML for styling, primarily . If that all gets wiped out, "convert to blocks" is not an option. Having to manually create each block, and then copy and paste each chunk of content manually, will be a very time-consuming project! This is not blog content, with headings, straightforward paragraphs, and a few images. Many of our pages are very information heavy, and text formatting is an important tool for keeping pages scannable and not too intimidating. |
This is made challenging by the fact that at the root of this conversion feature is the raw handler, which is also responsible for adapting pasted content from external sources to blocks. The HTML received by these sources (e.g. Google Docs) is littered with attributes and markup which makes sense only in the context of Google Docs, and not as a standalone document. This is a major part of the reason the filtering is as strict as it is. For example, consider an input document: https://docs.google.com/document/d/1wwL4H10bDUDAiYVAczLthGCdwxsNZ9YhWpWwO-YId_w/edit?usp=sharing Whose copied text yields the pasted HTML: <meta charset='utf-8'><meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-6a3f5726-7fff-f826-dcd7-8eef2659d75a"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><a href="https://drive.google.com/drive/u/0/folders/1k4bWkN088Hte1mehmPkKZHbois4Zjsar" style="text-decoration:none;"><span style="font-size:11pt;font-family:Arial;color:#1155cc;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">(View All Agendas)</span></a></p><br /><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Slack Transcript:</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> TBD</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Summary Notes:</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> TBD</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Note Taker:</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> TBD</span></p><br /><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><hr /></p><br /><br /></b> So while filtering the phrasing content schema could provide some improvement to classic block conversion, it might also inadvertently reintroduce some of the unwanted markup from these other sources. Perhaps then if we consider the classic conversion to be exempt to filtering, or conversely the paste handling to be exceptionally strict, we could do one of the following:
cc @iseulde : Since I'm not as familiar with the raw handling, have these sorts of approaches been considered previously? |
Let's close in favor of #6878 for technical implementation. |
Let's keep this open until the issue is fully addressed, and then we can use this issue to verify the implementation is complete. Here's a prior example of an issue that was closed prematurely #6648 (comment) |
@danielbachhuber This is a complete duplicate. Resolving one will resolve the other. The other issue you mention seems to contain several action items. |
I'll close the other one then in favour of this. |
|
I've also found "the cleanse" helpful in some cases. Would certainly be nice if there was a choice for "retain or clean" markup. But every |
PR: #11539. |
Issue Overview
The Paragraph block, as well as other textual blocks, allow
<abbr>
,<b>
,<code>
,<i>
,<kbd>
,<mark>
,<span>
,<time>
, and various other inline formatting and semantic tags to be added using the "Edit as HTML" option, and these are considered valid and are not removed on save.However, when converting a Classic block to standard blocks using the "Convert to Blocks" option, the paragraphs, lists, and blockquotes are stripped of some inline formatting tags, and a few are converted to tags that have a different semantic meaning.
I think inline formatting that is considered valid by the standard blocks should not be removed when converting a Classic block to standard blocks. Otherwise, you are stripping out formatting from the original post when you do not need to.
Steps to Reproduce (for bugs)
<b>
tags are converted to<strong>
tags.Expected Behavior
Converting a Classic block to standard blocks using the "Convert to Blocks" option should preserve all inline formatting tags that are considered valid by the resulting blocks.
Current Behavior
Converting a Classic block to standard blocks using the "Convert to Blocks" option removes several valid HTML5 tags, and converts some tags to other tags with different semantic meanings, e.g.
<b>
tags are converted to<strong>
tags. The previously given sample input above is transformed into the following:Other Notes
If a Classic block contains obsolete HTML tags like
<color>
, then converting them to<span>
tags upon conversion to standard blocks seems like a good idea.<b>
and<i>
tags are converted to<strong>
and<em>
respectively. However, it should be considered that there are valid cases where a<b>
or<i>
are used semantically as per the HTML5 specification. Additionally,<span>
tags containingfont-weight:bold
orfont-style:italic
are converted to<strong>
or<em>
tags respectively, and this seems like a bad idea. Most of the time, when someone makes text bold or italic using a<span>
tag, they are doing it for purely stylistic reasons and not semantic reasons, so converting those to<strong>
and<em>
tags seems like a bad idea.Related Issues and/or PRs
Converting To Blocks Looses Existing Anchor Tag Parameters #4498Allowable tags gettings stripped from Header blocks when drafts are auto-saved #5876Converting a table in a classic block to a table block removes all inline styles #6096The text was updated successfully, but these errors were encountered: