This is a more specific follow-up to #181.
When a self-closing tag is processed (such as <p/>), the output is an incorrectly unclosed tag (such as <p>). This causes significant structural issues when the content is read back in.
For example, the following code:
results in the following HTML (added linefeeds are mine):
which is interpreted by a browser (Firefox) as follows:
Self-closing <p/> elements pose a similar issue. While many browsers will force-close adjacent unclosed <p> elements due to their block-element-ness, many parsers (such as lxml) do not, and a similar cascade of misclosed <p> tags occurs there too.
We are able to work around it as follows:
but a proper fix would be better (and more efficient, as we process tens of thousands of HTML files at a time). Either self-closing tags should be self-closed by default (it's one more character), or they should be kept when keep_closing_tags==True (when working with a downstream parser that expects predominantly well-formed HTML).
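As a rough sketch of one possible pre-processing step (not necessarily the workaround referred to above), self-closing non-void tags can be expanded into an explicit open/close pair before the HTML reaches the minifier, so that dropping the trailing slash no longer changes the structure. The helper name `expand_self_closing` and the regular expression are hypothetical and written for illustration; they are not part of any library. The sketch assumes attribute values contain no `<` or `>` and that the input has no self-closing-looking text inside comments or scripts.

```python
import re

# HTML void elements never take a closing tag, so e.g. <br/> is left as-is.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})

# Matches a self-closing tag such as <p/> or <div class="x" />.
# Assumption: no "<" or ">" appears inside attribute values.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closing(html: str) -> str:
    """Rewrite <tag .../> as <tag ...></tag> for non-void elements."""

    def repl(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # leave void elements untouched
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSING.sub(repl, html)


if __name__ == "__main__":
    print(expand_self_closing('<div><p/>text<br/></div>'))
```

A regex pass like this costs one extra scan per file, which is part of why a fix in the minifier itself would still be preferable when processing many files.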