Self-closing tags (such as <p/>) are processed incorrectly #192

Open
chrispy-snps opened this issue Jun 7, 2024 · 0 comments

This is a more specific follow-up to #181.

When minify_html processes a self-closing tag (such as <p/>), it emits an unclosed start tag (such as <p>) instead. This causes significant structural damage when the output is parsed again.

For example, the following code:

import minify_html

html = """
<p>
  <span>ABC</span>
  <span/>
  <span/>
  <span/>
  <span/>
  <span/>
  <span>DEF</span>
</p>

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>ABC</mi>
  <mi/>
  <mi/>
  <mi/>
  <mi>DEF</mi>
</math>
"""

html_small = minify_html.minify(html, keep_closing_tags=True)
print(html_small)

results in the following HTML (added linefeeds are mine):

<p>
<span>ABC</span>
<span>
<span>
<span>
<span>
<span>
<span>DEF</span>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>ABC</mi>
<mi>
<mi>
<mi>
<mi>DEF</mi>

which is interpreted by a browser (Firefox) as follows:

<p>
  <span>ABC</span>
  <span>
    <span>
      <span>
        <span>
          <span>
            <span>DEF</span>
            <math xmlns="http://www.w3.org/1998/Math/MathML">
              <mi>ABC</mi>
              <mi>
                <mi>
                  <mi>
                    <mi>DEF</mi>
                  </mi>
                </mi>
              </mi>
            </math>
          </span>
        </span>
      </span>
    </span>
  </span>
</p>
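
The same damage can be detected without a browser. The following is a minimal sketch using only the standard library's html.parser; TagBalance and unclosed_tags are hypothetical helper names for illustration, not part of minify_html. It counts start tags that never receive a matching end tag, treating self-closing syntax as balanced.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never take an end tag, so they are excluded from the tally.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class TagBalance(HTMLParser):
    """Tallies start tags minus end tags for each element name."""

    def __init__(self):
        super().__init__()
        self.balance = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.balance[tag] += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <span/> opens and closes in one token,
        # so it leaves the balance unchanged.
        pass

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.balance[tag] -= 1


def unclosed_tags(html):
    """Return {tag: count} for elements with more start tags than end tags."""
    parser = TagBalance()
    parser.feed(html)
    parser.close()
    return {tag: n for tag, n in parser.balance.items() if n > 0}
```

Applied to the original input, this sketch reports no unclosed elements. Applied to the minified output exactly as shown above, it reports five unclosed span and three unclosed mi elements, plus the p and math start tags, which have no end tags in that output. It only counts tags and does not build a tree, so it is a quick regression check rather than a substitute for a real parser.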

Self-closing <p/> elements pose a similar issue. Many browsers implicitly close an open <p> element when the next <p> starts, because <p> is a block-level element, but many parsers (such as lxml) do not, so a similar cascade of misnested <p> elements occurs there too.

We are able to work around it as follows:

import re

html = re.sub(
    r"<([^\s>]+)([^>]*)/>",
    r"<\1\2></\1>",
    html,
    flags=re.DOTALL,
)
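
One caveat with the workaround above is that it also rewrites void elements, so <br/> becomes <br></br>, and a spec-compliant HTML parser treats the stray </br> as a second line break. The following is a hedged variant of the same regex that leaves void elements alone; expand_self_closing and VOID_ELEMENTS are hypothetical names for illustration. Like the original, it is a text-level substitution and does not understand comments, CDATA sections, or attribute values that contain ">".

```python
import re

# Elements that HTML defines as void: they have no content and no end tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Same pattern as the workaround: tag name, then any attributes, then "/>".
# re.DOTALL is omitted because the pattern contains no "." to affect.
SELF_CLOSING = re.compile(r"<([^\s>]+)([^>]*)/>")


def expand_self_closing(html: str) -> str:
    """Rewrite <x .../> as <x ...></x>, except for void elements."""

    def replace(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            # Leave <br/>, <img .../> and similar tags untouched.
            return match.group(0)
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSING.sub(replace, html)
```

The expanded string would then be passed to minify_html.minify in place of the raw input, exactly as with the original workaround.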

A proper fix would be better, though (and more efficient, as we process tens of thousands of HTML files at a time). Either self-closing tags should stay self-closed by default (it costs only one more character), or they should be preserved when keep_closing_tags=True (the setting used when working with a downstream parser that expects predominantly well-formed HTML).
