lxml doesn’t like control characters #96

Open

SimonSapin opened this issue Jul 22, 2013 · 26 comments

@SimonSapin
Contributor

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.

Each of these triggers the exception below:

html5lib.parse('<p>&#1;', treebuilder='lxml')
html5lib.parse('<p>\x01', treebuilder='lxml')
html5lib.parse('<p id="&#1;">', treebuilder='lxml')
html5lib.parse('<p id="\x01">', treebuilder='lxml')
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('<p>&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

DataLossWarning: Text cannot contain U+000C

libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
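
For reference, the ValueError can be reproduced with lxml alone, which suggests the restriction lives in lxml/libxml2's text handling rather than in html5lib's tokenizer (a minimal sketch, assuming any reasonably recent lxml):

import lxml.etree as etree

el = etree.Element("p")
el.text = "\x01"
# Expected to raise:
# ValueError: All strings must be XML compatible: Unicode or ASCII,
# no NULL bytes or control characters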

@EmilStenstrom

Here's a workaround for anyone who needs to get things working before this bug is fixed. Just run this code over the HTML before sending it to html5lib:

import re
def remove_control_characters(html):
    def str_to_int(s, default, base=10):
        if int(s, base) < 0x10000:
            return unichr(int(s, base))
        return default
    html = re.sub(ur"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(ur"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    html = re.sub(ur"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", html)
    return html
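
For illustration (an added usage sketch, not part of the original comment; this is Python 2 code, since it relies on unichr and ur"" literals):

cleaned = remove_control_characters(u'<p>&#1;\x01</p>')
# expected: u'<p></p>' -- the &#1; reference is first decoded to U+0001,
# then stripped along with the literal control character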

lastorset pushed a commit to lastorset/html5lib-python that referenced this issue Jun 6, 2014
@lastorset

Pull request: #162

@bradleyayers

@SimonSapin these characters are not valid HTML, see https://en.wikipedia.org/wiki/Character_encodings_in_HTML#Illegal_characters

@SimonSapin
Contributor Author

@bradleyayers To support that claim, Wikipedia links to a document named "SGML Declaration of HTML 4" and published in 1999. The relevant specification is https://whatwg.org/html. Also, what do you mean exactly by "valid"? Conformance requirements are different for authors and implementations.

@bradleyayers

By "valid" I mean "able to be represented in HTML". I'm saying it's not possible to represent U+0001 in HTML.

It can't be represented by numeric character references (see https://html.spec.whatwg.org/multipage/syntax.html#character-references):

The numeric character reference forms described above are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.

Nor can it be represented by encoding it directly (see https://html.spec.whatwg.org/multipage/syntax.html#preprocessing-the-input-stream):

Any occurrences of any characters in the ranges U+0001 to U+0008, …snip… are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).

@gsnedders
Member

As for the reference about numeric character references, note the start of that section:

This section only applies to documents, authoring tools, and markup generators. In particular, it does not apply to conformance checkers; conformance checkers must use the requirements given in the next section ("parsing HTML documents").

The actual parsing of a character reference (https://html.spec.whatwg.org/multipage/syntax.html#tokenizing-character-references) says:

Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.

Follow the cross-reference for parse error:

Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

If you follow the error handling, note those characters are never replaced by anything else, and hence they end up in the DOM. The same is true for the ranges in the pre-processing.
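
A quick way to see this in html5lib itself: with the default pure-Python etree treebuilder the character survives into the tree, and only the lxml treebuilder raises (a sketch added for illustration; the tree access assumes the default namespaced etree output):

import html5lib

# Default (xml.etree-based) treebuilder: no exception, and the control
# character ends up in the tree, matching the error handling described above.
doc = html5lib.parse('<p>&#1;</p>')
p = doc.find('.//{http://www.w3.org/1999/xhtml}p')
print(repr(p.text))   # expected: '\x01'

# The lxml treebuilder is the one that raises (see the traceback above):
# html5lib.parse('<p>&#1;</p>', treebuilder='lxml')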

@EmilStenstrom

@gsnedders Do I understand this correctly that this is not a bug in html5lib (which just lets these characters through according to spec), but in lxml which does not expect those characters?

@SimonSapin
Contributor Author

@EmilStenstrom I’d say that’s debatable. https://html.spec.whatwg.org/multipage/#coercing-an-html-dom-into-an-infoset describes how to map to a more restricted XML API. The problem is raising exceptions rather than doing this coercion.

@EmilStenstrom

@SimonSapin Doesn't the exception come from lxml rather than html5lib?

@SimonSapin
Contributor Author

It does, but html5lib should munge the data to avoid triggering this exception.

@willkg
Contributor

willkg commented Nov 10, 2017

I've got part of a PR done. As I understand it, we need to replace bad characters as follows:

  1. in elements -> "Uxxxx"
  2. in attribute names -> "Uxxxx"
  3. in attribute values -> "\uFFFD"
  4. in characters -> "\uFFFD"
  5. in comments -> "\uFFFD"

The code for doing this should be in html5lib/_ihatexml:InfosetFilter.

After that, the following should all work:

"""
>>> from html5lib import parse, serialize
>>> serialize(parse('<p>\x01</p>', treebuilder='lxml'), tree='lxml')
u'<p>\ufffd'
>>> serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
u'<pU1 id=foo></pU1>'
>>> serialize(parse('<p i\x01d="foo">', treebuilder='lxml'), tree='lxml')
u'<p iU1d=foo>'
>>> serialize(parse('<p id="\x01">', treebuilder='lxml'), tree='lxml')
u'<p id=\ufffd>'
>>> serialize(parse('<!-- \x01 is the best -->', treebuilder='lxml'), tree='lxml')
u'<!-- \ufffd is the best -->'
"""

You can run that through python -m doctest.

Is that the behavior we're looking for to match correct HTML parsing?

@SimonSapin
Contributor Author

@willkg That looks correct, except that U + hexadecimal should be zero-padded to six digits according to https://html.spec.whatwg.org/multipage/parsing.html#coercing-an-html-dom-into-an-infoset, for example foo<bar to fooU00003Cbar.
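
In other words (an illustrative helper added here, not html5lib's actual code):

def escape_name_char(ch):
    # "U" followed by the code point, zero-padded to six hex digits
    return "U%06X" % ord(ch)

escape_name_char("\x01")  # 'U000001'
escape_name_char("<")     # 'U00003C', so foo<bar becomes fooU00003Cbar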

@willkg
Contributor

willkg commented Nov 10, 2017

Ok--so the fixed version of the doctest looks like this:

"""
>>> from html5lib import parse, serialize
>>> serialize(parse('<p>\x01</p>', treebuilder='lxml'), tree='lxml')
u'<p>\ufffd'
>>> serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
u'<pU00001 id=foo></pU00001>'
>>> serialize(parse('<p i\x01d="foo">', treebuilder='lxml'), tree='lxml')
u'<p iU000001d=foo>'
>>> serialize(parse('<p id="\x01">', treebuilder='lxml'), tree='lxml')
u'<p id=\ufffd>'
>>> serialize(parse('<!-- \x01 is the best -->', treebuilder='lxml'), tree='lxml')
u'<!-- \ufffd is the best -->'
"""

For some reason, when I run that, the element item doesn't pass. Looks like something is converting U000001 back to \x001 after InfosetFilter has fixed things. Here's the output I'm getting:

$ python -m doctest code_96.py 
html5lib/_ihatexml.py:231: DataLossWarning: Text cannot contain U+000001
  warnings.warn("Text cannot contain U+%06X" % ord(badchar), DataLossWarning)
**********************************************************************
File "code_96.py", line 7, in code_96
Failed example:
    serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
Expected:
    u'<pU00001 id=foo></pU00001>'
Got:
    u'<p\x001 id=foo></p\x001>'
**********************************************************************
1 items had failures:
   1 of   6 in code_96
***Test Failed*** 1 failures.

I'll have to look into that more.

@willkg
Contributor

willkg commented Nov 11, 2017

Aha! In InfosetFilter.fromXmlName, it runs the element name through replacementRegexp which converts the U00001 to \x001.

for item in set(self.replacementRegexp.findall(name)):

Assuming that's correct, then we should be running attribute names through fromXmlName, too. So then we end up with this:

"""
>>> from html5lib import parse, serialize
>>> serialize(parse('<p>\x01</p>', treebuilder='lxml'), tree='lxml')
u'<p>\ufffd'
>>> serialize(parse('<p\x01 id="foo">', treebuilder='lxml'), tree='lxml')
u'<p\x001 id=foo></pU00001>'
>>> serialize(parse('<p i\x01d="foo">', treebuilder='lxml'), tree='lxml')
u'<p i\x001d=foo>'
>>> serialize(parse('<p id="\x01">', treebuilder='lxml'), tree='lxml')
u'<p id=\ufffd>'
>>> serialize(parse('<!-- \x01 is the best -->', treebuilder='lxml'), tree='lxml')
u'<!-- \ufffd is the best -->'
"""

@gsnedders I'm a bit fuzzy on the appropriate semantics for HTML -> XML -> HTML. Does that look correct?

@gsnedders
Member

So there are many more complex cases, primarily those outside of the BMP, especially once you start worrying about narrow/wide Python builds.

There's also the big difference between XML 1.0 4th Edition and 5th Edition, which applies depending on the version of libxml2 that lxml is linked against (or compiled against? not sure which); the change is described in E09 of https://www.w3.org/XML/xml-V10-4e-errata#E09 (IMO it should never have happened, but that ship has sailed). It notably changes the definition of Name.

These challenges are why I've never actually fixed this: while there are easy fixes for the easy cases, the underlying problem is much wider.

gsnedders self-assigned this Nov 13, 2017
@gsnedders
Member

I'm trying to take another stab at this right now, fixing this generally.

@willkg
Contributor

willkg commented Nov 13, 2017

@gsnedders If it helps, I threw my WIP in a branch here: https://github.com/willkg/html5lib-python/tree/96-control-characters

@willkg
Contributor

willkg commented Nov 27, 2017

I'm going to bump this out of the 1.0 milestone.

@gsnedders If you can get to this before December 1st, I'm game for re-adding it.

willkg removed this from the 1.0 milestone Nov 27, 2017
@lpla

lpla commented May 7, 2020

In case anyone needs to use @EmilStenstrom's code in Python 3, I just ported it:

import re

def remove_control_characters(html):
    # Note: this port operates on bytes, not str.
    def str_to_int(s, default, base=10):
        if int(s, base) < 0x10000:
            return chr(int(s, base)).encode()
        return default
    html = re.sub(br"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(br"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    html = re.sub(br"[\x00-\x08\x0b\x0e-\x1f\x7f]", b"", html)
    return html
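
For example (an added usage sketch, not from the original comment; note that input and output are bytes here):

cleaned = remove_control_characters(b'<p>&#1;\x01</p>')
# expected: b'<p></p>'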

@EmilStenstrom

EmilStenstrom commented May 7, 2020

@lpla Ours has evolved after slowly correcting errors while parsing erroneously encoded text in hundreds of thousands of HTML e-mails. This is the current version we are using, compatible with both Python 2 (narrow and wide builds) and Python 3, and with type hints:

import re
import typing as t  # assumed import for the `t.Text` type comment below


def remove_control_characters(html):
    # type: (t.Text) -> t.Text
    """
    Strip invalid XML characters that `lxml` cannot parse.
    """
    # See: https://github.com/html5lib/html5lib-python/issues/96
    #
    # The XML 1.0 spec defines the valid character range as:
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    #
    # We can instead match the invalid characters by inverting that range into:
    # InvalidChar ::= #xb | #xc | #xFFFE | #xFFFF | [#x0-#x8] | [#xe-#x1F] | [#xD800-#xDFFF]
    #
    # Sources:
    # https://www.w3.org/TR/REC-xml/#charsets,
    # https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
    def strip_illegal_xml_characters(s, default, base=10):
        # Compare the "invalid XML character range" numerically
        n = int(s, base)
        if n in (0xb, 0xc, 0xFFFE, 0xFFFF) or 0x0 <= n <= 0x8 or 0xe <= n <= 0x1F or 0xD800 <= n <= 0xDFFF:
            return ""
        return default

    # We encode all non-ASCII characters to XML character references, so for example "💖" becomes "&#128150;".
    # Otherwise we'd remove emojis by mistake on narrow-unicode builds of Python.
    html = html.encode("ascii", "xmlcharrefreplace").decode("utf-8")
    html = re.sub(r"&#(\d+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0)), html)
    html = re.sub(r"&#[xX]([0-9a-fA-F]+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0), base=16), html)
    html = ILLEGAL_XML_CHARS_RE.sub("", html)
    return html


# A regex matching the "invalid XML character range"
ILLEGAL_XML_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]")
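
For example (an added usage sketch, not from the original comment; output shown for Python 3):

remove_control_characters(u'<p>\x01&#11;\U0001F496</p>')
# expected: u'<p>&#128150;</p>' -- the literal control character and the &#11;
# reference are both stripped, while the emoji survives as a character reference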

@lpla

lpla commented May 7, 2020

Any license for this code?

@EmilStenstrom

Any license for this code?

I hereby release it as public domain.

@gsnedders
Member

At this point, the only Python release we support narrow builds on is 2.7; all versions of Py3 we support are always wide. This, to be fair, makes this a lot easier to fix, so we should probably take a stab at this soon.

@SimonSapin
Contributor Author

How do narrow vs. wide builds affect lxml/libxml2 being peculiar about control characters?

@gsnedders
Member

gsnedders commented Jun 17, 2020

@SimonSapin they don't (the string is converted to UTF-8 before being passed to libxml2, IIRC), but they do affect our ability to detect which strings will trigger it (given we can't just iterate through a string and compare the iterated values to the relevant production in XML, either ourselves or with a regex); the only other complexity is whether libxml2 is enforcing XML 1.0 4th Edition or 5th Edition.

@sshishov

@gsnedders how are you doing? Are we going to try to tackle the issue or not?
