lxml doesn’t like control characters #96
Comments
Here's a workaround for anyone that needs to get things working before this bug is fixed. Just run this code over the html before sending it to html5lib:

    import re

    def remove_control_characters(html):
        def str_to_int(s, default, base=10):
            if int(s, base) < 0x10000:
                return unichr(int(s, base))
            return default
        html = re.sub(ur"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
        html = re.sub(ur"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
        html = re.sub(ur"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", html)
        return html
|
Pull request: #162 |
@SimonSapin these characters are not valid HTML, see https://en.wikipedia.org/wiki/Character_encodings_in_HTML#Illegal_characters |
@bradleyayers To support that claim, Wikipedia links to a document named "SGML Declaration of HTML 4" and published in 1999. The relevant specification is https://whatwg.org/html. Also, what do you mean exactly by "valid"? Conformance requirements are different for authors and implementations. |
By "valid" I mean "able to be represented in HTML". I'm saying it's not possible to represent U+0001 in HTML. It can't be represented by numeric character references (see https://html.spec.whatwg.org/multipage/syntax.html#character-references):
Nor can they be represented by encoding them (see https://html.spec.whatwg.org/multipage/syntax.html#preprocessing-the-input-stream):
|
For the reference for numeric character references, note the start of the section:
The actual parsing of a character reference (https://html.spec.whatwg.org/multipage/syntax.html#tokenizing-character-references) says:
Follow the cross-reference for parse error:
If you follow the error handling, note those characters are never replaced by anything else, and hence they end up in the DOM. The same is true for the ranges in the pre-processing. |
@gsnedders Do I understand this correctly that this is not a bug in html5lib (which just lets these characters through according to spec), but in lxml which does not expect those characters? |
@EmilStenstrom I’d say that’s debatable. https://html.spec.whatwg.org/multipage/#coercing-an-html-dom-into-an-infoset describes how to map to a more restricted XML API. The problem is raising exceptions rather than doing this coercion. |
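For anyone following along, here is a rough sketch of the name coercion that section of the spec describes: each character the XML API would reject in a local name is replaced by `U` plus the six-digit uppercase hex code point. The function name `coerce_local_name` and the ASCII-only "allowed" sets are assumptions made for brevity, not html5lib's actual code; the real XML Name productions allow many non-ASCII ranges too. Note this covers names only, not control characters in text.

```python
import re

# Simplified, ASCII-only stand-in for the XML name productions
# (an assumption for this sketch; the real sets are much larger).
_NAME_START = re.compile(r"[A-Za-z_]")
_NAME_CHAR = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_local_name(name):
    """Replace each unsupported character with 'U' + six uppercase hex digits."""
    out = []
    for i, ch in enumerate(name):
        allowed = _NAME_START if i == 0 else _NAME_CHAR
        if allowed.fullmatch(ch):
            out.append(ch)
        else:
            out.append("U%06X" % ord(ch))
    return "".join(out)


print(coerce_local_name("foo<bar"))  # fooU00003Cbar
```

The point is that the coercion is lossy but never raises, which is the behaviour being asked for here instead of an exception.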
@SimonSapin Doesn't the exception come from lxml rather than html5lib? |
It does, but html5lib should munge the data to avoid triggering this exception. |
I've got part of a PR done. As I understand it, we need to replace bad characters as follows:
The code for doing this should be in After that, the following should all work:
You can run that through Is that the behavior we're looking for to match correct HTML parsing? |
@willk That looks correct, except that |
Ok--so the fixed version of the doctest looks like this:
For some reason, when I run that, the element item doesn't pass. Looks like something is converting
I'll have to look into that more. |
Aha! In html5lib-python/html5lib/_ihatexml.py Line 278 in 85bc5fa
Assuming that's correct, then we should be running attribute names through
@gsnedders I'm a bit fuzzy on the appropriate semantics for HTML -> XML -> HTML. Does that look correct? |
So there are many more complex cases, primarily those outside of the BMP, especially once you start worrying about narrow/wide Python builds. There's also the big difference between XML 1.0 4th Edition and 5th Edition, which depending on version of These challenges are why I've never actually fixed this, because while there are easy fixes for the easy cases, the underlying problem is much wider. |
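To illustrate the narrow-build problem for characters outside the BMP, here is a small Python 3 sketch that simulates what a narrow build sees by looking at UTF-16 code units. The helper name `utf16_code_units` is made up for this example.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s, which is what a narrow build indexes."""
    data = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


units = utf16_code_units("\U0001F496")  # U+1F496, outside the BMP
print([hex(u) for u in units])  # ['0xd83d', '0xdc96']

# Both units fall in the surrogate range that XML's Char production excludes,
# so a naive per-unit check would reject a perfectly valid character.
print(all(0xD800 <= u <= 0xDFFF for u in units))  # True
```

So any check that walks a string unit by unit has to pair surrogates up first on a narrow build, which is part of why the easy fixes only cover the easy cases.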
I'm trying to take another stab at this right now, fixing this generally. |
@gsnedders If it helps, I threw my WIP in a branch here: https://github.com/willkg/html5lib-python/tree/96-control-characters |
I'm going to bump this out of the 1.0 milestone. @gsnedders If you can get to this before December 1st, I'm game for re-adding it. |
In case anyone needs to use @EmilStenstrom's code in Python 3, I just ported it:

    import re

    def remove_control_characters(html):
        # Operates on bytes, not str.
        def str_to_int(s, default, base=10):
            n = int(s, base)
            # Surrogates (U+D800-U+DFFF) cannot be encoded as UTF-8,
            # so those references are left untouched.
            if n < 0x10000 and not 0xD800 <= n <= 0xDFFF:
                return chr(n).encode()
            return default
        html = re.sub(br"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
        html = re.sub(br"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
        html = re.sub(br"[\x00-\x08\x0b\x0e-\x1f\x7f]", b"", html)
        return html
|
@lpla Ours have evolved after slowly correcting errors when parsing erroneously encoded text in hundreds of thousands of HTML e-mails. This is the current version we are using, compatible with both python 2 (narrow and wide builds) and python 3, and with type hints:

    import re

    def remove_control_characters(html):
        # type: (t.Text) -> t.Text
        """
        Strip invalid XML characters that `lxml` cannot parse.
        """
        # See: https://github.com/html5lib/html5lib-python/issues/96
        #
        # The XML 1.0 spec defines the valid character range as:
        # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        #
        # We can instead match the invalid characters by inverting that range into:
        # InvalidChar ::= #xb | #xc | #xFFFE | #xFFFF | [#x0-#x8] | [#xe-#x1F] | [#xD800-#xDFFF]
        #
        # Sources:
        # https://www.w3.org/TR/REC-xml/#charsets,
        # https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
        def strip_illegal_xml_characters(s, default, base=10):
            # Compare the "invalid XML character range" numerically
            n = int(s, base)
            if n in (0xb, 0xc, 0xFFFE, 0xFFFF) or 0x0 <= n <= 0x8 or 0xe <= n <= 0x1F or 0xD800 <= n <= 0xDFFF:
                return ""
            return default

        # We encode all non-ascii characters to XML char-refs, so for example "💖" becomes: "&#128150;"
        # Otherwise we'd remove emojis by mistake on narrow-unicode builds of Python
        html = html.encode("ascii", "xmlcharrefreplace").decode("utf-8")
        html = re.sub(r"&#(\d+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0)), html)
        html = re.sub(r"&#[xX]([0-9a-fA-F]+);?", lambda c: strip_illegal_xml_characters(c.group(1), c.group(0), base=16), html)
        html = ILLEGAL_XML_CHARS_RE.sub("", html)
        return html

    # A regex matching the "invalid XML character range"
    ILLEGAL_XML_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]")
|
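As a sanity check on the inverted range in the comments above, here is a short Python 3 sketch that brute-forces every code point and confirms `InvalidChar` is the exact complement of `Char`. The helper names `is_xml_char` and `is_invalid_char` are just for this example.

```python
def is_xml_char(n):
    """XML 1.0 Char production, as quoted in the comment above."""
    return (n in (0x9, 0xA, 0xD) or 0x20 <= n <= 0xD7FF
            or 0xE000 <= n <= 0xFFFD or 0x10000 <= n <= 0x10FFFF)


def is_invalid_char(n):
    """The inverted range, as quoted in the comment above."""
    return (n in (0xB, 0xC, 0xFFFE, 0xFFFF) or 0x0 <= n <= 0x8
            or 0xE <= n <= 0x1F or 0xD800 <= n <= 0xDFFF)


# A code point should be in exactly one of the two sets.
mismatches = [n for n in range(0x110000) if is_xml_char(n) == is_invalid_char(n)]
print(mismatches)  # []

# The xmlcharrefreplace step turns a non-BMP character into a decimal reference.
print("\U0001F496".encode("ascii", "xmlcharrefreplace"))  # b'&#128150;'
```

One thing this makes visible: U+007F is a valid `Char`, so this version keeps it, whereas the earlier workaround's regex removed it.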
Any license for this code? |
I hereby release it as public domain. |
At this point, the only Python release we support narrow builds on is 2.7; all versions of Py3 we support are always wide. This, to be fair, makes this a lot easier to fix, so we should probably take a stab at this soon. |
How do narrow vs. wide builds affect lxml/libxml2 being peculiar about control characters? |
@SimonSapin they don't (the string is converted to UTF-8 before being passed to libxml2 IIRC), but they do affect our ability to detect what strings will trigger it (given we can't just iterate through a string and compare the iterable values to the production in XML, either ourselves or with a regex); the only complexity is whether libxml2 is enforcing XML 4e or 5e |
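On Python 3, where a `str` is always a sequence of code points, the detection side could look roughly like the sketch below. It assumes the XML 1.0 `Char` production quoted earlier in the thread is what needs enforcing; whether that matches what libxml2 actually checks is exactly the open question. `NOT_XML_CHAR_RE` and `find_invalid_xml_chars` are names invented for this example.

```python
import re

# Negation of the XML 1.0 Char production. On Python 3 astral characters
# are single code points, so they are matched as one unit.
NOT_XML_CHAR_RE = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, 'U+XXXX') pairs for characters outside Char."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in NOT_XML_CHAR_RE.finditer(text)]


print(find_invalid_xml_chars("ok\x01 \U0001F496 \x0c"))
# [(2, 'U+0001'), (6, 'U+000C')]
```

The emoji passes untouched here, which is the part that cannot be done with a plain regex on a narrow build.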
@gsnedders how are you doing? Are we going to try to tackle the issue or not? |
Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
Each of these triggers the exception below:
U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:
libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
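A minimal sketch of the two options mentioned above (replace with nothing, or replace with U+0020), limited to the C0 characters listed in this issue. The function names are made up for illustration; this is not html5lib's or libxml2's code.

```python
import re

# The characters listed in this issue:
# U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.
C0_CONTROLS_RE = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def strip_c0_controls(text):
    """Replace the listed control characters with nothing."""
    return C0_CONTROLS_RE.sub("", text)


def space_c0_controls(text):
    """Replace the listed control characters with U+0020 instead."""
    return C0_CONTROLS_RE.sub(" ", text)


sample = "a\x01b\x0cc\td"
print(repr(strip_c0_controls(sample)))  # 'abc\td'
print(repr(space_c0_controls(sample)))  # 'a b c\td'
```

Tab, line feed and carriage return are left alone in both variants, since lxml accepts those.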