lxml doesn’t like control characters

Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.

Each of these triggers the exception below:

```
html5lib.parse('&#1;', treebuilder='lxml')
html5lib.parse('\x01', treebuilder='lxml')
html5lib.parse('', treebuilder='lxml')
html5lib.parse('', treebuilder='lxml')
```

```
Traceback (most recent call last):
  File "/tmp/a.py", line 4, in <module>
    html5lib.parse('&#1;', treebuilder='lxml')
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 28, in parse
    return p.parse(doc, encoding=encoding)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 224, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 183, in mainLoop
    new_token = phase.processCharacters(new_token)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py", line 991, in processCharacters
    self.tree.insertText(token["data"])
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py", line 320, in insertText
    parent.insertText(data)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py", line 240, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py", line 108, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)
  File "apihelpers.pxi", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)
  File "apihelpers.pxi", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
```

U+000C in text (but not in attribute values) is replaced by U+0020 with a warning:

```
DataLossWarning: Text cannot contain U+000C
```

libxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.
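
To make the two options concrete, here is a rough sketch of filtering these characters out of text before it is handed to lxml. The function name `drop_c0_controls` and its `replacement` parameter are hypothetical, not part of html5lib or lxml; the character class is just the list of code points from the top of this report.

```
import re

# Hypothetical helper, not an html5lib or lxml API.
# Matches U+0001 to U+0008, U+000B, U+000C and U+000E to U+001F,
# i.e. the C0 controls listed in this report (tab, LF and CR are kept).
_C0_CONTROLS = re.compile('[\x01-\x08\x0b\x0c\x0e-\x1f]')


def drop_c0_controls(data, replacement=''):
    """Return data with the listed control characters replaced.

    The default replacement of '' mimics libxml2's behaviour of dropping
    the characters; pass ' ' to mimic the existing U+000C handling.
    """
    return _C0_CONTROLS.sub(replacement, data)
```

With the default, `drop_c0_controls('a\x01b')` gives `'ab'`; with `replacement=' '` the same input gives `'a b'`. Whether a warning such as `DataLossWarning` should accompany the replacement is left out of this sketch.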
