Content after null byte is dropped #123
Comments
It looks like the right thing to do:
In [3]: html = '<html>\x00<body>hello!</body></html>'
In [4]: html
Out[4]: '<html>\x00<body>hello!</body></html>'
In [5]: print(html)
<html><body>hello!</body></html>
In [6]: from lxml.html import etree
In [7]: etree.fromstring(html)
Traceback (most recent call last):
File "/Users/kmike/envs/deepdeep/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-aea2e2c2317e>", line 1, in <module>
etree.fromstring(html)
File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1758, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
XMLSyntaxError: Premature end of data in tag html line 1, line 1, column 7
and
from parsel import Selector
In [15]: sel = Selector(html)
In [16]: html
Out[16]: '<html>\x00<body>hello!</body></html>'
In [17]: sel
Out[17]: <Selector xpath=None data='<html></html>'>
In [18]: sel = Selector(html.replace('\x00', ''))
In [19]: sel
Out[19]: <Selector xpath=None data='<html><body>hello!</body></html>'>
Thanks @elacuesta, indeed this is fixed by #124 thanks to @peonone and @kmike, closing.
For some specific URLs, there is a null byte (\x00) inside the response body, and all content after it gets dropped in the lxml element tree.
How about removing the null byte before sending the body to lxml? Then we would no longer need to add this logic in every project.
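The proposal above can be sketched as a small pre-processing step. This is only an illustration of the workaround shown in the session (`html.replace('\x00', '')`), not the change that was actually merged in #124; the helper name `strip_null_bytes` is hypothetical.

```python
def strip_null_bytes(text: str) -> str:
    """Return `text` with every NUL character (\\x00) removed.

    Hypothetical helper: it mirrors the manual workaround from the
    session above and is not part of parsel or lxml.
    """
    return text.replace("\x00", "")


# Same sample document as in the session above.
html = "<html>\x00<body>hello!</body></html>"
cleaned = strip_null_bytes(html)
```

The cleaned string can then be passed to the parser as usual, for example `Selector(text=cleaned)`, so that the content after the original null byte is no longer dropped.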