Given the same HTML code, here is what different parsers see:
=== HTML ===
<li>
one
<div>
</li>
<li>
two
</li>
=== parsel (lxml) (marginal interpretation) ===
<html><body><li>
one
<div>
<li>
two
</li></div></li></body></html>
=== html.parser ===
<li>
one
<div>
</div>
</li>
<li>
two
</li>
=== lxml (same problem as parsel of course) ===
<html>
<body>
<li>
one
<div>
<li>
two
</li>
</div>
</li>
</body>
</html>
=== html5lib (Parses pages the same way a web browser does) ===
<html>
<head>
</head>
<body>
<li>
one
<div>
</div>
</li>
<li>
two
</li>
</body>
</html>
It is very annoying when a document is parsed differently from how a web browser parses it. It would be a good addition to provide a way to use a parser other than lxml.
#!/usr/bin/env python
from parsel import Selector
from bs4 import BeautifulSoup
print('=== HTML ===')
html = '''<li>
one
<div>
</li>
<li>
two
</li>'''
print(html)
print('=== parsel (lxml) (marginal interpretation) ===')
sel = Selector(text=html)
print(sel.extract())
print('=== html.parser ===')
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
print('=== lxml (same problem as parsel of course) ===')
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print('=== html5lib (Parses pages the same way a web browser does) ===')
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())
See related #54 which adds a parser_cls attribute to customize the parser.
Note that scrapy/parsel favors speed (lxml) over browser-parsing compliance: html5lib is still much slower than lxml (as far as I know; I haven't checked recently).
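For anyone who wants to check the speed difference on their own machine, a rough timing sketch follows. It assumes `beautifulsoup4`, `lxml`, and `html5lib` are installed; the sample document and the `time_parser` helper are made up for illustration, and the numbers will vary with input size and library versions.

```python
import timeit

from bs4 import BeautifulSoup

# Synthetic, well-formed sample document (an assumption, not real-world data).
html = "<ul>" + "<li>item <b>text</b></li>" * 500 + "</ul>"


def time_parser(name: str, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` single parses."""
    return min(timeit.repeat(lambda: BeautifulSoup(html, name), number=1, repeat=repeats))


timings = {name: time_parser(name) for name in ("lxml", "html5lib")}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Both measurements go through BeautifulSoup, so they include its tree-building overhead; they give a relative comparison between the two parsers, not a measure of parsel's own speed.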