The error is shown as "a bytes-like object is required, not str", but this is misleading, because the caller's actual mistake was passing a bytes object where a str was expected.
Honestly not sure what the pythonic way to deal with this is.
Explicit assert isinstance type checking?
Type annotations, and hope the user is running in an environment that will type check before they hit this exception?
…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
47 XXX: mostly copy-pasted from parsel.selector.create_root_node
48 """
---> 49 body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
50 parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
51 root = lxml.etree.fromstring(body, parser=parser)
TypeError: a bytes-like object is required, not 'str'
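For reference, the same message can be reproduced outside html-text. This is only a minimal sketch, assuming Python 3; the sample markup is made up and stands in for whatever bytes the caller passed.

```python
# Hypothetical bytes input standing in for what the caller passed.
html = b"<p>hello</p>"

try:
    # Same chain as line 49 of the traceback: bytes.replace() is called
    # with str arguments, so the complaint is about the arguments,
    # not about the html value itself.
    html.strip().replace('\x00', '').encode('utf8')
except TypeError as exc:
    message = str(exc)

print(message)
```

The message describes the arguments to `replace`, which is why it reads backwards from the caller's point of view.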
I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.
Yeah, it's the .replace method of the bytes object that raises this error, and that is confusing for the user. For html-text, an explicit type check in extract_text seems like a good usability improvement to me, but it should raise a TypeError rather than use an assert.
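A rough sketch of what such a check might look like. The helper name `check_html_type` and the wording of the message are assumptions for illustration, not html-text's actual code.

```python
def check_html_type(html):
    """Hypothetical helper: reject non-str input with a clear message."""
    if not isinstance(html, str):
        # TypeError rather than assert: asserts are stripped under `python -O`
        # and are meant for internal invariants, not for validating input.
        raise TypeError(
            "html must be a str, got {}; decode bytes before passing them in"
            .format(type(html).__name__)
        )
    return html


# Usage: str passes through unchanged, bytes fail early with a readable error.
check_html_type("<p>hello</p>")
try:
    check_html_type(b"<p>hello</p>")
except TypeError as exc:
    print(exc)
```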
I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.
That's also possible, but note that the input must then be UTF-8-encoded HTML, so a raw response body in a different encoding would not be handled correctly. Accepting only strings rules this problem out, and the time spent re-encoding seems small compared to the text extraction time. But maybe it's fine to support bytes if the error on non-UTF-8 HTML is not too obscure.
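If bytes were supported, one option would be to decode them as UTF-8 first and fail with an explicit message when that is not possible. This is only a sketch under that assumption; `to_utf8_body` is a hypothetical name, and only the final line is taken from the traceback above.

```python
def to_utf8_body(html):
    """Hypothetical helper: accept str or UTF-8 bytes, return a UTF-8 body."""
    if isinstance(html, bytes):
        try:
            html = html.decode('utf8')
        except UnicodeDecodeError as exc:
            # Make the non-UTF-8 case explicit instead of obscure.
            raise ValueError("bytes input must be UTF-8 encoded HTML") from exc
    elif not isinstance(html, str):
        raise TypeError(
            "html must be str or bytes, got {}".format(type(html).__name__)
        )
    # Same normalization as the line in the traceback.
    return html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
```

One caveat: bytes in another encoding that happen to be valid UTF-8 would still pass silently, so this only catches part of the problem.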