extract_text fails with misleading error message when given bytes instead of unicode [py3] #26

keturn · 2020-02-10T01:22:00Z

The error is reported as "a bytes-like object is required, not 'str'", which is misleading: the caller's actual mistake was passing a bytes object in the first place.

Honestly not sure what the pythonic way to deal with this is.

An explicit assert isinstance type check?
Type annotations, and hope the user is running in an environment that will type check before they hit this exception?

html_text.extract_text(b'<html><body><p>Hello,   World!</p></body></html>')

…/python3.7/site-packages/html_text/html_text.py in parse_html(html)
     47     XXX: mostly copy-pasted from parsel.selector.create_root_node
     48     """
---> 49     body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
     50     parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
     51     root = lxml.etree.fromstring(body, parser=parser)

TypeError: a bytes-like object is required, not 'str'

I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.
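The message in the traceback can be reproduced without html_text at all, which may help show why it reads backwards. This is only a minimal sketch: the helper name replace_error_message is made up for illustration, and it simply mirrors the .replace('\x00', '') call from the quoted line.

```python
def replace_error_message(value):
    """Call value.replace() with str arguments, as the quoted line does.

    Returns the TypeError message if one is raised, otherwise None.
    (Hypothetical helper, not part of html_text.)
    """
    try:
        value.replace('\x00', '')
    except TypeError as exc:
        return str(exc)
    return None


# bytes.replace() expects bytes arguments, so it is the str arguments
# that get blamed in the message, not the bytes input.
print(replace_error_message(b'<p>Hello</p>'))

# With a str input the same call succeeds and no message is produced.
print(replace_error_message('<p>Hello</p>'))
```

So the complaint is about the '\x00' and '' arguments inside parse_html, not about what the caller passed in.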


lopuhin · 2020-02-10T07:38:45Z

Yeah, it's the .replace method of the bytestring that raises this error, and it is confusing for the user. For html-text, an explicit type check in extract_text seems like a good usability improvement to me, but raising TypeError rather than using an assert.
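A rough sketch of what such a check could look like, assuming it runs before any parsing. The function name check_html_type and the wording of the message are hypothetical, not what html-text actually ships:

```python
def check_html_type(html):
    """Reject non-str input early with an explicit TypeError.

    Hypothetical helper: the name and message are illustrative only.
    """
    if not isinstance(html, str):
        raise TypeError(
            "html must be str, got {}; decode bytes before passing them in"
            .format(type(html).__name__)
        )
    return html


try:
    check_html_type(b'<html><body><p>Hello,   World!</p></body></html>')
except TypeError as exc:
    print(exc)
```

Unlike an assert, the raise is not stripped when Python runs with -O, and TypeError is the conventional exception for an argument of the wrong type.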

lopuhin · 2020-02-10T07:42:22Z

> I guess that, for this specific line, its whole goal is to convert a string to a bytes object, so parse_html could skip that line if html is already bytes.

That's also possible, but note that the bytes must be UTF-8-encoded HTML, so a raw response body in a different encoding would not be parsed correctly. Accepting only strings rules out that error, and the time spent re-encoding seems small compared to the text extraction time. But maybe it's fine to support bytes if the error on non-UTF-8 HTML is not too obscure.
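For comparison, here is one way the bytes-accepting variant might be sketched, with a strict UTF-8 check so that the failure is at least readable. The helper name to_utf8_body and the error messages are assumptions for illustration; only the str branch follows the line quoted in the traceback.

```python
def to_utf8_body(html):
    """Return UTF-8 bytes suitable for the parser, from str or bytes input.

    Hypothetical helper, not part of html_text.
    """
    if isinstance(html, bytes):
        try:
            # Validate only; the decoded result is discarded.
            html.decode('utf8')
        except UnicodeDecodeError as exc:
            raise ValueError(
                "bytes input must be UTF-8 encoded HTML; "
                "decode it to str with the correct encoding first"
            ) from exc
        body = html.strip().replace(b'\x00', b'')
    elif isinstance(html, str):
        # Same steps as the line shown in the traceback.
        body = html.strip().replace('\x00', '').encode('utf8')
    else:
        raise TypeError(
            "html must be str or bytes, got {}".format(type(html).__name__)
        )
    return body or b'<html/>'
```

The check is not airtight: bytes in another encoding that happen to be valid UTF-8 (plain ASCII, for example) pass through unnoticed, which is the risk that accepting only strings avoids.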
