-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python 3.10 #959
Python 3.10 #959
Conversation
inbox/util/html.py
Outdated
try: | ||
s.feed(html) | ||
except AssertionError as e: | ||
if e.args[0].startswith("unknown status keyword"): | ||
raise HTMLParseError(*e.args) from e | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, why was this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python parser in Python 3.10 started throwing AssertionErrors on some invalid HTML, those were previously HTMLParserError. I think it's opinionated weather it should be AssertionErrors , especially when treating with user supplied data (which emails are). To me AssertionErrors are programmer errors, not user errors. We actually have a test for those, meaning that at some point Nylas people hit this kind of HTML that would now trigger AssertionError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see the failure here.
def test_strip_tags():
text = (
"<div><script> AAH JAVASCRIPT</script><style> AAH CSS AHH</style>"
'check out this <a href="http://example.com/">link</a> yo!</div>'
)
assert strip_tags(text).strip() == "check out this link yo!"
# MS Word conditional marked section
text = "<![if word]>content<![endif]>"
assert strip_tags(text).strip() == "content"
# Unknown marked section
text = """<![FOR]>"""
with pytest.raises(HTMLParseError):
> strip_tags(text)
...
def parse_marked_section(self, i, report=1):
rawdata= self.rawdata
assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
sectName, j = self._scan_name( i+3, i )
if j < 0:
return j
if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
# look for standard ]]> ending
match= _markedsectionclose.search(rawdata, i+3)
elif sectName in {"if", "else", "endif"}:
# look for MS Office ]> ending
match= _msmarkedsectionclose.search(rawdata, i+3)
else:
> raise AssertionError(
'unknown status keyword %r in marked section' % rawdata[i+3:j]
)
E AssertionError: unknown status keyword 'FOR' in marked section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wojcikstefan should I add clarification comment maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining, makes sense 👍 Yea, I think it's worth adding a comment – it might not be clear at a glance why we don't work directly with the AssertionError
further up the stack.
Are there other invalid HTML errors that will still raise an AssertionError
? Should we re-raise all s.feed(html)
assertion errors as HTML parser errors? Or does s.feed
do more than just parse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally if those would become AssertionErrors we would not catch them here
def calculate_html_snippet(self, text: str) -> str:
try:
text = strip_tags(text)
except HTMLParseError:
log.error(
"error stripping tags", message_nylas_uid=self.nylas_uid, exc_info=True
)
text = ""
return self.calculate_plaintext_snippet(text)
meaning that we would discard the entire email, instead of discarding just the snippet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We technically could replace except HTMLParserError
with except AssertionError
though, right? Or are there other AssertionErrors
being thrown that we should definitely NOT catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wojcikstefan I did a deep dive, found the change in CPython that causes this:
As you can see they have removed the error callback, we were using the error callback to raise our exception. But that method was long deprecated. I still think the choice to raise AssertionError is the wrong choice on CPython developers, I think a subclass of ValueError would be better. So I agree we should catch all the AssertionErrors.
There are some AssertionErrors that are programming errors in the source of the parser, like this one, but they should not happen.
Thanks for catching this. Will follow with a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wojcikstefan Pushed a fix in 4ae8c49
(#959). Assigning you as a reviewer as well because you spent a lot of time looking at this already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Do you think it's worth rolling out partially first to confirm we didn't miss anything?
4ae8c49
to
735d36a
Compare
@wojcikstefan Yes, I will definitely canary this. |
735d36a
to
927b60c
Compare
Supersedes: #439