Python 3.10 #959

squeaky-pl · 2024-10-28T17:09:26Z

Supersedes: #439

wojcikstefan · 2024-10-29T09:12:05Z

inbox/util/html.py

+    try:
+        s.feed(html)
+    except AssertionError as e:
+        if e.args[0].startswith("unknown status keyword"):
+            raise HTMLParseError(*e.args) from e
+        raise


I'm curious, why was this change needed?

The HTML parser in Python 3.10 started raising AssertionError on some invalid HTML; previously these cases surfaced as HTMLParseError. I think it's debatable whether AssertionError is the right exception here, especially when dealing with user-supplied data (which emails are). To me, AssertionErrors are programmer errors, not user errors. We actually have a test for this case, meaning that at some point Nylas people hit this kind of HTML, which would now trigger an AssertionError.

You can see the failure here.

    def test_strip_tags():
        text = (
            "<div><script> AAH JAVASCRIPT</script><style> AAH CSS AHH</style>"
            'check out this <a href="http://example.com/">link</a> yo!</div>'
        )
        assert strip_tags(text).strip() == "check out this link yo!"
        # MS Word conditional marked section
        text = "<![if word]>content<![endif]>"
        assert strip_tags(text).strip() == "content"
        # Unknown marked section
        text = """<![FOR]>"""
        with pytest.raises(HTMLParseError):
>           strip_tags(text)

...

    def parse_marked_section(self, i, report=1):
        rawdata= self.rawdata
        assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
        sectName, j = self._scan_name( i+3, i )
        if j < 0:
            return j
        if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
            # look for standard ]]> ending
            match= _markedsectionclose.search(rawdata, i+3)
        elif sectName in {"if", "else", "endif"}:
            # look for MS Office ]> ending
            match= _msmarkedsectionclose.search(rawdata, i+3)
        else:
>           raise AssertionError(
                'unknown status keyword %r in marked section' % rawdata[i+3:j]
            )
E           AssertionError: unknown status keyword 'FOR' in marked section

@wojcikstefan should I add a clarifying comment, maybe?

Thanks for explaining, makes sense 👍 Yea, I think it's worth adding a comment – it might not be clear at a glance why we don't work directly with the AssertionError further up the stack.

Are there other invalid HTML errors that will still raise an AssertionError? Should we re-raise all s.feed(html) assertion errors as HTML parser errors? Or does s.feed do more than just parse?

Additionally, if those became AssertionErrors, we would not catch them here:

    def calculate_html_snippet(self, text: str) -> str:
        try:
            text = strip_tags(text)
        except HTMLParseError:
            log.error(
                "error stripping tags", message_nylas_uid=self.nylas_uid, exc_info=True
            )
            text = ""
        return self.calculate_plaintext_snippet(text)

meaning that we would discard the entire email, instead of discarding just the snippet.
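To make the concern concrete, here is a rough sketch of that fallback with the tag stripper passed in as a parameter. Everything except the `try`/`except HTMLParseError` shape is an assumption for illustration: the free-function signature, the stand-in `HTMLParseError` class, the two `raises_*` helpers, and the truncation length are all hypothetical.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)

SNIPPET_LENGTH = 191  # arbitrary value chosen for this illustration


class HTMLParseError(ValueError):
    """Hypothetical stand-in for the project's own parse error."""


def calculate_html_snippet(text: str, strip_tags: Callable[[str], str]) -> str:
    """Return a plain-text snippet, or "" when the HTML cannot be parsed."""
    try:
        text = strip_tags(text)
    except HTMLParseError:
        # Only the snippet is lost; the caller keeps processing the message.
        log.error("error stripping tags", exc_info=True)
        text = ""
    # Stand-in for calculate_plaintext_snippet(): collapse whitespace, truncate.
    return " ".join(text.split())[:SNIPPET_LENGTH]


def raises_parse_error(_: str) -> str:
    """Simulates a stripper that reports bad HTML the way the handler expects."""
    raise HTMLParseError("unknown status keyword 'FOR' in marked section")


def raises_assertion(_: str) -> str:
    """Simulates a stripper that lets the parser's AssertionError escape."""
    raise AssertionError("unknown status keyword 'FOR' in marked section")
```

With `raises_parse_error` the function returns an empty snippet. With `raises_assertion` the exception passes straight through the handler, which is the case that would abort processing further up the stack.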

We technically could replace except HTMLParseError with except AssertionError though, right? Or are there other AssertionErrors being thrown that we should definitely NOT catch?

@wojcikstefan I did a deep dive, found the change in CPython that causes this:

python/cpython@e34bbfd

As you can see, they removed the error callback, which we were using to raise our own exception. That method had long been deprecated. I still think raising AssertionError is the wrong choice on the CPython developers' part; a subclass of ValueError would be better. So I agree we should catch all the AssertionErrors.

There are some AssertionErrors that indicate programming errors in the parser's own source, like this one, but they should not happen.

Thanks for catching this. Will follow up with a fix.

@wojcikstefan Pushed a fix in 4ae8c49 (#959). Assigning you as a reviewer as well because you spent a lot of time looking at this already.

wojcikstefan

Looks good! Do you think it's worth rolling out partially first to confirm we didn't miss anything?

squeaky-pl · 2024-10-29T12:46:17Z

@wojcikstefan Yes, I will definitely canary this.

squeaky-pl mentioned this pull request Oct 28, 2024

Python 3.12 #960

Merged

2 tasks

wojcikstefan mentioned this pull request Oct 29, 2024

Python 3.10 #439

Closed

wojcikstefan reviewed Oct 29, 2024


squeaky-pl requested a review from wojcikstefan October 29, 2024 11:17

wojcikstefan approved these changes Oct 29, 2024


squeaky-pl force-pushed the python-310 branch from 4ae8c49 to 735d36a October 29, 2024 12:43

squeaky-pl added 3 commits October 29, 2024 14:17

Dockerfile & requirements

5231efd

Fix tests

09f35ee

Treat all AssertionErrors as HTMLParseErrors

927b60c

squeaky-pl force-pushed the python-310 branch from 735d36a to 927b60c October 29, 2024 13:18

squeaky-pl merged commit 1a6d56d into master Oct 29, 2024
3 checks passed

squeaky-pl deleted the python-310 branch October 29, 2024 14:23
