Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 3.10 #959

Merged
merged 3 commits into from
Oct 29, 2024
Merged

Python 3.10 #959

merged 3 commits into from
Oct 29, 2024

Conversation

squeaky-pl
Copy link
Contributor

Supersedes: #439

@squeaky-pl squeaky-pl mentioned this pull request Oct 28, 2024
2 tasks
@wojcikstefan wojcikstefan mentioned this pull request Oct 29, 2024
Comment on lines 72 to 77
try:
s.feed(html)
except AssertionError as e:
if e.args[0].startswith("unknown status keyword"):
raise HTMLParseError(*e.args) from e
raise
Copy link
Member

@wojcikstefan wojcikstefan Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, why was this change needed?

Copy link
Contributor Author

@squeaky-pl squeaky-pl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python parser in Python 3.10 started throwing AssertionErrors on some invalid HTML, those were previously HTMLParserError. I think it's opinionated weather it should be AssertionErrors , especially when treating with user supplied data (which emails are). To me AssertionErrors are programmer errors, not user errors. We actually have a test for those, meaning that at some point Nylas people hit this kind of HTML that would now trigger AssertionError.

Copy link
Contributor Author

@squeaky-pl squeaky-pl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the failure here.

    def test_strip_tags():
        text = (
            "<div><script> AAH JAVASCRIPT</script><style> AAH CSS AHH</style>"
            'check out this <a href="http://example.com/">link</a> yo!</div>'
        )
        assert strip_tags(text).strip() == "check out this link yo!"
    
        # MS Word conditional marked section
        text = "<![if word]>content<![endif]>"
    
        assert strip_tags(text).strip() == "content"
    
        # Unknown marked section
        text = """<![FOR]>"""
    
        with pytest.raises(HTMLParseError):
>           strip_tags(text)

    ...

    def parse_marked_section(self, i, report=1):
        rawdata= self.rawdata
        assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
        sectName, j = self._scan_name( i+3, i )
        if j < 0:
            return j
        if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
            # look for standard ]]> ending
            match= _markedsectionclose.search(rawdata, i+3)
        elif sectName in {"if", "else", "endif"}:
            # look for MS Office ]> ending
            match= _msmarkedsectionclose.search(rawdata, i+3)
        else:
>           raise AssertionError(
                'unknown status keyword %r in marked section' % rawdata[i+3:j]
            )
E           AssertionError: unknown status keyword 'FOR' in marked section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojcikstefan should I add clarification comment maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining, makes sense 👍 Yea, I think it's worth adding a comment – it might not be clear at a glance why we don't work directly with the AssertionError further up the stack.

Are there other invalid HTML errors that will still raise an AssertionError? Should we re-raise all s.feed(html) assertion errors as HTML parser errors? Or does s.feed do more than just parse?

Copy link
Contributor Author

@squeaky-pl squeaky-pl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally if those would become AssertionErrors we would not catch them here

    def calculate_html_snippet(self, text: str) -> str:
        try:
            text = strip_tags(text)
        except HTMLParseError:
            log.error(
                "error stripping tags", message_nylas_uid=self.nylas_uid, exc_info=True
            )
            text = ""

        return self.calculate_plaintext_snippet(text)

meaning that we would discard the entire email, instead of discarding just the snippet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We technically could replace except HTMLParserError with except AssertionError though, right? Or are there other AssertionErrors being thrown that we should definitely NOT catch?

Copy link
Contributor Author

@squeaky-pl squeaky-pl Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojcikstefan I did a deep dive, found the change in CPython that causes this:

python/cpython@e34bbfd

As you can see they have removed the error callback, we were using the error callback to raise our exception. But that method was long deprecated. I still think the choice to raise AssertionError is the wrong choice on CPython developers, I think a subclass of ValueError would be better. So I agree we should catch all the AssertionErrors.

There are some AssertionErrors that are programming errors in the source of the parser, like this one, but they should not happen.

Thanks for catching this. Will follow with a fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojcikstefan Pushed a fix in 4ae8c49 (#959). Assigning you as a reviewer as well because you spent a lot of time looking at this already.

Copy link
Member

@wojcikstefan wojcikstefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Do you think it's worth rolling out partially first to confirm we didn't miss anything?

@squeaky-pl
Copy link
Contributor Author

@wojcikstefan Yes, I will definitely canary this.

@squeaky-pl squeaky-pl merged commit 1a6d56d into master Oct 29, 2024
3 checks passed
@squeaky-pl squeaky-pl deleted the python-310 branch October 29, 2024 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants