-
-
Notifications
You must be signed in to change notification settings - Fork 31.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
_markupbase.py fails with UnboundLocalError on invalid keyword in marked section #78661
Comments
$ pip freeze | grep beautifulsoup4
beautifulsoup4==4.6.3 $ python
Python 3.7.0 (default, Jul 23 2018, 20:24:19)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> text = "<![hi world]>"
>>> BeautifulSoup(text, 'html.parser')
/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py:78: UserWarning: unknown status keyword 'hi ' in marked section
warnings.warn(msg)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 282, in __init__
self._feed()
File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/__init__.py", line 343, in _feed
self.builder.feed(self.markup)
File "/Users/conradroche/solvvy/ml-pipeline/venv/lib/python3.7/site-packages/bs4/builder/_htmlparser.py", line 247, in feed
parser.feed(markup)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 111, in feed
self.goahead(0)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/html/parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_markupbase.py", line 160, in parse_marked_section
if not match:
UnboundLocalError: local variable 'match' referenced before assignment
>>> |
Thanks for the report. HTMLParser.error() was supposed to raise an exception, but the BeautifulSoup project just prints a warning here: def error(self, msg):
warnings.warn(msg) https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L69 As a result of this, the code doesn't stop executing in the following branch:
https://github.com/python/cpython/blob/3.7/Lib/_markupbase.py#L159 Note that HTMLParser.error() was removed in Python 3.5 (73a4359#diff-1a7486df8279dbac7f20abd487947845L171) and there is an open issue about the status of _markupbase.ParserBase.error(): bpo-31844. I also think that 73a4359#diff-1a7486df8279dbac7f20abd487947845L171 may have caused a minor regression when it was removed the error() method and its uses from the HTMLParser class. It still calls the parse_marked_section() method of _markupbase.ParserBase() which it then calls the error() method of _markupbase.ParserBase():
https://github.com/python/cpython/blob/3.7/Lib/html/parser.py#L264 |
Attaching a test case for this issue since the self.error code path was not covered in the current test suite. A PR is open proposing match variable initialisation with None #17643 . diff --git Lib/test/test_htmlparser.py Lib/test/test_htmlparser.py
index a2bfb39d16..a9aff11706 100644
--- Lib/test/test_htmlparser.py
+++ Lib/test/test_htmlparser.py
@@ -766,6 +766,18 @@ class AttributesTestCase(TestCaseBase):
[("href", "http://www.example.org/\">;")]),
("data", "spam"), ("endtag", "a")])
+ def test_invalid_keyword_error_exception(self):
+ class InvalidMarkupException(Exception):
+ pass
+
+ class MyHTMLParser(html.parser.HTMLParser):
+
+ def error(self, message):
+ raise InvalidMarkupException(message)
+
+ parser = MyHTMLParser()
+ with self.assertRaises(InvalidMarkupException):
+ parser.feed('<![invalid>')
if __name__ == "__main__":
unittest.main() |
(Author of PR #17643) Since the behavior of self.error() is determined by the subclass implementation, an Exception is not guaranteed. How should this be handled? It seems the options are:
Happy to update the PR with @XTreak's test cases. |
HTMLParser is supposed to follow the HTML5 standard, and never raise an error. For the example in the first comment ("<![hi world]>"), the steps should be:
I agree that the error should be fixed by setting However, it also seems wrong that HTMLParser ends up calling self.error() through Lib/_markupbase.py ParserBase after HTMLParser.error() and all the calls to it have been removed. _markupbase.py is internal, so it should be safe to remove ParserBase.error() and the code that calls it as suggested in bpo-31844 (and possibly to merge _markupbase into html.parser too). Even if this is done and the call to self.error() is removed from ParserBase.parse_marked_section(), |
Should I reopen #8562 then? |
I think so. |
Here's a simplified real world example that I found if it helps anyone: import bs4
html = '''<html><head><!fill in here--> <![end--><!-- start: SSI y-xtra-topspace.shtml --><!-- --></head><body></body></html>'''
soup = bs4.BeautifulSoup(html, 'html.parser') |
For reference, my Python version is 3.8.2 and my bs4 version is 4.8.1 |
I updated to bs4 version 4.9.0 and the same issue occurs. |
I'm the maintainer of Beautiful Soup. I learned about this issue when one of my users filed a bug for it against Beautiful Soup (https://bugs.launchpad.net/beautifulsoup/+bug/1883264). BeautifulSoupHTMLParser (my subclass of html.parser.HTMLParser) implements error() only to prevent the NotImplementedError raised by ParserBase.error(). The docstring for my error() implementation says: "this method is called only on very strange markup and our best strategy is to pretend it didn't happen and keep going." It shouldn't affect Beautiful Soup if, in a future version of Python, HTMLParser.error is never called and ParserBase is removed. |
Based on the changelog, it seems that the fix for this issue has only landed in Python 3.10 but it's still affecting versions 3.7, 3.8 and 3.9. There are a few projects such as Apache Beam that have only recently added support for Python 3.9 with no support for 3.10. Beautiful Soup is a popular tool that's affected by this bug and so my question is if backporting this bugfix to at least Python 3.9 which is still an officially supported release is on the table? |
Alright, the PR is up. This is my first contribution to Python and I had to sign the Contributor Agreement so it may take 1 business day before you're able to accept it. In the meantime, can you please review and let me know if anything needs to change? I wasn't sure where / how to add an entry in the NEWS.d folder. Can you please help me? Thanks! |
any question about this issue? if not, i think it is better to close it |
This has been fixed in these PRs so it can be closed: |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: