Consider using defusedxml #212
This is one way of doing it:

    feedparser.api.PREFERRED_XML_PARSERS.insert(0, 'defusedxml.expatreader')

Note that it's global; maybe we can make a full copy of the feedparser.api module at runtime, to avoid monkeypatching?

Update: https://stackoverflow.com/a/11285504

An alternative is to (have the user) use
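Since the insert above mutates a module-level list, one way to limit its reach is to undo it once parsing is done. This is a minimal sketch, assuming `feedparser.api.PREFERRED_XML_PARSERS` is an ordinary list; `prefer_parser` is a hypothetical helper, and a plain list (contents assumed) stands in for the real one so the snippet runs without feedparser installed.

```python
import contextlib


@contextlib.contextmanager
def prefer_parser(parser_list, name):
    """Temporarily put `name` first in `parser_list` (hypothetical helper)."""
    parser_list.insert(0, name)
    try:
        yield parser_list
    finally:
        # list.remove() drops the first match, which is the entry we added
        # unless something reordered the list in the meantime.
        parser_list.remove(name)


# Stand-in for feedparser.api.PREFERRED_XML_PARSERS (contents assumed).
preferred = ["drv_libxml2"]

with prefer_parser(preferred, "defusedxml.expatreader"):
    inside = list(preferred)  # feedparser.parse(...) would go here

after = list(preferred)
```

This is still global while the block is active, so other threads parsing at the same time would see the temporary entry.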
There's one of my feeds that fails when I try the above.
The feed looks like this (note where

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE feed [
      <!ENTITY xhtml "http://www.w3.org/1999/xhtml">
      <!ENTITY id "tag:xn--8ws00zhy3a.com,">
    ]>
    <feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb" xml:base="http://www.詹姆斯.com/feed">
      ...
      <entry>
        <title>HTML Kong</title>
        <id>&id;2016-07-18:/blog/16</id>
        <updated>2016-07-18T05:14:09+00:00</updated>
        <summary type="xhtml">
          <div xmlns="&xhtml;">

I think this shows the need to be able to whitelist feeds. It would be nice if this were granular (just allow "xhtml" and "id"), but TBH that seems like a poor user experience. Also, if the feed becomes malicious, they could just change what the whitelisted entities mean. A single "trust this feed" is likely enough.
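To make a "trust this feed" decision, it may help to see which entities a feed declares. The following is a rough sketch using only the standard library's `xml.parsers.expat`; `declared_entities` is a hypothetical helper, and `FEED` is a shortened, ASCII-only version of the feed above. It parses with plain expat, so it is meant for inspecting a sample by hand, not as a replacement for defusedxml.

```python
import xml.parsers.expat


def declared_entities(data):
    """Return {name: value} for internal general entities in the DOCTYPE."""
    found = {}

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # `value` is None for external entities; keep internal general ones.
        if not is_parameter and value is not None:
            found[name] = value

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(data, True)
    return found


FEED = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE feed [
<!ENTITY xhtml "http://www.w3.org/1999/xhtml">
<!ENTITY id "tag:xn--8ws00zhy3a.com,">
]>
<feed xmlns="http://www.w3.org/2005/Atom">
<entry><id>&id;2016-07-18:/blog/16</id></entry>
</feed>
"""

entities = declared_entities(FEED)
for name, value in entities.items():
    print(name, "=", value)
```

As noted above, a granular allowlist keyed on entity names would not catch a feed that later redefines those names, which is why the listing is only an aid for a per-feed decision.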
FWIW, using lxml may be good enough: https://pypi.org/project/defusedxml/#python-xml-libraries (most of the vulnerabilities are marked with False, and we may not care about those marked with True; needs looking into).

Here's roughly what we need to do to use lxml with feedparser:

    import lxml.etree, lxml.sax

    class XMLParser:
        def __init__(self):
            self.handler = None

        def setFeature(self, *args):
            # we need to support/assert at least these:
            #setFeature('http://xml.org/sax/features/namespaces', 1)
            #setFeature('http://xml.org/sax/features/external-general-entities', 0)
            pass

        def setContentHandler(self, handler):
            assert not self.handler
            self.handler = handler

        def setErrorHandler(self, handler):
            assert self.handler is handler

        def parse(self, source):
            parser = lxml.etree.XMLParser(recover=True)
            tree = lxml.etree.parse(source.getByteStream(), parser)
            return lxml.sax.saxify(tree, self.handler)

    def create_parser(encoding=None):
        return XMLParser()

    import feedparser
    feedparser.api.PREFERRED_XML_PARSERS.insert(0, '__main__')
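The `'__main__'` entry works because of how the standard library resolves parser names: `xml.sax.make_parser` imports each name in the list as a module and calls its `create_parser()`. The sketch below shows only that mechanism, assuming feedparser passes `PREFERRED_XML_PARSERS` to `xml.sax.make_parser`; `StubParser` and the module name `hypothetical_lxml_shim` are made up for illustration and stand in for the lxml-backed class above.

```python
import sys
import types
import xml.sax


class StubParser:
    """Stand-in for the lxml-backed XMLParser class (illustration only)."""


def create_parser():
    return StubParser()


# Register a throwaway module so the name is importable, instead of
# relying on '__main__' as in the snippet above.
shim = types.ModuleType("hypothetical_lxml_shim")
shim.create_parser = create_parser
sys.modules[shim.__name__] = shim

# make_parser() tries the listed module names first, then its defaults.
parser = xml.sax.make_parser(["hypothetical_lxml_shim"])
print(type(parser).__name__)
```

Putting the shim in a real, importable module would avoid depending on whatever happens to be `__main__` at runtime.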
https://pypi.org/project/defusedxml/
Related issue: kurtmckee/feedparser#107
We have two (obvious) options here: