Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using defusedxml #212

Open
lemon24 opened this issue Jan 24, 2021 · 3 comments
Open

Consider using defusedxml #212

lemon24 opened this issue Jan 24, 2021 · 3 comments

Comments

@lemon24
Copy link
Owner

lemon24 commented Jan 24, 2021

https://pypi.org/project/defusedxml/

Related issue: kurtmckee/feedparser#107

We have two (obvious) options here:

  • contribute to feedparser, see the issue linked above (I don't know if it's easy to do)
  • pass the XML stream through defusedxml before passing it to feedparser
    • this will likely remove feedparser's ability to deal with broken xml (if it still can do that)
@lemon24
Copy link
Owner Author

lemon24 commented Jun 2, 2021

This is one way of doing it:

feedparser.api.PREFERRED_XML_PARSERS.insert(0, 'defusedxml.expatreader')

Note that it's global; maybe we can make a full copy of the feedparser.api module at runtime, to avoid monkeypatching? Update: https://stackoverflow.com/a/11285504

An alternative is to (have the user) use defusedxml.defuse_stdlib() (unsupported) by themselves and be done with it.

@lemon24
Copy link
Owner Author

lemon24 commented Jul 25, 2021

There's one of my feeds that fails when I try the above.

unexpected error while reading feed: 'http://www.xn--8ws00zhy3a.com/feed': defusedxml.common.EntitiesForbidden: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)

The feed looks like this (note where &id; is used; it's likely critical to actually expand it):

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE feed [
  <!ENTITY xhtml "http://www.w3.org/1999/xhtml">
  <!ENTITY id "tag:xn--8ws00zhy3a.com,">
]>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb" xml:base="http://www.詹姆斯.com/feed">

...

  <entry>
    <title>HTML Kong</title>
    <id>&id;2016-07-18:/blog/16</id>
    <updated>2016-07-18T05:14:09+00:00</updated>
    <summary type="xhtml">
      <div xmlns="&xhtml;">

I think this shows the need to be able to whitelist feeds.

It would be nice if this were granular (just allow "xhtml" and "id"), but TBD that seems like a poor user experience. Also, if the feed becomes malicious, they could just change what the whitelisted entities mean. A single "trust this feed" is likely enough.

@lemon24
Copy link
Owner Author

lemon24 commented Nov 1, 2021

FWIW, using lxml may be good enough: https://pypi.org/project/defusedxml/#python-xml-libraries (most of the vulnerabilities are marked with False, and we may not care about those marked with True; needs looking into).

Here's roughly what we need to do to use lxml with feedparser:

import lxml.etree, lxml.sax

class XMLParser:
    def __init__(self):
        self.handler = None
    def setFeature(self, *args):
        # we need to support/assert at least these:
        #setFeature('http://xml.org/sax/features/namespaces', 1)
        #setFeature('http://xml.org/sax/features/external-general-entities', 0)
        pass
    def setContentHandler(self, handler):
        assert not self.handler
        self.handler = handler
    def setErrorHandler(self, handler):
        assert self.handler is handler
    def parse(self, source):
        parser = lxml.etree.XMLParser(recover=True)
        tree = lxml.etree.parse(source.getByteStream(), parser)
        return lxml.sax.saxify(tree, self.handler)
            
    
def create_parser(encoding=None):
    return XMLParser()

import feedparser

feedparser.api.PREFERRED_XML_PARSERS.insert(0, '__main__')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant