Consider using defusedxml #212

lemon24 · 2021-01-24T09:04:43Z

We have two (obvious) options here:

- contribute to feedparser; see the issue linked above (I don't know how easy that is)
- pass the XML stream through defusedxml before passing it to feedparser
  - this will likely remove feedparser's ability to deal with broken XML (if it can still do that)

lemon24 · 2021-06-02T14:47:32Z

This is one way of doing it:

feedparser.api.PREFERRED_XML_PARSERS.insert(0, 'defusedxml.expatreader')

Note that it's global; maybe we can make a full copy of the feedparser.api module at runtime, to avoid monkeypatching? Update: https://stackoverflow.com/a/11285504
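To make the "full copy" idea concrete, here is a minimal sketch using importlib. `load_private_copy` is a hypothetical helper name, and I have not checked it against feedparser.api itself. It runs the module's code a second time into a new module object that is never registered in sys.modules, so changes to the copy's globals should not leak into the shared module.

```python
import importlib.util


def load_private_copy(module_name):
    """Return a fresh, private copy of an importable module.

    The copy is not stored in sys.modules, so mutating its module-level
    globals leaves the normally imported module untouched.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot find module {module_name!r}")
    module = importlib.util.module_from_spec(spec)
    # Execute the module's code in the new, private namespace.
    spec.loader.exec_module(module)
    return module
```

With feedparser this would presumably look like `load_private_copy('feedparser.api').PREFERRED_XML_PARSERS.insert(0, 'defusedxml.expatreader')`, assuming the copy's parse() only consults its own module-level list. Anything the module imports is still shared with the original.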

An alternative is to (have the user) use defusedxml.defuse_stdlib() (unsupported) by themselves and be done with it.

lemon24 · 2021-07-25T09:45:31Z

One of my feeds fails when I try the above.

unexpected error while reading feed: 'http://www.xn--8ws00zhy3a.com/feed': defusedxml.common.EntitiesForbidden: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)

The feed looks like this (note where &id; is used; it's likely critical to actually expand it):

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE feed [
  <!ENTITY xhtml "http://www.w3.org/1999/xhtml">
  <!ENTITY id "tag:xn--8ws00zhy3a.com,">
]>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb" xml:base="http://www.詹姆斯.com/feed">

...

  <entry>
    <title>HTML Kong</title>
    <id>&id;2016-07-18:/blog/16</id>
    <updated>2016-07-18T05:14:09+00:00</updated>
    <summary type="xhtml">
      <div xmlns="&xhtml;">

I think this shows the need to be able to whitelist feeds.

It would be nice if this were granular (just allow "xhtml" and "id"), but TBH that seems like a poor user experience. Also, if the feed becomes malicious, they could just change what the whitelisted entities mean. A single "trust this feed" is likely enough.
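A rough sketch of what a per-feed "trust this feed" switch could look like. To keep it stdlib-only, it does not call defusedxml; it mimics the entity check with an expat EntityDeclHandler. `EntitiesForbidden` here is a local stand-in for defusedxml.common.EntitiesForbidden, and `check_entities` and `trusted_feeds` are hypothetical names, not an existing API.

```python
import xml.parsers.expat


class EntitiesForbidden(ValueError):
    """Local stand-in for defusedxml.common.EntitiesForbidden (illustration only)."""


def check_entities(data, feed_url, trusted_feeds=frozenset()):
    """Raise EntitiesForbidden if an untrusted feed declares any entity.

    `data` is the raw feed (bytes); `trusted_feeds` is a set of feed URLs
    the user has explicitly marked as trusted.
    """
    if feed_url in trusted_feeds:
        return  # "trust this feed": skip the check entirely

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Called by expat for every <!ENTITY ...> declaration.
        raise EntitiesForbidden(f"entity {name!r} declared by {feed_url!r}")

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(data, True)
```

The trust decision is all-or-nothing per feed URL, which matches the single "trust this feed" idea rather than a per-entity whitelist.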

lemon24 · 2021-11-01T17:10:16Z

FWIW, using lxml may be good enough: https://pypi.org/project/defusedxml/#python-xml-libraries (most of the vulnerabilities are marked with False, and we may not care about those marked with True; needs looking into).

Here's roughly what we need to do to use lxml with feedparser:

import lxml.etree
import lxml.sax

import feedparser


class XMLParser:
    """Minimal SAX-parser look-alike: parse with lxml, then replay the tree as SAX events."""

    def __init__(self):
        self.handler = None

    def setFeature(self, *args):
        # we need to support/assert at least these:
        # setFeature('http://xml.org/sax/features/namespaces', 1)
        # setFeature('http://xml.org/sax/features/external-general-entities', 0)
        pass

    def setContentHandler(self, handler):
        assert not self.handler
        self.handler = handler

    def setErrorHandler(self, handler):
        # the error handler is expected to be the same object as the content handler
        assert self.handler is handler

    def parse(self, source):
        # recover=True lets lxml keep going past malformed XML
        parser = lxml.etree.XMLParser(recover=True)
        tree = lxml.etree.parse(source.getByteStream(), parser)
        return lxml.sax.saxify(tree, self.handler)


def create_parser(encoding=None):
    # xml.sax.make_parser() calls create_parser() on the module it imports
    return XMLParser()


# '__main__' only works when this file is run as a script
feedparser.api.PREFERRED_XML_PARSERS.insert(0, '__main__')
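For context on why putting a module name in that list works: xml.sax.make_parser() takes a list of module names, imports each in turn, and returns the result of that module's create_parser(). As far as I can tell, feedparser passes PREFERRED_XML_PARSERS to it. A stdlib-only sketch of the lookup; `CustomParser` and `my_sax_driver` are made-up names standing in for a real driver such as the lxml adapter above.

```python
import sys
import types
import xml.sax
import xml.sax.expatreader


class CustomParser(xml.sax.expatreader.ExpatParser):
    """Hypothetical stand-in for a hardened or alternative parser."""


# Build a throwaway driver module; a real driver would be a normal
# importable module that defines create_parser().
driver = types.ModuleType("my_sax_driver")
driver.create_parser = CustomParser
sys.modules["my_sax_driver"] = driver

# make_parser() imports each named module and calls its create_parser().
parser = xml.sax.make_parser(["my_sax_driver"])

# Names that cannot be imported are skipped, and make_parser() falls back
# to its default parser list.
fallback = xml.sax.make_parser(["no_such_sax_driver"])
```

Note the silent fallback: if the preferred driver fails to import, parsing quietly continues with the default parser, so it may be worth asserting the type of the parser that was actually created.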

lemon24 mentioned this issue Apr 27, 2021

Make default_parser() part of the public API #235

Closed

This was referenced Nov 18, 2021

Consider using Atoma #263

Closed

Consider supporting alternative feed parsers #264

Closed

lemon24 mentioned this issue Nov 29, 2021

Consider working around feedparser's issues #265

Open

lemon24 mentioned this issue Dec 18, 2021

Plan for upcoming 3x memory improvement PR (and a few others) kurtmckee/feedparser#296

Open

lemon24 added core feed parsing labels Jan 29, 2022

LukeMurphey mentioned this issue Feb 15, 2023

Feedparser parsing of untrusted XML is not ideal LukeMurphey/splunk-syndication-input#12

Open

lemon24 mentioned this issue Aug 31, 2024

test_parser.py fails most tests when running with vendored feedparser #350

Closed
