Better handling of feedparser.NonXMLContentType #171

lemon24 · 2020-06-20T14:00:15Z

https://www.cockroachlabs.com/blog/index.xml fails with:

feedparser.NonXMLContentType: text/html; charset=UTF-8 is not an XML media type

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/lemon/code/reader/src/reader/core.py", line 398, in _parse_feed_for_update
    return feed, self._parser(feed.url, feed.http_etag, feed.http_last_modified)
  File "/Users/lemon/code/reader/src/reader/_parser.py", line 198, in __call__
    return self._parse_http(url, http_etag, http_last_modified)
  File "/Users/lemon/code/reader/src/reader/_parser.py", line 299, in _parse_http
    feed, entries = _process_feed(url, result)
  File "/Users/lemon/code/reader/src/reader/_parser.py", line 162, in _process_feed
    raise ParseError(url) from exception
reader.exceptions.ParseError: https://www.cockroachlabs.com/blog/index.xml

But the feed does have data:

>>> import feedparser                                                                                 
>>> f = feedparser.parse('https://www.cockroachlabs.com/blog/index.xml')                              
>>> f.bozo, f.bozo_exception                                                                          
(1, URLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))
>>> import ssl                                                                                        
>>> ssl._create_default_https_context = ssl._create_unverified_context                                
>>> f = feedparser.parse('https://www.cockroachlabs.com/blog/index.xml')                              
>>> f.bozo, f.bozo_exception                                                                          
(1, NonXMLContentType('text/html; charset=UTF-8 is not an XML media type'))
>>> f.feed.title                                                                                      
'Cockroach Labs'
>>> len(f.entries)                                                                                    
200
>>> f.entries[0].title                                                                                
'Build an App with Active Record + CockroachDB'

Could we handle this in a manner similar to CharacterEncodingOverride?

reader/src/reader/_parser.py

Lines 157 to 158 in 3076a57

    
        if isinstance(exception, feedparser.CharacterEncodingOverride):
            log.warning("parse %s: got %r", url, exception)
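One way this could look, as a rough sketch rather than the actual reader code: keep a tuple of bozo exception types that only deserve a warning, and raise ParseError for everything else. The names check_bozo and tolerated are made up for illustration, ParseError here is a local stand-in for reader.exceptions.ParseError, and the demo uses a fake exception class so it runs without feedparser or network access. In real use the tuple would presumably be (feedparser.CharacterEncodingOverride, feedparser.NonXMLContentType).

```python
import logging
from types import SimpleNamespace

log = logging.getLogger("reader")


class ParseError(Exception):
    """Local stand-in for reader.exceptions.ParseError (assumption)."""


def check_bozo(url, result, tolerated):
    """Warn for tolerated bozo exceptions, raise ParseError for the rest.

    `result` is any object with feedparser-style `bozo` and
    `bozo_exception` attributes; `tolerated` is a tuple of exception
    classes that should not make the update fail.
    """
    if not result.bozo:
        return
    exception = result.bozo_exception
    if isinstance(exception, tolerated):
        # The feed still has usable data, so only log the problem.
        log.warning("parse %s: got %r", url, exception)
        return
    raise ParseError(url) from exception


# Demo with a fake result object (hypothetical names, no network needed).
class FakeNonXMLContentType(Exception):
    pass


result = SimpleNamespace(
    bozo=1,
    bozo_exception=FakeNonXMLContentType(
        "text/html; charset=UTF-8 is not an XML media type"
    ),
)
# Logs a warning instead of raising.
check_bozo("https://example.com/index.xml", result, (FakeNonXMLContentType,))
```

Passing the tolerated types in as a tuple keeps the check in one isinstance call, so adding another harmless exception type later is a one-line change.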


lemon24 · 2020-06-24T14:20:19Z

Funnily enough, both NonXMLContentType and CharacterEncodingOverride subclass ThingsNobodyCaresAboutButMe.

As a sanity check, it is probably a good idea to skip feeds for which feedparser can't detect a version:

>>> f = feedparser.parse('https://www.cockroachlabs.com/blog/index.xml')                              
>>> f.version                                                                                         
'rss20'
>>> f = feedparser.parse('https://github.com/lemon24/reader/issues/171')                              
>>> f.version                                                                                         
''
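A minimal sketch of that sanity check, assuming the parsed result exposes a feedparser-style version attribute that is an empty string when no feed format was detected (as in the session above). The helper name require_version is hypothetical, and ParseError is again a local stand-in for reader.exceptions.ParseError.

```python
from types import SimpleNamespace


class ParseError(Exception):
    """Local stand-in for reader.exceptions.ParseError (assumption)."""


def require_version(url, result):
    """Return the detected feed version, or raise ParseError if there is none.

    feedparser reports '' as the version when it cannot tell what kind
    of feed it was given, e.g. for an ordinary HTML page.
    """
    if not result.version:
        raise ParseError(url)
    return result.version


# Fake results standing in for feedparser.parse() output (no network needed).
feed_like = SimpleNamespace(version="rss20")
html_like = SimpleNamespace(version="")

detected = require_version("https://example.com/index.xml", feed_like)

try:
    require_version("https://example.com/some-page", html_like)
except ParseError as e:
    rejected_url = e.args[0]
```

Doing this after the bozo check means a feed served with the wrong Content-Type still gets through, while a plain HTML page does not.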


lemon24 added the core label Jun 20, 2020

lemon24 added a commit that referenced this issue Jun 24, 2020

Don't fail for feedparser.NonXMLContentType.

44a3585

For #171.

lemon24 added a commit that referenced this issue Jul 2, 2020

Raise ParseError for feeds with no version.

992d11b

For #171.

lemon24 closed this as completed Jul 2, 2020

lemon24 mentioned this issue Jul 31, 2020

Cloudflare doesn't like the reader user agent #181

Closed

lemon24 added a commit that referenced this issue Jan 25, 2021

If Content-Type is missing, don't set it.

1856c0f

For #108, Content-Type was set to text/xml if missing; in #171, we added more general handling for that problem, but the #108 code remained. Part of #205 refactoring / cleanup.
