Better handling of feedparser.NonXMLContentType #171

Closed
lemon24 opened this issue Jun 20, 2020 · 1 comment

lemon24 commented Jun 20, 2020

https://www.cockroachlabs.com/blog/index.xml fails with:

feedparser.NonXMLContentType: text/html; charset=UTF-8 is not an XML media type

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/lemon/code/reader/src/reader/core.py", line 398, in _parse_feed_for_update
    return feed, self._parser(feed.url, feed.http_etag, feed.http_last_modified)
  File "/Users/lemon/code/reader/src/reader/_parser.py", line 198, in __call__
    return self._parse_http(url, http_etag, http_last_modified)
  File "/Users/lemon/code/reader/src/reader/_parser.py", line 299, in _parse_http
    feed, entries = _process_feed(url, result)
  File "/Users/lemon/code/reader/src/reader/_parser.py", line 162, in _process_feed
    raise ParseError(url) from exception
reader.exceptions.ParseError: https://www.cockroachlabs.com/blog/index.xml

But the feed does have data:

>>> import feedparser                                                                                 
>>> f = feedparser.parse('https://www.cockroachlabs.com/blog/index.xml')                              
>>> f.bozo, f.bozo_exception                                                                          
(1, URLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))
>>> import ssl                                                                                        
>>> ssl._create_default_https_context = ssl._create_unverified_context                                
>>> f = feedparser.parse('https://www.cockroachlabs.com/blog/index.xml')                              
>>> f.bozo, f.bozo_exception                                                                          
(1, NonXMLContentType('text/html; charset=UTF-8 is not an XML media type'))
>>> f.feed.title                                                                                      
'Cockroach Labs'
>>> len(f.entries)                                                                                    
200
>>> f.entries[0].title                                                                                
'Build an App with Active Record + CockroachDB'

Could we handle this in a manner similar to CharacterEncodingOverride?

if isinstance(exception, feedparser.CharacterEncodingOverride):
    log.warning("parse %s: got %r", url, exception)
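A minimal sketch of what that could look like, extending the same log-and-continue branch to a tuple of tolerated exception types. This is not the actual reader code: `check_bozo`, the `tolerated` parameter, the local `ParseError`, and `FakeNonXMLContentType` are hypothetical stand-ins so the example runs without feedparser installed.

```python
import logging
from types import SimpleNamespace

log = logging.getLogger("reader")


class ParseError(Exception):
    """Hypothetical stand-in for reader.exceptions.ParseError."""


def check_bozo(url, result, tolerated):
    """Return True if the parse result is usable; raise ParseError otherwise.

    `result` is anything with feedparser-style `bozo` / `bozo_exception`
    attributes. `tolerated` is a tuple of exception types to log and ignore,
    e.g. (feedparser.CharacterEncodingOverride, feedparser.NonXMLContentType).
    """
    if not result.bozo:
        return True
    exception = result.bozo_exception
    if isinstance(exception, tolerated):
        # Same treatment as CharacterEncodingOverride: warn, keep the data.
        log.warning("parse %s: got %r", url, exception)
        return True
    raise ParseError(url) from exception


# Demo with a made-up exception type standing in for NonXMLContentType.
class FakeNonXMLContentType(Exception):
    pass


result = SimpleNamespace(
    bozo=1,
    bozo_exception=FakeNonXMLContentType(
        "text/html; charset=UTF-8 is not an XML media type"
    ),
)
ok = check_bozo(
    "https://www.cockroachlabs.com/blog/index.xml",
    result,
    (FakeNonXMLContentType,),
)
```

Any other bozo exception still ends up as a `ParseError` chained to the original, as in the traceback above.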

lemon24 added the core label Jun 20, 2020

lemon24 commented Jun 24, 2020

Funnily enough, both NonXMLContentType and CharacterEncodingOverride subclass ThingsNobodyCaresAboutButMe.

As a sanity check, it is probably a good idea to skip feeds for which feedparser can't detect a version:

>>> f = feedparser.parse('https://www.cockroachlabs.com/blog/index.xml')                              
>>> f.version                                                                                         
'rss20'
>>> f = feedparser.parse('https://github.com/lemon24/reader/issues/171')                              
>>> f.version                                                                                         
''
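Roughly, the check could be a small predicate on the parse result. A sketch only: `looks_like_feed` is a hypothetical name, and the `SimpleNamespace` objects stand in for real `feedparser.parse()` results so the example runs without network access.

```python
from types import SimpleNamespace


def looks_like_feed(result):
    """Return True if a feed format was detected (non-empty `version`)."""
    return bool(getattr(result, "version", ""))


# Stand-ins mirroring the two results shown above.
rss_result = SimpleNamespace(version="rss20")   # the Cockroach Labs feed
html_result = SimpleNamespace(version="")       # a plain HTML page

accepted = [r.version for r in (rss_result, html_result) if looks_like_feed(r)]
```

With this in place, a text/html response is only accepted when the body actually parses as a known feed format.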

lemon24 added a commit that referenced this issue Jun 24, 2020
lemon24 added a commit that referenced this issue Jul 2, 2020
lemon24 closed this as completed Jul 2, 2020
lemon24 added a commit that referenced this issue Jan 25, 2021
For #108, Content-Type was set to text/xml if missing;
in #171, we added more general handling for that problem,
but the #108 code remained.

Part of #205 refactoring / cleanup.