
JSON Feed support #205

Closed
lemon24 opened this issue Dec 23, 2020 · 4 comments
lemon24 commented Dec 23, 2020

https://en.m.wikipedia.org/wiki/JSON_Feed

https://jsonfeed.org/

Asked about in https://www.reddit.com/r/selfhosted/comments/kioq3g/comment/ggs3kuk?context=3


Question: Is this worth supporting, or a case of featuritis?

The Wikipedia page mentions NPR as a publisher that supports it, and the latest version of the spec mentions about 10 other websites.

Update: Here's some more users: https://indieweb.org/JSON_Feed

We could make it a plug-in.


Regardless of the support required, this is an interesting use case, since to implement it as a separate parser we'd need a way of delegating by extension and/or MIME type.

At the moment, we can only delegate to a parser by feed URL prefix (and making people add "json+http://..." to their feeds is not exactly user friendly).
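For reference, delegation by URL prefix can be sketched roughly like this (the registry and names here are illustrative, not the actual reader code):

```python
# Illustrative sketch of delegation by URL prefix; the names are made up,
# the actual subparser registry may look different.
SUBPARSERS = [
    ("http://", "HTTPParser"),
    ("https://", "HTTPParser"),
    ("", "FileParser"),  # fallback for local paths
]

def get_subparser(url):
    """Return the first subparser whose prefix matches the URL."""
    for prefix, subparser in SUBPARSERS:
        if url.startswith(prefix):
            return subparser

assert get_subparser("https://example.com/feed.xml") == "HTTPParser"
assert get_subparser("feeds/local.xml") == "FileParser"
```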

@lemon24 lemon24 added the core label Dec 23, 2020

lemon24 commented Jan 24, 2021

OK, to implement this in a modular way, we'll split the current "subparsers" (HTTPParser/FileParser) into a Retriever and a (Sub)Parser.

The Retriever:

  • Is selected by URL prefix (like subparsers are now).
  • Arguments:
    • URL
    • optional caching headers
    • Accept headers from all the known parsers
  • Returns:
    • file-like object
    • optional MIME type
    • optional caching headers
    • optional response HTTP headers
  • If no MIME type is returned, it's guessed from the URL using the mimetypes stdlib module.

The (Sub)Parser:

  • Is selected by the MIME type returned by the retriever. (We should probably have feedparser as a fallback when no MIME type can be guessed, for backwards compatibility.)
    • Should there be a way to special-case an URL (prefix)? How do we support plugins like sqlite_releases?
    • How? Exact match? Do we support type/* and */*? Should application/unknown+xml fall back to application/xml?
      • feedparser uses the following Accept headers at the moment: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1 (note the */* catchall).
      • JSON Feed uses application/json (v1) and application/feed+json.
  • Arguments:
    • URL
    • file object
    • response HTTP headers
  • Returns: the parsed feed.
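The two interfaces above can be sketched as follows; all names, signatures, and the result tuple are assumptions for illustration, not the final API:

```python
# Rough sketch of the Retriever / (Sub)Parser split described above.
# Names and signatures here are assumptions, not reader's actual API.
from typing import IO, NamedTuple, Optional


class RetrieveResult(NamedTuple):
    file: IO[bytes]
    mime_type: Optional[str] = None
    caching_headers: Optional[dict] = None
    response_headers: Optional[dict] = None


class Retriever:
    # selected by URL prefix, like the current subparsers
    url_prefix: str

    def __call__(self, url, caching_headers, http_accept) -> RetrieveResult:
        raise NotImplementedError


class SubParser:
    # selected by the MIME type from the retriever (or guessed from the URL)
    http_accept: str

    def __call__(self, url, file, response_headers):
        """Return the parsed feed."""
        raise NotImplementedError
```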

Here's pseudo-code of how they all fit together in the (Meta)Parser (the current Parser class):

# input
url: str = ...
# currently http_etag and http_last_modified
caching_headers: dict = ...

# actually stored on a Parser instance
RETRIEVERS = [HTTPRetriever(), FileRetriever()]
PARSERS = [JSONFeedParser(), FeedparserParser()]

# actually a Parser method
retriever = get_retriever(url)

http_accept = merge_accept_headers(p.accept_headers for p in PARSERS)
    
file, mime_type, caching_headers, headers = retriever.get(
    url, caching_headers, http_accept
)
if not mime_type:
    mime_type, _ = mimetypes.guess_type(url)

# actually a Parser method
parser = get_parser(mime_type)

parsed_feed = parser(url, file, headers)

rv = parsed_feed, caching_headers
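The MIME-type guess in the pseudo-code above uses the stdlib mimetypes module, which accepts a URL and returns a (type, encoding) pair:

```python
# How guessing the MIME type from the URL works with the stdlib;
# mimetypes.guess_type returns a (type, encoding) tuple.
import mimetypes

mime_type, _ = mimetypes.guess_type("http://example.com/feed.json")
assert mime_type == "application/json"

# .xml maps to text/xml or application/xml depending on the platform
mime_type, _ = mimetypes.guess_type("http://example.com/feed.xml")
assert mime_type in ("text/xml", "application/xml")

# unknown extensions give None, which is where the
# feedparser catch-all would kick in
mime_type, _ = mimetypes.guess_type("http://example.com/feed")
assert mime_type is None
```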

Here's how (sub)parser selection works:

from werkzeug.datastructures import MIMEAccept
from werkzeug.http import parse_accept_header, parse_options_header

# the accept headers come from parser.accept_header,
# except for the wildcard, which is added manually;
# in practice, feedparser and feedparser (catch-all) are the same object
PARSERS = [
    (parse_accept_header(a, MIMEAccept), parser)
    for a, parser in [
        # everything in feedparser.http.ACCEPT, except the wildcard (*/*);
        # only a few included for brevity
        ("application/atom+xml,application/xml;q=0.9", "feedparser"),
        ("application/feed+json,application/json;q=0.9", "jsonfeed"),
        # for backwards compatibility
        ("*/*;q=0.1", "feedparser (catch-all)"),
    ]
]

def get_parser(mime_type):
    for accept, parser in PARSERS:
        if accept.best_match([mime_type]):
            return parser

def merge_accept_headers():
    values = []
    for accept, _ in PARSERS:
        values.extend(accept)
    return MIMEAccept(values).to_header()

print(merge_accept_headers())

content_types = [
    "application/xml; charset=ISO-8859-1",
    "application/xml",
    "application/whatever+xml",
    "application/json",
    "unknown/type",
]

for content_type in content_types:
    mime_type, _ = parse_options_header(content_type)
    print(content_type, '->', get_parser(mime_type))

"""
application/atom+xml,application/feed+json,application/xml;q=0.9,application/json;q=0.9,*/*;q=0.1
application/xml; charset=ISO-8859-1 -> feedparser
application/xml -> feedparser
application/whatever+xml -> feedparser (catch-all)
application/json -> jsonfeed
unknown/type -> feedparser (catch-all)
"""

@lemon24
Copy link
Owner Author

lemon24 commented Jan 24, 2021

To do:

  • decide how parser matching works
  • refactor current code
  • implement JSON Feed parser
  • documentation
    • [x] werkzeug dependency
    • changelog
    • index
    • docstrings (which?)
  • fix sqlite_releases
  • clean up _parser.py code
    • use type aliases
    • maybe move URL stuff into a module
    • reorder
    • docstrings
    • [ ] maybe get rid of caching_get
    • maybe get rid of _NotModified and use feed=None instead
  • manual test

lemon24 added a commit that referenced this issue Jan 25, 2021
For #108, Content-Type was set to text/xml if missing;
in #171, we added more general handling for that problem,
but the #108 code remained.

Part of #205 refactoring / cleanup.

lemon24 commented Jan 28, 2021

OK, I added / updated all the feeds below:
http://shapeof.com/feed.json
http://flyingmeat.com/blog/feed.json
http://maybepizza.com/feed.json
https://daringfireball.net/feeds/json
http://hypercritical.co/feeds/main.json
http://inessential.com/feed.json
https://manton.org/feed/json
https://micro.blog/feeds/manton.json
http://timetable.manton.org/feed.json
http://therecord.co/feed.json
http://www.allenpike.com/feed.json
https://jsonfeed.org/feed.json
https://adactio.com/articles/feed.json
https://jonnybarnes.uk/blog/feed.json
https://matthiasott.com/articles/feed.json
https://ascraeus.org/jsonfeed/index.json
https://feeds.npr.org/1019/feed.json
https://feeds.npr.org/510317/feed.json

Most things look fine: authors, dates, attachments, HTML, titles.

The only issue is that feed.updated isn't set (the spec doesn't specify one); we should use the newest entry for that.

Update: This is not specific to JSON feeds; cut #214 for it.
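Since JSON Feed has no feed-level date field, the fallback described above could look like this (feed_updated is a hypothetical helper, not the actual implementation):

```python
# Hypothetical helper: fall back to the newest entry date when the feed
# itself has no updated date (JSON Feed has no feed-level date field).
from datetime import datetime

def feed_updated(entry_dates):
    """Return the newest non-None entry date, or None for an empty feed."""
    dates = [d for d in entry_dates if d is not None]
    return max(dates) if dates else None

assert feed_updated(
    [datetime(2021, 1, 1), None, datetime(2021, 1, 28)]
) == datetime(2021, 1, 28)
assert feed_updated([]) is None
```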


lemon24 commented Jan 29, 2021

Time spent:

thing        hours
design         2.5
refactoring    8.0
tests          2.5
json feed      5.0
cleanup        2.0
docs           0.5
total         20.5
