
JSON Feed support #205

Closed
lemon24 opened this issue Dec 23, 2020 · 4 comments
lemon24 commented Dec 23, 2020

https://en.m.wikipedia.org/wiki/JSON_Feed

https://jsonfeed.org/

Asked about in https://www.reddit.com/r/selfhosted/comments/kioq3g/comment/ggs3kuk?context=3


Question: Is this worth supporting, or a case of featuritis?

The Wikipedia page mentions NPR as a publisher that supports it, and the latest version of the spec mentions about 10 other websites.

Update: Here's some more users: https://indieweb.org/JSON_Feed

We could make it a plug-in.


Regardless of the support required, this is an interesting use case, since to implement it as a separate parser we'd need a way of delegating by extension and/or MIME type.

At the moment, we can only delegate to a parser by feed URL prefix (and making people add "json+http://..." to their feeds is not exactly user friendly).
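For reference, delegation by URL prefix can be sketched roughly like this (the registry and names here are illustrative, not the actual reader code):

```python
# Illustrative sketch of delegation by URL prefix; the names are made up,
# the actual subparser registry may look different.
SUBPARSERS = [
    ("http://", "HTTPParser"),
    ("https://", "HTTPParser"),
    ("", "FileParser"),  # fallback for local paths
]

def get_subparser(url):
    """Return the first subparser whose prefix matches the URL."""
    for prefix, subparser in SUBPARSERS:
        if url.startswith(prefix):
            return subparser

assert get_subparser("https://example.com/feed.xml") == "HTTPParser"
assert get_subparser("feeds/local.xml") == "FileParser"
```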

@lemon24 lemon24 added the core label Dec 23, 2020

lemon24 commented Jan 24, 2021

OK, to implement this in a modular way, we'll split the current "subparsers" (HTTPParser/FileParser) into a Retriever and a (Sub)Parser.

The Retriever:

  • Is selected by URL prefix (like subparsers are now).
  • Arguments:
    • URL
    • optional caching headers
    • Accept headers from all the known parsers
  • Returns:
    • file-like object
    • optional MIME type
    • optional caching headers
    • optional response HTTP headers
  • If no MIME type is returned, it's guessed from the URL using the mimetypes stdlib module.

The (Sub)Parser:

  • Is selected by the MIME type returned by the retriever. (We should probably have feedparser as a fallback when no MIME type can be guessed, for backwards compatibility.)
    • Should there be a way to special-case an URL (prefix)? How do we support plugins like sqlite_releases?
    • How? Exact match? Do we support type/* and */*? Should application/unknown+xml fall back to application/xml?
      • feedparser uses the following Accept headers at the moment: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1 (note the */* catchall).
      • JSON Feed uses application/json (v1) and application/feed+json.
  • Arguments:
    • URL
    • file object
    • response HTTP headers
  • Returns: the parsed feed.
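The two interfaces above can be sketched as follows; all names, signatures, and the result tuple are assumptions for illustration, not the final API:

```python
# Rough sketch of the Retriever / (Sub)Parser split described above.
# Names and signatures here are assumptions, not reader's actual API.
from typing import IO, NamedTuple, Optional


class RetrieveResult(NamedTuple):
    file: IO[bytes]
    mime_type: Optional[str] = None
    caching_headers: Optional[dict] = None
    response_headers: Optional[dict] = None


class Retriever:
    # selected by URL prefix, like the current subparsers
    url_prefix: str

    def __call__(self, url, caching_headers, http_accept) -> RetrieveResult:
        raise NotImplementedError


class SubParser:
    # selected by the MIME type from the retriever (or guessed from the URL)
    http_accept: str

    def __call__(self, url, file, response_headers):
        """Return the parsed feed."""
        raise NotImplementedError
```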

Here's pseudo-code of how they all fit together in the (Meta)Parser (the current Parser class):

# input
url: str = ...
# currently http_etag and http_last_modified
caching_headers: dict = ...

# actually stored on a Parser instance
RETRIEVERS = [HTTPRetriever(), FileRetriever()]
PARSERS = [JSONFeedParser(), FeedparserParser()]

# actually a Parser method
retriever = get_retriever(url)

http_accept = merge_accept_headers(p.accept_headers for p in PARSERS)
    
file, mime_type, caching_headers, headers = retriever.get(
    url, caching_headers, http_accept
)
if not mime_type:
    mime_type, _ = mimetypes.guess_type(url)

# actually a Parser method
parser = get_parser(mime_type)

parsed_feed = parser(url, file, headers)

rv = parsed_feed, caching_headers
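The MIME-type guess in the pseudo-code above uses the stdlib mimetypes module, which accepts a URL and returns a (type, encoding) pair:

```python
# How guessing the MIME type from the URL works with the stdlib;
# mimetypes.guess_type returns a (type, encoding) tuple.
import mimetypes

mime_type, _ = mimetypes.guess_type("http://example.com/feed.json")
assert mime_type == "application/json"

# .xml maps to text/xml or application/xml depending on the platform
mime_type, _ = mimetypes.guess_type("http://example.com/feed.xml")
assert mime_type in ("text/xml", "application/xml")

# unknown extensions give None, which is where the
# feedparser catch-all would kick in
mime_type, _ = mimetypes.guess_type("http://example.com/feed")
assert mime_type is None
```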

Here's how (sub)parser selection works:

from werkzeug.datastructures import MIMEAccept
from werkzeug.http import parse_accept_header, parse_options_header

# the accept headers come from parser.accept_header,
# except for the wildcard, which is added manually;
# in practice, feedparser and feedparser (catch-all) are the same object
PARSERS = [
    (parse_accept_header(a, MIMEAccept), parser)
    for a, parser in [
        # everything in feedparser.http.ACCEPT, except the wildcard (*/*);
        # only a few included for brevity
        ("application/atom+xml,application/xml;q=0.9", "feedparser"),
        ("application/feed+json,application/json;q=0.9", "jsonfeed"),
        # for backwards compatibility
        ("*/*;q=0.1", "feedparser (catch-all)"),
    ]
]

def get_parser(mime_type):
    for accept, parser in PARSERS:
        if accept.best_match([mime_type]):
            return parser

def merge_accept_headers():
    values = []
    for accept, _ in PARSERS:
        values.extend(accept)
    return MIMEAccept(values).to_header()

print(merge_accept_headers())

content_types = [
    "application/xml; charset=ISO-8859-1",
    "application/xml",
    "application/whatever+xml",
    "application/json",
    "unknown/type",
]

for content_type in content_types:
    mime_type, _ = parse_options_header(content_type)
    print(content_type, '->', get_parser(mime_type))

"""
application/atom+xml,application/feed+json,application/xml;q=0.9,application/json;q=0.9,*/*;q=0.1
application/xml; charset=ISO-8859-1 -> feedparser
application/xml -> feedparser
application/whatever+xml -> feedparser (catch-all)
application/json -> jsonfeed
unknown/type -> feedparser (catch-all)
"""

@lemon24
Copy link
Owner Author

lemon24 commented Jan 24, 2021

To do:

  • decide how parser matching works
  • refactor current code
  • implement JSON Feed parser
  • documentation
    • [x] werkzeug dependency
    • changelog
    • index
    • docstrings (which?)
  • fix sqlite_releases
  • clean up _parser.py code
    • use type aliases
    • maybe move URL stuff into a module
    • reorder
    • docstrings
    • [ ] maybe get rid of caching_get
    • maybe get rid of _NotModified and use feed=None instead
  • manual test

lemon24 added a commit that referenced this issue Jan 25, 2021
For #108, Content-Type was set to text/xml if missing;
in #171, we added more general handling for that problem,
but the #108 code remained.

Part of #205 refactoring / cleanup.

lemon24 commented Jan 28, 2021

OK, I added / updated all the feeds below:
http://shapeof.com/feed.json
http://flyingmeat.com/blog/feed.json
http://maybepizza.com/feed.json
https://daringfireball.net/feeds/json
http://hypercritical.co/feeds/main.json
http://inessential.com/feed.json
https://manton.org/feed/json
https://micro.blog/feeds/manton.json
http://timetable.manton.org/feed.json
http://therecord.co/feed.json
http://www.allenpike.com/feed.json
https://jsonfeed.org/feed.json
https://adactio.com/articles/feed.json
https://jonnybarnes.uk/blog/feed.json
https://matthiasott.com/articles/feed.json
https://ascraeus.org/jsonfeed/index.json
https://feeds.npr.org/1019/feed.json
https://feeds.npr.org/510317/feed.json

Most things look fine: authors, dates, attachments, HTML, titles.

The only issue is that feed.updated isn't set (the spec doesn't specify one); we should use the newest entry for that.

Update: This is not specific to JSON feeds; cut #214 for it.
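Since JSON Feed has no feed-level date field, the fallback described above could look like this (feed_updated is a hypothetical helper, not the actual implementation):

```python
# Hypothetical helper: fall back to the newest entry date when the feed
# itself has no updated date (JSON Feed has no feed-level date field).
from datetime import datetime

def feed_updated(entry_dates):
    """Return the newest non-None entry date, or None for an empty feed."""
    dates = [d for d in entry_dates if d is not None]
    return max(dates) if dates else None

assert feed_updated(
    [datetime(2021, 1, 1), None, datetime(2021, 1, 28)]
) == datetime(2021, 1, 28)
assert feed_updated([]) is None
```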


lemon24 commented Jan 29, 2021

Time spent:

thing        hours
design         2.5
refactoring    8.0
tests          2.5
json feed      5.0
cleanup        2.0
docs           0.5
total         20.5
