Some websites don't have feeds #222

Open
lemon24 opened this issue Mar 5, 2021 · 1 comment

lemon24 commented Mar 5, 2021

Examples:

It should be relatively easy to have a retriever/parser pair that handles URLs like (newlines added for clarity):

magic+
http://example.com/page.html?
magic-entries=<entries anchor CSS selector>&
magic-content=<content CSS selector>

to mean:

  • retrieve http://example.com/page.html
  • for every link that matches entries anchor CSS selector
    • create an entry from the element that matches content CSS selector
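
A rough sketch of what this could look like, assuming requests and BeautifulSoup, and reading magic-content as applying to each linked page (all the names here are illustrative, not reader's actual API):

from urllib.parse import urlparse, parse_qs, urljoin

import bs4
import requests

def parse_magic_url(url):
    # magic+http://example.com/page.html?magic-entries=...&magic-content=...
    assert url.startswith('magic+')
    parsed = urlparse(url[len('magic+'):])
    query = parse_qs(parsed.query)
    entries_selector = query['magic-entries'][0]
    content_selector = query['magic-content'][0]
    # (a real implementation would preserve any non-magic query params)
    page_url = parsed._replace(query='').geturl()
    return page_url, entries_selector, content_selector

def retrieve_magic(url):
    page_url, entries_selector, content_selector = parse_magic_url(url)
    soup = bs4.BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for anchor in soup.select(entries_selector):
        entry_url = urljoin(page_url, anchor['href'])
        entry_soup = bs4.BeautifulSoup(requests.get(entry_url).text, 'html.parser')
        content = entry_soup.select_one(content_selector)
        yield entry_url, anchor.get_text(strip=True), str(content)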

Instead of magic-content, we could also use some library that guesses what the content is (there must be some out there).
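
For instance, readability-lxml is one such library (just an example, not a decision):

# readability-lxml guesses the main content of an HTML page;
# `html` here stands for the page retrieved above.
from readability import Document

doc = Document(html)
title = doc.title()      # best-guess title
content = doc.summary()  # best-guess main content, as an HTML fragment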

In its best form, this should also cover the functionality of the sqlite_releases plugin. Note that magic-content wouldn't work there, since there's no single container for the whole content; also, some of the old versions don't actually have a link.

This will also be a good test of the internal retriever/parser API we implemented in #205.


Open questions:


lemon24 commented Apr 27, 2021

Some thoughts about how to implement this in the parser:

If there are multiple things to be retrieved, we can't return them as a single file object; also, we may need to fabricate "composite" caching headers. I see two options:

  • do all of the parsing in the retriever, and return (feed, entries) directly, bypassing the parser (I like this one)
  • make result.http_etag etc. properties that raise an exception if accessed before result.file is actually parsed; seems hacky; also, get_parser_by_url() would need to support more than exact matching

The first one would look something like this:

# RetrieveResult is renamed to FileResult, and in its place there's a union.
# RetrieverType continues to return ContextManager[Optional[RetrieveResult]]
RetrieveResult = Union[FileResult, ParsedFeed]

# class Parser:
def __call__(self, url, http_etag, http_last_modified):
    parser = self.get_parser_by_url(url)
    ...

    # Must be able to match schemes like magic+http://.
    # Note that prefix match is not enough, 
    # magic+file.txt == file:///magic+file.txt;
    # normalizing the URL beforehand could work.
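    # For example (illustrative, not existing code), checking the parsed
    # scheme instead of a string prefix would avoid that:
    #
    #   urlparse('magic+http://example.com/page.html').scheme  # 'magic+http'
    #   urlparse('magic+file.txt').scheme                      # '' (bare path)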
    retriever = self.get_retriever(url)
   
    with retriever(url, http_etag, http_last_modified, ...) as result:
        if not result:
            return None

        # Parsing already done, return the result (this is new).
        if isinstance(result, ParsedFeed):
            return result

        # Continue with the old logic.

        if not parser:
            ...
        
        feed, entries = parser(url, result.file, result.headers)

    return ParsedFeed(feed, entries, result.http_etag, result.http_last_modified)
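
For completeness, the retriever side of the first option might look something like this (a hypothetical sketch; retrieve_and_parse() is a made-up helper standing in for the actual work):

from contextlib import contextmanager

# class MagicRetriever:
@contextmanager
def __call__(self, url, http_etag, http_last_modified):
    # All the parsing happens here; Parser.__call__ above sees a ParsedFeed
    # and returns it as-is, without ever calling get_parser_by_url().
    feed, entries = self.retrieve_and_parse(url)  # hypothetical helper
    # There are no meaningful caching headers for a composite of
    # multiple requests, so return None for both.
    yield ParsedFeed(feed, entries, None, None)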
