Some websites don't have feeds #222

Open
lemon24 opened this issue Mar 5, 2021 · 1 comment

lemon24 commented Mar 5, 2021

Examples:

It should be relatively easy to have a retriever/parser pair that handles URLs like (newlines added for clarity):

magic+
http://example.com/page.html?
magic-entries=<entries anchor CSS selector>&
magic-content=<content CSS selector>

to mean:

  • retrieve http://example.com/page.html
  • for every link that matches entries anchor CSS selector
    • create an entry from the element that matches content CSS selector
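
A rough sketch of what this could look like, assuming requests and BeautifulSoup, and reading magic-content as applying to each linked page (all the names here are illustrative, not reader's actual API):

from urllib.parse import urlparse, parse_qs, urljoin

import bs4
import requests

def parse_magic_url(url):
    # magic+http://example.com/page.html?magic-entries=...&magic-content=...
    assert url.startswith('magic+')
    parsed = urlparse(url[len('magic+'):])
    query = parse_qs(parsed.query)
    entries_selector = query['magic-entries'][0]
    content_selector = query['magic-content'][0]
    # (a real implementation would preserve any non-magic query params)
    page_url = parsed._replace(query='').geturl()
    return page_url, entries_selector, content_selector

def retrieve_magic(url):
    page_url, entries_selector, content_selector = parse_magic_url(url)
    soup = bs4.BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for anchor in soup.select(entries_selector):
        entry_url = urljoin(page_url, anchor['href'])
        entry_soup = bs4.BeautifulSoup(requests.get(entry_url).text, 'html.parser')
        content = entry_soup.select_one(content_selector)
        yield entry_url, anchor.get_text(strip=True), str(content)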

Instead of magic-content, we could also use some library that guesses what the content is (there must be some out there).
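
For instance, readability-lxml is one such library (just an example, not a decision):

# readability-lxml guesses the main content of an HTML page;
# `html` here stands for the page retrieved above.
from readability import Document

doc = Document(html)
title = doc.title()      # best-guess title
content = doc.summary()  # best-guess main content, as an HTML fragment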

In its best form, this should also cover the functionality of the sqlite_releases plugin. Note that magic-content wouldn't work there, since there's no single container for the whole content; also, some of the old versions don't actually have a link.

This will also be a good test of the internal retriever/parser API we implemented in #205.


Open questions:


lemon24 commented Apr 27, 2021

Some thoughts about how to implement this in the parser:

If there are multiple things to be retrieved, we can't return them as a single file object; also, we may need to fabricate "composite" caching headers. I see two options:

  • do all of the parsing in the retriever, and return (feed, entries) directly, bypassing the parser (I like this one)
  • make result.http_etag etc. properties that raise an exception if accessed before result.file is actually parsed; seems hacky; also, get_parser_by_url() would need to support more than exact matching

The first one would look something like this:

# RetrieveResult is renamed to FileResult, and in its place there's a union.
# RetrieverType continues to return ContextManager[Optional[RetrieveResult]]
RetrieveResult = Union[FileResult, ParsedFeed]

# class Parser:
def __call__(self, url, http_etag, http_last_modified):
    parser = self.get_parser_by_url(url)
    ...

    # Must be able to match schemes like magic+http://.
    # Note that prefix match is not enough, 
    # magic+file.txt == file:///magic+file.txt;
    # normalizing the URL beforehand could work.
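    # For example (illustrative, not existing code), checking the parsed
    # scheme instead of a string prefix would avoid that:
    #
    #   urlparse('magic+http://example.com/page.html').scheme  # 'magic+http'
    #   urlparse('magic+file.txt').scheme                      # '' (bare path)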
    retriever = self.get_retriever(url)
   
    with retriever(url, http_etag, http_last_modified, ...) as result:
        if not result:
            return None

        # Parsing already done, return the result (this is new).
        if isinstance(result, ParsedFeed):
            return result

        # Continue with the old logic.

        if not parser:
            ...
        
        feed, entries = parser(url, result.file, result.headers)

    return ParsedFeed(feed, entries, result.http_etag, result.http_last_modified)
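
For completeness, the retriever side of the first option might look something like this (a hypothetical sketch; retrieve_and_parse() is a made-up helper standing in for the actual work):

from contextlib import contextmanager

# class MagicRetriever:
@contextmanager
def __call__(self, url, http_etag, http_last_modified):
    # All the parsing happens here; Parser.__call__ above sees a ParsedFeed
    # and returns it as-is, without ever calling get_parser_by_url().
    feed, entries = self.retrieve_and_parse(url)  # hypothetical helper
    # There are no meaningful caching headers for a composite of
    # multiple requests, so return None for both.
    yield ParsedFeed(feed, entries, None, None)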
