Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-ParseError exceptions during update_feeds() prevent updates for the remaining feeds #218

Closed
lemon24 opened this issue Feb 18, 2021 · 7 comments
Labels

Comments

@lemon24
Copy link
Owner

lemon24 commented Feb 18, 2021

From #204 (comment):

Currently, storage or plugin-raised exceptions prevent updates for the following feeds, but that's not necessarily by design.

We should look at this when we expose the plugin API (#80).

@lemon24
Copy link
Owner Author

lemon24 commented Jun 15, 2023

...like here: #310 (comment) (the Twitter plugin).

@lemon24
Copy link
Owner Author

lemon24 commented Jun 17, 2023

The main goal here is to allow updates to continue after a feed has failed.

Things to take into consideration:

  • We don't really need to wrap hook (including Session hooks) and retriever/(sub)parser calls one by one, we may get away with doing it somewhere at "top level", e.g. here in Pipeline.update().

    • But, we should be consistent / uniform.
  • OTOH, when should unexpected exceptions (bugs) not allow updates to other feeds to continue?

    1. reader "infrastructure" code (Parser, updater) should raise immediately; this is a reader bug, and Should Never Happen™.
      • How about StorageErrors (e.g. when the updater stores the feed)?
    2. reader built-in retrievers/parsers should also probably raise immediately, but it may be acceptable to continue.
    3. User-/plugin-supplied retrievers/parsers should continue (we've seen at least one bug here, in the Twitter plugin).
    4. Update hooks should continue (we've seen a bug here too, IIRC).
      • What about {before,after}_feeds_update_hooks, which go "around" the update loop? We don't need to wrap them, but we may want to, for consistency.
      • What if an after_feeds_update_hooks depends on an after_feed_update_hooks having run successfully? Arguably, it shouldn't (we might want to mention this somewhere in the docs).
      • What about a Session hook, when called by a built-in retriever? It should continue.
        • Update: These already raise ParseError (when called by the HTTP retriever).
  • Update: How granular should we be with allowing things to continue?

    • What do we return after_{entry,feed}_update_hooks fails?
      • In theory, the feed was updated at this point, so we have an UpdatedFeed we can return; we could return the exception "on the side" somehow (should have UpdateResult and UpdatedFeed been a single struct from the beginning?).
      • OTOH, hooks "should not fail" – if the user said a plugin should run, the plugin failing should be an error.
      • Of course, we can go as granular as it is useful – but how granular is that? (See the next question too.)
    • What happens if after_entry_update_hooks fails? Do we try to run it for the rest of the entries?

lemon24 added a commit that referenced this issue Aug 3, 2023
lemon24 added a commit that referenced this issue Aug 3, 2023
(update hooks, session hooks, retriever/parser unexpected exceptions).

For #218.
@lemon24
Copy link
Owner Author

lemon24 commented Aug 6, 2023

Some thoughts on update hook exception handling (also see the "How granular ..." question above):

  • Hooks "should not fail". Treating them otherwise opens up a lot of other decisions that need to be made, that potentially makes the end user API too complicated.
  • Currently, a hook failure for a feed raises immediately – that is, it prevents all other hooks from running for that feed.
  • Hooks that want to be resilient can follow the "tag before, clear tag after" pattern mentioned here: Handle redirects and gone feeds gracefully #246 (comment). This allows completing the work the plugin was supposed to do on a subsequent run.
    • This might prove to not be enough / make writing plugins too complicated. We can add more granularity later, if needed, but at the expense of API churn for plugins.
  • Nevertheless, failures happen; ideally, it should be hard for an updated feed to remain unprocessed because of a failure; moreso, a hook that works should not have to pay the price for one at the same stage that doesn't.
  • We can run all the hooks and raise an ExceptionGroup (Python 3.11) for the hooks that failed.
    • But, if we do this for all stages, we or the plugins might need to coordinate between hooks belonging to the same plugin, e.g. before_feed_... and a corresponding after_feed_....
    • For stages before the feed is updated (before_feed_...), does it make sense to continue with the feed update? If we fail early here, it simplifies the mental model quite a bit.
    • For after_entry_..., raising an ExceptionGroup isn't all that useful (and we might end up with thousands of exceptions).

Somewhat related: We continue with at least the granularity of one feed mostly because update_feeds_iter() is a thin wrapper over for feed in get_feeds(): update_feed(feed) (but with exception handling; we should document this).

@lemon24
Copy link
Owner Author

lemon24 commented Aug 6, 2023

OK, here's how update hooks get called (pseudocode; might want to document it when we're sure the order stays):

for hook in before_feeds_update_hooks:
    hook(reader)

for feed in get_feeds():
    try:
        parsed_feed = parse(feed)
    except ParseError as e:
        parsed_feed = e

    intents = make_intents(feed, parsed_feed)

    for hook in before_feed_update_hooks:
        hook(reader, feed.url)

    store(intents)

    for entry in intents.entries:
        for hook in after_entry_update_hooks:
            hook(reader, entry.entry, entry.status)

    for hook in after_feed_update_hooks:
        hook(reader, feed.url)

    if not parsed_feed or isinstance(parsed_feed, Exception):
        yield feed.url, parsed_feed
    else:
        yield feed.url, make_updated_feed(intents)
    
for hook in after_feeds_update_hooks:
    hook(reader)

Following is a proposal that takes into account the notes from previous comments.

Goals:

  • Wrap hook failures so they don't prevent updates for other feeds during update_feeds_iter().
  • Wrap hook failures so it is clear where the error came from (which phase, optionally which hook).
  • After data has been written to storage, do our best to allow hooks to run (if one fails, the others should still run).

Hooks wrap unexpected exceptions as follows (new exceptions detailed below):

  • before_feeds_update_hooks -> raise SingleUpdateHookError
    • fail early at the first hook failure
    • not backwards compatible; however, exceptions are currently raised directly, which is arguably a bug
  • before_feed_update_hooks -> yield url, SingleUpdateHookError
    • fail early at the first hook failure
    • backwards compatible, we can already yield ReaderError
  • after_entry_update_hooks -> yield url, UpdateHookErrorGroup
    • after also running after_feed_update_hooks
    • if there are concerns about keeping to many exceptions around (max is entries * entry hooks), we can keep the first 5 or something, and log the rest; there is precedent for this
    • backwards compatible, we can already yield ReaderError
  • after_feed_update_hooks -> yield url, UpdateHookErrorGroup
    • including errors from after_entry_update_hooks
    • shadows UpdatedFeed, acceptable (for now)
    • backwards compatible, we can already yield ReaderError
  • after_feeds_update_hooks -> raise UpdateHookErrorGroup
    • not backwards compatible, same as before_feeds_update_hooks

We add the following new exceptions:

ReaderError (old)
 +--- UpdateError
       +--- UpdateHookError
       |     +--- SingleUpdateHookError
       |     +--- UpdateHookErrorGroup [ExceptionGroup]
       +--- ParseError (old)

UpdateHookError will not be raised directly; it exists and has the shorter name because most people won't care about the individual errors (more than to log them), and will do except UpdateHookError; interested people can do except* SingleUpdateHookError to look at individual errors. SingleUpdateHookError cannot just be UpdateHookError because it will have attributes about the hook and stage that UpdateHookErrorGroup cannot not have.

For Python 3.10, which doesn't have ExceptionGroup, we use a shim like this one:
import traceback

class _ExceptionGroup(Exception):
    """ExceptionGroup shim for Python 3.10.

    Caveat: The tracebacks always show in str(exc).
    
    Avoids dependency on https://pypi.org/project/exceptiongroup/

    """
    def __init__(self, msg, excs):
        super().__init__(msg, tuple(excs))
        
    @property
    def message(self):
        return self.args[0]
    
    @property
    def exceptions(self):
        return self.args[1]
            
    def _format_lines(self):
        count = len(self.exceptions)
        s = 's' if count != 1 else ''
        yield f"{self.message} ({count} sub-exception{s})\n"
        for i, exc in enumerate(self.exceptions, 1):
            yield f"+{f' {i} '.center(36, '-')}\n"
            for line in traceback.format_exception(exc):
                for line in line.rstrip().splitlines():
                    yield f"| {line}\n"
        yield f"+{'-' * 36}\n"
            
    def __str__(self):
        return ''.join(self._format_lines()).rstrip()

try:
    ExceptionGroup
except NameError:
    ExceptionGroup = _ExceptionGroup
    
# example

try:
    1/0
except Exception as e:
    one = e

raise _ExceptionGroup('stuff happened', [
    NameError('name'), 
    _ExceptionGroup('more stuff happened', [
        one, 
        AttributeError('attr'),
    ]),
])

"""
Traceback (most recent call last):
  File ".../exc.py", line 47, in <module>
    raise _ExceptionGroup('stuff happened', [
__main__._ExceptionGroup: stuff happened (2 sub-exceptions)
+---------------- 1 -----------------
| NameError: name
+---------------- 2 -----------------
| _ExceptionGroup: more stuff happened (2 sub-exceptions)
| +---------------- 1 -----------------
| | Traceback (most recent call last):
| |   File ".../exc.py", line 43, in <module>
| |     1/0
| | ZeroDivisionError: division by zero
| +---------------- 2 -----------------
| | AttributeError: attr
| +------------------------------------
+------------------------------------
"""

Open Closed questions:

  • Is it OK for after_feed_update_hooks errors to shadow UpdatedFeed?
    • Yes; we got the side effects that we expected, but we still signal errors.
  • Is it OK for after_feed_update_hooks errors to shadow a ParseError?
    • No; arguably, after_*_update_hooks should not even run if there was an error getting the feed. ACTION ITEM
    • If this isn't enough, we can add later an UpdateErrorGroup to include both the parse and hook errors.
    • before_feed_update_hooks may run if we move them to before getting the feed, or if they become able to transform the feed update intent before it is stored (a feed update intent exists even if there was a ParseError, to set last_exception).
  • Should the feed or entry hooks set Feed.last_exception?

@lemon24
Copy link
Owner Author

lemon24 commented Aug 13, 2023

To do:

  • update hooks
    • new exception types
    • wrap exceptions
    • better logging
    • update_feeds() UpdateHookError handling
    • change UpdateResult.value to be UpdateError
    • tests
    • refactor update hook passing and application
    • CLI updates
    • docs, versionchanged, changelog
  • wrap sub-parser/retriever exceptions
  • wrap session hook exceptions (nothing to do)

@lemon24
Copy link
Owner Author

lemon24 commented Aug 21, 2023

OK, moving on to unexpected retriever/parser unexpected errors: per #218 (comment), they should not prevent updates for remaining feeds; wrapping seems the way to go.

(1) A possible maximal exception hierarchy:

ReaderError (old)
 +--- UpdateError (old)
       +--- ParseError [FeedError, ReaderWarning] (old)
       |     +--- UnexpectedParseError
       +--- RetrieveError [ParseError (temporary)]
             +--- UnexpectedRetrieverError

This has the benefit of distinguishing clearly between the errors caused by a retriever and by those caused by a (sub-)parser, while not making the hierarchy to deep.

Some disadvantages:

  • RetrieveError must remain a ParseError subclass until 4.0 to maintain backwards compatibility
  • exceptions raised in reader._parser don't have a specific common parent
  • there is no exception for non-retriever, non-sub-parser errors (e.g. raised in the meta-Parser ... but we don't have case like this now)

(2) Alternative that doesn't have them, at the expense of deeper hierarchy and slightly confusing naming (ParserError):

ReaderError (old)
 +--- UpdateError (old)
       +--- ParseError [FeedError, ReaderWarning] (old)
             +--- RetrieverError
             |     +--- UnexpectedRetrieverError
             +--- ParserError
                   +--- UnexpectedParserError      

The minimal solution is to either (3) add no new classes, or (4) add a single UnexpectedParseError subclass.

An additional complication is that currently, Feed.last_exception contains details about the cause of the ParseError, if any, not the ParseError itself. If ParseError gets subclasses, it becomes imperative to also have details about the exception itself (type, message); arguably, we should have done this from the start (it would also allow capturing UpdateHookError details, if we ever decide to).

Update: I went with solution 3. ccf8d26

@lemon24
Copy link
Owner Author

lemon24 commented Aug 24, 2023

Session hook error already get wrapped in ParseError, and we have a test for it; good enough, for now.

@lemon24 lemon24 closed this as completed Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant