Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refresh_queries shouldn't break because of a single query having a bad schedule object #4163

Merged
merged 9 commits into from
Mar 1, 2020

Conversation

rauchy
Copy link
Contributor

@rauchy rauchy commented Feb 12, 2020

Issue Summary

We recently had an issue where a single query had a bad schedule object:

{
    "interval": 604800, 
    "until": null,
    "day_of_week": "Sunday", 
    "time": "Invalid date"
}

And it caused Query.outdated_queries to blow up, which in turn stopped refresh_queries from working.

We should make sure there is enough exception handling that a single query won't stop the scheduler from working.

Technical details:

  • Redash Version: 8

@rauchy
Copy link
Contributor

rauchy commented Jan 9, 2020

While we should basically move on to the next query when an error occurs (and not stop the the loop), we should also handle a case where jobs already expire and can't be fetched (thus raising an exception), and remove the lock in these cases, so these queries can be executed as part of this issue.

@rauchy rauchy requested a review from arikfr February 12, 2020 20:10
@rauchy
Copy link
Contributor

rauchy commented Feb 12, 2020

While wrapping it in try/except blocks, I felt that refresh_queries has too much going on and wasn't easy to read, so I've taken the opportunity to refactor it into something (I see as) more readable.

Query.schedule.isnot(None),
expression.func.coalesce(
expression.text("schedule::json->>'interval'"), None
).isnot(None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't the Python version of this a bit easier to understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was. I just found the fact that the filtering logic is divided between the query and the enqueuing loops more confusing. I prefer a (slightly) more complex query over mixed concerns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's split anyway, because later we do tests on the schedule data to see if a query is outdated. How's checking if interval is None any different from the other tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I'm not sure if this test is even needed. I believe it's from an interim version of the code where there was always a schedule object, but later we switched to using None to signal no schedule.

Copy link
Contributor Author

@rauchy rauchy Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How's checking if interval is None any different from the other tests?

The only difference is how much complexity it adds. Checking for until was way harder to understand in the query.

later we switched to using None to signal no schedule

Yeah, looks like there are no residues of that so it's safe to remove. (6ad764f)

Comment on lines 665 to 667
logging.info(
"Could not determine if query %d is outdated due to %s",
query.id,
repr(e),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This should go into Sentry.
  2. Considering the way our scheduler works this is going to be spammy. Maybe we should report this and then unschedule?

Copy link
Contributor Author

@rauchy rauchy Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding #2 - this function should be idempotent. It's kinda strange to call outdated_queries and have your schedule change as a result.

I feel more comfortable with calling track_failure in these cases - this way users could find out that there was an error with running the query, and it would increase the schedule_failures counter, which would push the next reschedule down the exponential backoff track.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it would make more sense to use track_failure in _enqueue_queries instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think schedule_failures will work here, because it won't reach the part of the code that calculates the next iteration.

While usually you're right in the approach, we're talking about queries with malformed schedule object. Also, sending a user a notification about it doesn't feel useful as in most cases they won't know what to do about it.

Another approach we can take is to introduce a disabled boolean to the schedule object which we will trigger in such cases (and later can review these). The downside here is that it won't work in case it's completely malformed (not even a valid JSON).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, disabled could be set to True if schedule is broken or schedule['disabled'] is True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reporting to Sentry is done in 57b1d4f.

Comment on lines 46 to 50
job_exists = Job.exists(job_id)
job_complete = None

if job_exists:
job = Job.fetch(job_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's not very likely, it still can happen that between the call to exists to fetch, the job will be removed. Is there a version of fetch that returns None if job doesn't exist? Or maybe just catch exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated = models.Query.outdated_queries()
refreshable = _skip_unrefreshable_queries(outdated)
queries = _apply_default_parameters(refreshable)
enqueued = list(_enqueue_queries(queries))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be easier to understand, more performant (less loops) and simpler if it was something like:

for query in models.Query.outdated_queries():
    # logic from _skip_unrefreshable_queries
    if not should_refresh_query(query):
        continue
    try:
        enqueue_query(
            # logic from _apply_default_parameters
            apply_default_parameters(query.query_text),
            query.data_source,
            query.user_id,
            scheduled_query=query,
            metadata={"Query ID": query.id, "Username": "Scheduled"},
      )
    except Exception as e:
        logging.info("Could not enqueue query %d due to %s", query.id, repr(e))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, similar comment about exception logging as with the one in outdated_queries. Except that here we should only report to Sentry, but not unschedule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplicity / understandability is a matter of preference. Personally I thought of the journey from outdated_queries to enqueued as a set of transformations and filters, so I took a FP-based approach. I've switched to your suggestion in ef2eb39.

Regarding more performant - not sure, those were generator loops and I think there's a built-in optimization there.

@rauchy rauchy force-pushed the resilient-refresh-queries branch from 6ad764f to fea7814 Compare February 25, 2020 21:03
@rauchy rauchy requested a review from arikfr February 27, 2020 09:06
Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We might need to tweak the Sentry reporting, but let's see how it goes.

@rauchy rauchy merged commit a9cb87d into master Mar 1, 2020
@rauchy rauchy deleted the resilient-refresh-queries branch March 1, 2020 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants