Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed Scheduled Queries Report #3798

Merged
merged 67 commits into from
Jul 28, 2019
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
67 commits
Select commit Hold shift + click to select a range
dae6013
initial work on e-mail report for failed queries
May 14, 2019
ca7e3a2
send failure report only for scheduled queries and not for adhoc queries
May 14, 2019
b388c6f
add setting to determine if to send failure reports
May 14, 2019
e910fd3
add setting to determine interval of aggregated e-mail report
May 14, 2019
e695549
html templating of scheduled query failure report
May 14, 2019
13e9bb1
break line
May 14, 2019
95be1a9
support timeouts for failure reports
May 14, 2019
ba299cd
aggregate errors within message and warn if approaching threshold
May 14, 2019
075c2a6
Merge branch 'master' into failed-queries-report
May 19, 2019
0722195
handle errors in QueryExecutor.run instead of on_failure
May 19, 2019
8959688
move failure report to its own module
May 19, 2019
d88e589
indicate that failure count is since last report
May 19, 2019
5adb58e
copy changes
May 19, 2019
db6086f
format with <code>
May 19, 2019
b90d673
styling, copy and add a link to the query instead of the query text
May 19, 2019
ac9e896
separate reports with <hr>
May 19, 2019
4e1ffe5
switch to UTC
May 19, 2019
66cea95
move <h2> to actual e-mail subject
May 19, 2019
6e20846
add explicit message for SoftTimeLimitExceeded
May 19, 2019
eb8275c
fix test to use soft time limits
May 19, 2019
6ba466f
default query failure threshold to 100
May 19, 2019
8767011
use base_url from utils
May 19, 2019
045789d
newlines. newlines everywhere.
May 19, 2019
6f3c426
remove redundant import
May 19, 2019
abe6a36
apply new design for failure report
May 21, 2019
a2072b6
use jinja to format the failure report
May 21, 2019
102c1e7
don't show comment block if no comment is provided
May 21, 2019
04458be
don't send emails if, for some reason, there are no available errors
May 21, 2019
4770c43
subtract 1 from failure count, because the first one is represented b…
May 21, 2019
5011198
don't show '+X more failures' if there's only one
May 21, 2019
7fe20ea
extract subject to variable
May 21, 2019
9c38f3d
format as text, while we're at it
May 21, 2019
b61b7eb
allow scrolling for long exception messages
May 21, 2019
90d14ad
test that e-mails are scheduled only when beneath limit
May 21, 2019
d504c2b
test for indicating when approaching report limits + refactor
May 21, 2019
90c6a30
test that failures are aggregated
May 21, 2019
f539bee
test that report counts per query and reason
May 21, 2019
6b6bc0e
test that the latest failure occurence is reported
May 21, 2019
3f40de8
force sending reports for testing purposes
May 21, 2019
99ed585
Update redash/templates/emails/failures.html
May 22, 2019
722f7d2
Update redash/templates/emails/failures.html
May 22, 2019
a08a249
Update redash/tasks/failure_report.py
May 22, 2019
bfdccb4
Merge branch 'master' into failed-queries-report
May 27, 2019
d4d51dd
Merge branch 'master' into failed-queries-report
May 30, 2019
3a281ba
add org setting for email reports
May 31, 2019
85a319e
Merge branch 'failed-queries-report' of github.com:getredash/redash i…
Jun 2, 2019
5e5e456
Merge branch 'master' into failed-queries-report
Jun 10, 2019
6ff0962
remove logo from failure report email
Jun 10, 2019
9215c70
Merge branch 'master' into failed-queries-report
Jul 14, 2019
5043b5b
Merge branch 'master' into failed-queries-report
Jul 18, 2019
6db4803
Merge branch 'master' into failed-queries-report
Jul 20, 2019
57515ef
correctly use the organization setting for sending failure reports
Jul 20, 2019
0bfe9f4
use user id as key for failure reports data structure
Jul 20, 2019
72fb0b1
Update redash/tasks/failure_report.py
Jul 20, 2019
8182365
build comments while creating context for e-mail templates
Jul 20, 2019
cdd72b6
Merge branch 'failed-queries-report' of github.com:getredash/redash i…
Jul 20, 2019
4e80124
figure out the base url when creating the e-mail
Jul 20, 2019
8333e25
no need to expire pending failure report keys as they are deleted any…
Jul 20, 2019
688b216
a couple of CodeClimate changes
Jul 20, 2019
bafe782
refactor key creationg to a single location
Jul 20, 2019
2275d31
refactor tests to send e-mail from a single function
Jul 20, 2019
2fb4882
use beat to schedule a periodic send_aggregated_errors task instead o…
Jul 21, 2019
6f0066e
remove pending key as it is no longer required when a periodic task p…
Jul 21, 2019
c7d8ed8
a really important blank line. REALLY important.
Jul 21, 2019
a989c25
Revert "a really important blank line. REALLY important."
Jul 21, 2019
4a95817
a really important blank line. REALLY important. It is the best blank…
Jul 21, 2019
7b232be
don't send failure emails to disabled users
Jul 25, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions redash/settings/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -212,6 +212,10 @@ def email_server_is_configured():

HOST = os.environ.get('REDASH_HOST', '')

SEND_EMAIL_ON_FAILED_SCHEDULED_QUERIES = parse_boolean(os.environ.get('REDASH_SEND_EMAIL_ON_FAILED_SCHEDULED_QUERIES', 'false'))
SEND_FAILURE_EMAIL_INTERVAL = int(os.environ.get('REDASH_SEND_FAILURE_EMAIL_INTERVAL', 3600))
MAX_FAILURE_REPORTS_PER_QUERY = int(os.environ.get('REDASH_MAX_FAILURE_REPORTS_PER_QUERY', 100))
arikfr marked this conversation as resolved.
Show resolved Hide resolved

ALERTS_DEFAULT_MAIL_SUBJECT_TEMPLATE = os.environ.get('REDASH_ALERTS_DEFAULT_MAIL_SUBJECT_TEMPLATE', "({state}) {alert_name}")

# How many requests are allowed per IP to the login page before
Expand Down
1 change: 1 addition & 0 deletions redash/tasks/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
from .general import record_event, version_check, send_mail, sync_user_details
from .queries import QueryTask, refresh_queries, refresh_schemas, cleanup_query_results, execute_query
from .alerts import check_alerts_for_query
from .failure_report import notify_of_failure
65 changes: 65 additions & 0 deletions redash/tasks/failure_report.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
import datetime
from collections import Counter
from flask import render_template
from redash.tasks.general import send_mail
from redash.worker import celery
from redash import redis_connection, settings
from redash.utils import json_dumps, json_loads, base_url


@celery.task(name="redash.tasks.send_aggregated_errors")
def send_aggregated_errors(email_address):
key = 'aggregated_failures:{}'.format(email_address)
arikfr marked this conversation as resolved.
Show resolved Hide resolved
errors = [json_loads(e) for e in redis_connection.lrange(key, 0, -1)]

if errors:
errors.reverse()
occurrences = Counter((e.get('id'), e.get('message')) for e in errors)
unique_errors = {(e.get('id'), e.get('message')): e for e in errors}

context = {
'failures': [{
'base_url': v.get('base_url'),
'id': v.get('id'),
'name': v.get('name'),
'failed_at': v.get('failed_at'),
'failure_reason': v.get('message'),
'failure_count': occurrences[k],
'comment': v.get('comment')
} for k, v in unique_errors.iteritems()]
}

html = render_template('emails/failures.html', **context)
text = render_template('emails/failures.txt', **context)
subject = "Redash failed to execute {} of your queries".format(len(unique_errors.keys()))
rauchy marked this conversation as resolved.
Show resolved Hide resolved
send_mail.delay([email_address], subject, html, text)

redis_connection.delete(key)


def notify_of_failure(message, query):
if not settings.SEND_EMAIL_ON_FAILED_SCHEDULED_QUERIES:
return
arikfr marked this conversation as resolved.
Show resolved Hide resolved

if query.schedule_failures < settings.MAX_FAILURE_REPORTS_PER_QUERY:
key = 'aggregated_failures:{}'.format(query.user.email)
reporting_will_soon_stop = query.schedule_failures > settings.MAX_FAILURE_REPORTS_PER_QUERY * 0.75
comment = """This query has failed a total of {failure_count} times.
rauchy marked this conversation as resolved.
Show resolved Hide resolved
Reporting may stop when the query exceeds {max_failure_reports} overall failures.""".format(
failure_count=query.schedule_failures,
max_failure_reports=settings.MAX_FAILURE_REPORTS_PER_QUERY
) if reporting_will_soon_stop else ''

redis_connection.lpush(key, json_dumps({
'id': query.id,
'name': query.name,
'base_url': base_url(query.org),
arikfr marked this conversation as resolved.
Show resolved Hide resolved
'message': message,
'comment': comment,
'failed_at': datetime.datetime.utcnow().strftime("%B %d, %Y %I:%M%p UTC")
}))

if not redis_connection.exists('{}:pending'.format(key)):
send_aggregated_errors.apply_async(args=(query.user.email,), countdown=settings.SEND_FAILURE_EMAIL_INTERVAL)
arikfr marked this conversation as resolved.
Show resolved Hide resolved
redis_connection.set('{}:pending'.format(key), 1)
redis_connection.expire('{}:pending'.format(key), settings.SEND_FAILURE_EMAIL_INTERVAL)
arikfr marked this conversation as resolved.
Show resolved Hide resolved
15 changes: 11 additions & 4 deletions redash/tasks/queries.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
import logging
import signal
import time

import redis
from celery.exceptions import SoftTimeLimitExceeded, TimeLimitExceeded
from celery.result import AsyncResult
Expand All @@ -11,10 +10,12 @@
from redash import models, redis_connection, settings, statsd_client
from redash.query_runner import InterruptException
from redash.tasks.alerts import check_alerts_for_query
from redash.tasks.failure_report import notify_of_failure
from redash.utils import gen_query_hash, json_dumps, utcnow, mustache_render
from redash.worker import celery

logger = get_task_logger(__name__)
TIMEOUT_MESSAGE = "Query exceeded Redash query execution time limit."


def _job_lock_id(query_hash, data_source_id):
Expand Down Expand Up @@ -56,7 +57,7 @@ def to_dict(self):
status = self.STATUSES[task_status]

if isinstance(result, (TimeLimitExceeded, SoftTimeLimitExceeded)):
error = "Query exceeded Redash query execution time limit."
error = TIMEOUT_MESSAGE
status = 4
elif isinstance(result, Exception):
error = result.message
Expand Down Expand Up @@ -142,7 +143,7 @@ def enqueue_query(query, data_source, user_id, is_api_key=False, scheduled_query
result = execute_query.apply_async(args=args,
argsrepr=argsrepr,
queue=queue_name,
time_limit=time_limit)
soft_time_limit=time_limit)

job = QueryTask(async_result=result)
logging.info("[%s] Created new job: %s", query_hash, job.id)
Expand Down Expand Up @@ -322,7 +323,11 @@ def run(self):
try:
data, error = query_runner.run_query(annotated_query, self.user)
except Exception as e:
error = text_type(e)
if isinstance(e, SoftTimeLimitExceeded):
error = TIMEOUT_MESSAGE
else:
error = text_type(e)

data = None
logging.warning('Unexpected error while running query:', exc_info=1)

Expand All @@ -338,6 +343,7 @@ def run(self):
self.scheduled_query = models.db.session.merge(self.scheduled_query, load=False)
self.scheduled_query.schedule_failures += 1
models.db.session.add(self.scheduled_query)
notify_of_failure(error, self.scheduled_query)
models.db.session.commit()
raise result
else:
Expand Down Expand Up @@ -395,5 +401,6 @@ def execute_query(self, query, data_source_id, metadata, user_id=None,
scheduled_query = models.Query.query.get(scheduled_query_id)
else:
scheduled_query = None

return QueryExecutor(self, query, data_source_id, user_id, is_api_key, metadata,
scheduled_query).run()
45 changes: 45 additions & 0 deletions redash/templates/emails/failures.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
<html>

<head>
<style>
body {
margin: 0;
}
</style>
</head>

<body style="margin: 0;">
<div
style="background: #f6f8f9; font-family: arial; font-size: 14px; padding: 20px; color: #333; line-height: 20px">
<div style="padding-bottom: 23px; margin-bottom: 22px; border-bottom: 1px solid #e4e4e4;">
<img height="50"
src="" />
</div>
<h1 style="font-size: 14px; font-weight: normal;">Redash failed to run the following queries:</h1>

{% for failure in failures %}
<div style="background: white;padding: 5px 20px 20px 20px; border-radius: 3px; box-shadow: 0 4px 9px -3px rgba(102,136,153,.15); margin-top: 20px">
<h2 style="font-size: 16px; position: relative">{{failure.name}} <a href="{{failure.base_url}}/queries/{{failure.id}}" style="background-color: #ff7964;font-size: 13px;color: white;text-decoration: none;padding: 0 5px;display: inline-block;border-radius: 3px;font-weight: normal;position: absolute;right: 0;top:-1px">Go
rauchy marked this conversation as resolved.
Show resolved Hide resolved
to Query</a></h2>
<div>
<p>
Last failed: {{failure.failed_at}}
{% if failure.failure_count > 1 %}
<br />
<small style="color: #8c8c8c;font-size: 13px;">&nbsp; + {{failure.failure_count - 1}} more failures since last report</small>
{% endif %}
</p>
<h3 style="font-size: 14px; font-weight:normal; margin-bottom: 6px">Exception</h3>
<div style="background: #ececec;padding: 8px;border-radius: 3px;font-family: monospace; overflow: scroll;">{{failure.failure_reason}}</div>
rauchy marked this conversation as resolved.
Show resolved Hide resolved
</div>

{% if failure.comment %}
<div style="margin-top: 25px;font-size: 13px;border-top: 1px solid #ececec;padding-top: 19px;">{{failure.comment}}</div>
ranbena marked this conversation as resolved.
Show resolved Hide resolved
{% endif %}
</div>
{% endfor %}

</div>
</body>

</html>
15 changes: 15 additions & 0 deletions redash/templates/emails/failures.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
Redash failed to run the following queries:

{% for failure in failures %}

{{failure.name}} ({{failure.base_url}}/queries/{{failure.id}})
Last failed: {{failure.failed_at}}{% if failure.failure_count > 1 %} + {{failure.failure_count - 1}} more failures since last report{% endif %}
Exception:

{{failure.failure_reason}}

{% if failure.comment %}
{{failure.comment}}
{% endif %}

{% endfor %}
91 changes: 91 additions & 0 deletions tests/tasks/test_failure_report.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
from unittest import TestCase

import mock
from freezegun import freeze_time
import dateutil

from tests import BaseTestCase
from redash import redis_connection, models, settings
from redash.tasks.failure_report import notify_of_failure, send_aggregated_errors
from redash.utils import json_loads

class TestSendAggregatedErrorsTask(BaseTestCase):
def setUp(self):
super(TestSendAggregatedErrorsTask, self).setUp()
redis_connection.flushall()
settings.SEND_EMAIL_ON_FAILED_SCHEDULED_QUERIES = True

def tearDown(self):
super(TestSendAggregatedErrorsTask, self).tearDown()
settings.SEND_EMAIL_ON_FAILED_SCHEDULED_QUERIES = False

def notify(self, message="Oh no, I failed!", query=None, **kwargs):
if query is None:
query = self.factory.create_query(**kwargs)

notify_of_failure(message, query)
return "aggregated_failures:{}".format(query.user.email)

def test_schedules_email_if_failure_count_is_beneath_limit(self):
key = self.notify(schedule_failures=settings.MAX_FAILURE_REPORTS_PER_QUERY - 1)
email_pending = redis_connection.get("{}:pending".format(key))
self.assertTrue(email_pending)

def test_does_not_report_if_failure_count_is_beyond_limit(self):
key = self.notify(schedule_failures=settings.MAX_FAILURE_REPORTS_PER_QUERY)
email_pending = redis_connection.get("{}:pending".format(key))
self.assertFalse(email_pending)

def test_does_not_indicate_when_not_near_limit_for_a_query(self):
key = self.notify(schedule_failures=settings.MAX_FAILURE_REPORTS_PER_QUERY / 2)
failure = json_loads(redis_connection.lrange(key, 0, -1)[0])
comment = failure.get('comment')
self.assertFalse(comment)

def test_indicates_when_near_limit_for_a_query(self):
key = self.notify(schedule_failures=settings.MAX_FAILURE_REPORTS_PER_QUERY - 1)
failure = json_loads(redis_connection.lrange(key, 0, -1)[0])
comment = failure.get('comment')
self.assertTrue(comment)

def test_aggregates_different_queries_in_a_single_report(self):
key1 = self.notify(message="I'm a failure")
key2 = self.notify(message="I'm simply not a success")

self.assertEqual(key1, key2)

@mock.patch('redash.tasks.failure_report.render_template')
def test_counts_failures_for_each_reason(self, render_template):
query = self.factory.create_query()

self.notify(message="I'm a failure", query=query)
self.notify(message="I'm a failure", query=query)
self.notify(message="I'm a different type of failure", query=query)
self.notify(message="I'm a totally different query")

send_aggregated_errors(query.user.email)

_, context = render_template.call_args
failures = context['failures']

f1 = next(f for f in failures if f["failure_reason"] == "I'm a failure")
self.assertEqual(2, f1['failure_count'])
f2 = next(f for f in failures if f["failure_reason"] == "I'm a different type of failure")
self.assertEqual(1, f2['failure_count'])
f3 = next(f for f in failures if f["failure_reason"] == "I'm a totally different query")
self.assertEqual(1, f3['failure_count'])

@mock.patch('redash.tasks.failure_report.render_template')
def test_shows_latest_failure_time(self, render_template):
query = self.factory.create_query()

with freeze_time("2000-01-01"):
self.notify(query=query)

self.notify(query=query)

send_aggregated_errors(query.user.email)

_, context = render_template.call_args
latest_failure = dateutil.parser.parse(context['failures'][0]['failed_at'])
self.assertNotEqual(2000, latest_failure.year)
2 changes: 1 addition & 1 deletion tests/tasks/test_queries.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ def test_limits_query_time(self, _):
enqueue_query(query.query_text, query.data_source, query.user_id, False, query, {'Username': 'Arik', 'Query ID': query.id})

_, kwargs = execute_query.apply_async.call_args
self.assertEqual(60, kwargs.get('time_limit'))
self.assertEqual(60, kwargs.get('soft_time_limit'))

def test_multiple_enqueue_of_different_query(self):
query = self.factory.create_query()
Expand Down