Aggregate query results (re #35) #339
8654b03 to ba2f39a
Added frontend support.
4ae2fe6 to 80d9ab6
ea9c225 to e2e66c9
This is heading in the right direction but needs some wording and consistency fixes. I've left questions where I couldn't tell what the code is intended to do. Adding code comments and docstrings to the new code would be useful.
@@ -19,4 +19,7 @@ <h4 class="modal-title">Refresh Schedule</h4>
Stop scheduling at date/time (format yyyy-MM-ddTHH:mm:ss, like 2016-12-28T14:57:00):
<schedule-until query="$ctrl.query" save-query="$ctrl.saveQuery"></schedule-until>
</label>
<label>
Number of result sets to keep <schedule-keep-results query="$ctrl.query" save-query="$ctrl.saveQuery"></schedule-keep-results>
I think this needs to be more descriptive: "Number of query results to keep (leave empty to keep only the most recent result)", or something like this?
It's kind of a small space. I've added a check so that entering 1 and leaving it blank do the same thing.
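A check like that could be sketched as follows. This is only an illustration in Python: the helper name `normalize_keep_results` is hypothetical, and representing "keep only the latest result" as `None` is an assumption, not something taken from the PR.

```python
def normalize_keep_results(raw_value):
    """Hypothetical helper: treat blank, 0 and 1 the same way.

    Returns None when only the most recent result should be kept,
    otherwise the number of results to keep as an int.
    """
    # A blank form field may arrive as None or as an empty string.
    if raw_value is None or raw_value == "":
        return None
    count = int(raw_value)
    # Keeping a single result is the default behaviour, so collapse it to None.
    if count <= 1:
        return None
    return count
```

With this shape, the rest of the code only has to distinguish `None` from "an integer greater than 1".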
query: '=',
saveQuery: '=',
},
template: '<input type="number" class="form-control" ng-model="query.schedule_keep_results" ng-change="saveQuery()">',
This is the directive used in schedule-dialog.html above?
yes
const ScheduleForm = {
controller() {
this.query = this.resolve.query;
this.saveQuery = this.resolve.saveQuery;

this.isIncremental = false;
Is this needed?
client/app/services/query-result.js
Outdated
@@ -54,6 +54,7 @@ function addPointToSeries(point, seriesCollection, seriesName) {

function QueryResultService($resource, $timeout, $q) {
const QueryResultResource = $resource('api/query_results/:id', { id: '@id' }, { post: { method: 'POST' } });
const QueryAggregateResultResource = $resource('api/queries/:id/aggregate_results', { id: '@id' });
I think we should keep the terminology straight: let's use QueryResultSetResource instead of QueryAggregateResultResource to stay consistent with the Python API, and 'api/queries/:id/resultset' for the URL. I worry that the term "aggregated" would lead people to think some kind of database aggregation is involved, when it's really just a list of query results we refer to.
client/app/services/query-result.js
Outdated
@@ -421,6 +422,15 @@ function QueryResultService($resource, $timeout, $q) {
return queryResult;
}

static getAggregate(queryId) {
s/getAggregate/getResultSet/g
class QueryResultSet(db.Model):
query_id = Column(db.Integer, db.ForeignKey("queries.id"),
primary_key=True)
query_rel = db.relationship(Query)
I think this should just be query, no need for the extra suffix.
The suffix is to avoid collision with db.Model.query.
redash/models.py
Outdated
queries = Query.query.filter(Query.schedule_keep_results != None).order_by(Query.schedule_keep_results.desc())
if queries.first() and queries[0].schedule_keep_results:
resultsets = QueryResultSet.query.filter(QueryResultSet.query_rel == queries[0]).order_by(QueryResultSet.result_id)
c = resultsets.count()
Please use a longer variable name.
redash/models.py
Outdated
n_to_delete = c - queries[0].schedule_keep_results
r_ids = [r.result_id for r in resultsets][:n_to_delete]
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
print "one", delete_count
Please delete this debugging print.
redash/models.py
Outdated
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
print "one", delete_count
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
for q in queries[1:]:
I don't understand why there is another loop here?
comments added
tests/handlers/test_queries.py
Outdated
q = self.factory.create_query(query_text=qtxt, schedule_keep_results=3)
qr0 = self.factory.create_query_result(
query_text=qtxt,
data = json.dumps({'columns': ['name', 'color'],
..data=json.dumps..
4ec4ac6 to 11be740
I've addressed the issues you mentioned. Before merging this we'll need to roll staging db back by one migration.
11be740 to 16f730a
Should we make the dropdown for number of query results to keep disabled when a refresh schedule isn't set?
I'm also wondering whether we can put some constraint on that value, because there is no maximum for it at the moment.
@@ -200,7 +200,7 @@ function preparePieData(seriesList, options) {
labels,
type: 'pie',
hole: 0.4,
marker: { colors: ColorPaletteArray },
huh. Did this change make it into this PR by accident?
You're right, that revision shouldn't be in this branch. Rebase gets things right most of the time, and I wasn't vigilant the last time I pushed this branch.

def downgrade():
op.drop_column(u'queries', 'schedule_keep_results')
Was schedule_keep_results an unused field before? I don't see any other references to it.
This migration went into master before it should have -- the field was misnamed. We're going to roll back the misnamed field on stage before this PR gets merged.
redash/handlers/query_results.py
Outdated
# Synthesize a result set from the last N results.
total = len(query.query_results)
offset = max(total - query.schedule_resultset_size, 0)
results = [qr.to_dict() for qr in query.query_results[offset:offset + total]]
Wouldn't offset + total potentially index the query.query_results list outside of its length if offset > 0, since total = len(query.query_results)? Would [offset:] make more sense?
Good catch.
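For reference, Python slices clamp an out-of-range stop index instead of raising, so the original expression would not actually fail; `[offset:]` is simply the clearer spelling. A minimal sketch with plain lists standing in for `query.query_results` (the helper name `last_n_results` is hypothetical):

```python
def last_n_results(query_results, resultset_size):
    """Return the newest `resultset_size` items of an oldest-first list."""
    total = len(query_results)
    offset = max(total - resultset_size, 0)
    return query_results[offset:]


results = ["r1", "r2", "r3", "r4", "r5"]
offset = max(len(results) - 2, 0)
# The stop index here is 3 + 5 = 8, past the end of the list, but slicing
# clamps it, so both spellings select the same items.
assert results[offset:offset + len(results)] == results[offset:]
assert last_n_results(results, 2) == ["r4", "r5"]
```

Asking for more results than exist just returns the whole list, because `offset` is floored at 0.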
redash/handlers/query_results.py
Outdated
if not results:
aggregate_result = {}
else:
aggregate_result = results[0].copy()
I see why you copy only the first result here: you later replace its data attribute with the newly computed results above. Perhaps a brief comment explaining this would make it less confusing to read?
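A commented version of that pattern might look roughly like the sketch below. The shape of each result dict (a `data` key holding `columns` and `rows`) and the function name `synthesize_result_set` are assumptions made for illustration; the handler in the PR may combine the data differently.

```python
def synthesize_result_set(results):
    """Combine several result dicts into one, reusing the first as a template.

    Assumes every result has the same columns and a "data" dict with
    "columns" and "rows" keys (an assumption, not taken from the PR).
    """
    if not results:
        return {}
    # Copy the first result only for its metadata (id, query text, ...);
    # its "data" entry is replaced below with the rows of all results.
    combined = results[0].copy()
    combined["data"] = {
        "columns": results[0]["data"]["columns"],
        "rows": [row for result in results for row in result["data"]["rows"]],
    }
    return combined
```

Because `data` is reassigned on the copy rather than mutated in place, the first result in the input list is left unchanged.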
redash/models.py
Outdated
def delete_stale_resultsets(cls):
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
# Multiple queries with the same text may request multiple result sets
I assume when you say "queries with the same text" you mean same SQL?
Yes. (Though of course this is used for non-SQL data sources as well.)
redash/models.py
Outdated
# be kept. We start with the one that keeps the most, and delete both
# the unneeded bridge rows and result sets.
first_query = queries.first()
if first_query is not None and queries[0].schedule_resultset_size:
New to sqlalchemy here, how is queries.first() different from queries[0]? Can we use the first_query variable throughout instead?
"Return the first result of this Query or None if the result doesn’t contain any row." (http://docs.sqlalchemy.org/en/rel_1_1/orm/query.html#sqlalchemy.orm.query.Query.first)
You're right that it's a bit odd to then use queries[0] here; fixed.
redash/models.py
Outdated
@classmethod
def delete_stale_resultsets(cls):
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
What if a user set schedule_resultset_size, we captured a bunch of results, and then the user unset it? Would that mean we wouldn't look at that stale data here?
Correct. Those are handled before this is called, in cleanup_query_results.
ddd9e8f to 49e6c5d
Fixed a few of the things you mentioned.
redash/models.py
Outdated
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
# Multiple queries with the same text may request multiple result sets
# be kept. We start with the one that keeps the most, and delete both
Do we "start with the one that keeps the most" because we want to limit how much deleting is done in cleanup_query_results()? If so, maybe add a comment on that? I was confused at first about why we only look at the first query, and the deletion limit in cleanup_query_results() is my best guess at why.
We start with the one that keeps the most because starting with any other would delete results that still need to be kept.
redash/models.py
Outdated
n_to_delete = resultset_count - first_query.schedule_resultset_size
r_ids = [r.result_id for r in resultsets][:n_to_delete]
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
Is there a reason why these deletes don't contribute to the delete_count but the QueryResultSet deletes do? It seems that in cleanup_query_results() originally only QueryResult deletes were being counted, so maybe these need to be counted too?
No idea how I made that mistake. Fixed.
redash/models.py
Outdated
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
# Delete unneeded bridge rows for the remaining queries.
for q in queries[1:]:
I'm not sure I follow what's happening in this loop. My understanding is that we look at the query that keeps the most results and delete its stale results and bridge rows, like you said. That is what happens in the code above.
Since other queries may have pointed to the stale results that were deleted, shouldn't the bridge rows be deleted for those too? That's what I would have expected this loop to do. But it seems to check whether each query has more stale results, similar to the check that was done for first_query.
In general, some of the code in this loop looks similar to what is done for first_query. I'm wondering whether factoring it out into a separate function would make it clearer, or whether the operations on first_query could be included in this loop.
By this point, when the loop starts, there are no stale result sets left. (I'll add a comment pointing this out.)
So all that has to be deleted are bridge rows for queries that have requested fewer resultsets be kept than the first one.
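The ordering described in this thread can be illustrated with a simplified in-memory model. This is not the SQLAlchemy code from the PR: the function name `prune_result_sets`, the dict-based inputs, and the assumption that all queries share the same text (and therefore reference the same results) are made up for the sketch.

```python
def prune_result_sets(keep_sizes, bridge_rows):
    """Simplified model of the pruning order (hypothetical helper).

    keep_sizes:  {query_id: number of results that query wants kept}
    bridge_rows: {query_id: [result_id, ...]}, ordered oldest first
    Returns (remaining_bridge_rows, deleted_result_ids).
    """
    # Visit the query that keeps the most results first.
    ordered = sorted(keep_sizes, key=keep_sizes.get, reverse=True)
    remaining = {}
    deleted_results = set()
    for position, query_id in enumerate(ordered):
        rows = bridge_rows.get(query_id, [])
        excess = max(len(rows) - keep_sizes[query_id], 0)
        stale, kept = rows[:excess], rows[excess:]
        if position == 0:
            # Only the largest keeper decides which results are truly stale;
            # anything it still needs must survive for everyone.
            deleted_results.update(stale)
        # Every other query only drops bridge rows beyond its own limit.
        remaining[query_id] = kept
    return remaining, deleted_results


remaining, deleted = prune_result_sets(
    {1: 3, 2: 2},
    {1: [10, 11, 12, 13, 14], 2: [10, 11, 12, 13, 14]},
)
assert remaining == {1: [12, 13, 14], 2: [13, 14]}
assert deleted == {10, 11}
```

In the example, result 12 survives even though query 2 no longer references it, because query 1 still keeps it. Starting with query 2 instead would have marked 12 as stale, which is the problem the ordering avoids.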
This provides UI for keeping multiple query results and backend support for storing and aggregating them (and deleting them when no longer needed).
Still needs frontend work for choosing to display aggregated results.
fixes #35