
Aggregate query results (re #35) #339

Merged (7 commits, Mar 27, 2018)
Conversation

@washort washort commented Feb 20, 2018

This provides UI for keeping multiple query results and backend support for storing and aggregating them (and deleting them when no longer needed).

Still needs frontend work for choosing to display aggregated results.

fixes #35

@rafrombrc rafrombrc added this to the 13 milestone Feb 21, 2018
@washort washort force-pushed the incremental-jobs-35 branch 2 times, most recently from 8654b03 to ba2f39a on February 23, 2018 16:36
washort (Author) commented Feb 23, 2018

Added frontend support.

@washort washort requested a review from jezdez March 1, 2018 22:14
@jezdez jezdez force-pushed the master branch 2 times, most recently from 4ae2fe6 to 80d9ab6 on March 5, 2018 21:35
@washort washort force-pushed the incremental-jobs-35 branch 2 times, most recently from ea9c225 to e2e66c9 on March 6, 2018 18:35
jezdez previously requested changes Mar 6, 2018

@jezdez jezdez left a comment:

This goes in the right direction and needs some wording and consistency fixes. I've left some questions where I didn't know what the code is intended to do. I think leaving some code comments and docstrings for the new code would be useful.

@@ -19,4 +19,7 @@ <h4 class="modal-title">Refresh Schedule</h4>
Stop scheduling at date/time (format yyyy-MM-ddTHH:mm:ss, like 2016-12-28T14:57:00):
<schedule-until query="$ctrl.query" save-query="$ctrl.saveQuery"></schedule-until>
</label>
<label>
Number of result sets to keep <schedule-keep-results query="$ctrl.query" save-query="$ctrl.saveQuery"></schedule-keep-results>
jezdez:
I think this needs to be more descriptive: "Number of query results to keep (leave empty to not keep more than the recent result)" or something like this?

washort (Author):
Kind of a small space. I've added a check so that entering 1 or leaving it blank does the same thing.

query: '=',
saveQuery: '=',
},
template: '<input type="number" class="form-control" ng-model="query.schedule_keep_results" ng-change="saveQuery()">',
jezdez:
This is the directive used in schedule-dialog.html above?

washort (Author):
yes

const ScheduleForm = {
controller() {
this.query = this.resolve.query;
this.saveQuery = this.resolve.saveQuery;

this.isIncremental = false;
jezdez:
Is this needed?

@@ -54,6 +54,7 @@ function addPointToSeries(point, seriesCollection, seriesName) {

function QueryResultService($resource, $timeout, $q) {
const QueryResultResource = $resource('api/query_results/:id', { id: '@id' }, { post: { method: 'POST' } });
const QueryAggregateResultResource = $resource('api/queries/:id/aggregate_results', { id: '@id' });
jezdez:
I think we should keep the terminology straight: let's use QueryResultSetResource instead of QueryAggregateResultResource to stay consistent with the Python API, and 'api/queries/:id/resultset' for the URL. I'm worried the term "aggregate" could mislead people into thinking some kind of database aggregation is involved, when it's really just a list of query results.

@@ -421,6 +422,15 @@ function QueryResultService($resource, $timeout, $q) {
return queryResult;
}

static getAggregate(queryId) {
jezdez:
s/getAggregate/getResultSet/g

class QueryResultSet(db.Model):
query_id = Column(db.Integer, db.ForeignKey("queries.id"),
primary_key=True)
query_rel = db.relationship(Query)
jezdez:
I think this should just be query, no need for the extra suffix.

washort (Author):
The suffix is to avoid collision with db.Model.query.
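The shadowing concern can be shown without SQLAlchemy at all. The sketch below uses a stand-in `Model` class (hypothetical, for illustration only) that exposes a `query` class attribute the way Flask-SQLAlchemy's `db.Model` does:

```python
# Minimal sketch (no SQLAlchemy required): Flask-SQLAlchemy attaches a
# `query` class attribute to every model, so naming the relationship
# `query` on QueryResultSet would shadow that attribute.

class Model:
    """Stand-in for flask_sqlalchemy's db.Model."""
    query = "<query interface for this model>"  # normally a BaseQuery


class QueryResultSetBad(Model):
    # Shadows Model.query: the model loses its query interface.
    query = "<relationship to Query>"


class QueryResultSetGood(Model):
    # The _rel suffix avoids the collision; Model.query stays usable.
    query_rel = "<relationship to Query>"


assert QueryResultSetBad.query == "<relationship to Query>"
assert QueryResultSetGood.query == "<query interface for this model>"
assert QueryResultSetGood.query_rel == "<relationship to Query>"
```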

redash/models.py Outdated
queries = Query.query.filter(Query.schedule_keep_results != None).order_by(Query.schedule_keep_results.desc())
if queries.first() and queries[0].schedule_keep_results:
resultsets = QueryResultSet.query.filter(QueryResultSet.query_rel == queries[0]).order_by(QueryResultSet.result_id)
c = resultsets.count()
jezdez:
Please use a longer variable name.

redash/models.py Outdated
n_to_delete = c - queries[0].schedule_keep_results
r_ids = [r.result_id for r in resultsets][:n_to_delete]
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
print "one", delete_count
jezdez:
delete

redash/models.py Outdated
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
print "one", delete_count
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
for q in queries[1:]:
jezdez:
I don't understand why there is another loop here?

washort (Author):
comments added

q = self.factory.create_query(query_text=qtxt, schedule_keep_results=3)
qr0 = self.factory.create_query_result(
query_text=qtxt,
data = json.dumps({'columns': ['name', 'color'],
jezdez:
..data=json.dumps..

@washort washort force-pushed the incremental-jobs-35 branch from 4ec4ac6 to 11be740 on March 8, 2018 06:09
washort (Author) commented Mar 8, 2018

I've addressed the issues you mentioned. Before merging this we'll need to roll the staging db back by one migration.

@washort washort dismissed jezdez’s stale review March 12, 2018 14:20

changes made

@washort washort force-pushed the incremental-jobs-35 branch from 11be740 to 16f730a on March 12, 2018 14:32
@emtwo emtwo self-assigned this Mar 21, 2018
@emtwo emtwo left a comment:

Should we make the dropdown for number of query results to keep disabled when a refresh schedule isn't set?

I'm also wondering if maybe we can put some constraint on that value because there is no maximum for it at the moment.
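One possible way to add such a bound, sketched here with a hypothetical `MAX_RESULTSET_SIZE` constant and `normalize_resultset_size` helper that are not part of the PR (this also folds in the earlier point that 1 and blank should behave the same):

```python
MAX_RESULTSET_SIZE = 100  # hypothetical upper bound, not in the PR


def normalize_resultset_size(value):
    """Clamp a user-supplied 'results to keep' value.

    None or anything <= 1 means 'keep only the latest result';
    larger values are capped at MAX_RESULTSET_SIZE.
    """
    if value is None or value <= 1:
        return None
    return min(value, MAX_RESULTSET_SIZE)


assert normalize_resultset_size(None) is None
assert normalize_resultset_size(1) is None       # blank and 1 are equivalent
assert normalize_resultset_size(5) == 5
assert normalize_resultset_size(10**6) == MAX_RESULTSET_SIZE
```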

@@ -200,7 +200,7 @@ function preparePieData(seriesList, options) {
labels,
type: 'pie',
hole: 0.4,
marker: { colors: ColorPaletteArray },
emtwo:
huh. Did this change make it into this PR by accident?

washort (Author):
... that revision shouldn't be in this branch, you're right. Rebase gets things right most of the time, and I wasn't vigilant the last time I pushed this branch.


def downgrade():
op.drop_column(u'queries', 'schedule_keep_results')
emtwo:
Was schedule_keep_results an unused field before? I don't see any other references to it.

washort (Author):
This migration went into master before it should have -- the field was misnamed. We're going to roll back the misnamed field on stage before this PR gets merged.

# Synthesize a result set from the last N results.
total = len(query.query_results)
offset = max(total - query.schedule_resultset_size, 0)
results = [qr.to_dict() for qr in query.query_results[offset:offset + total]]
emtwo:
Wouldn't offset + total potentially index the query.query_results list outside of its length if offset > 0 since total = len(query.query_results)? Would [offset:] make more sense?

washort (Author):
Good catch.
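The off-by-slice concern can be checked with plain lists. This sketch (illustrative names, not the PR's actual code) shows why `[offset:]` is the safe form:

```python
def last_n_results(query_results, resultset_size):
    """Return the last `resultset_size` results, as the endpoint should."""
    total = len(query_results)
    offset = max(total - resultset_size, 0)
    # Use [offset:] rather than [offset:offset + total]; the latter's end
    # index exceeds the list length whenever offset > 0 (harmless for
    # Python lists, but it signals the wrong intent).
    return query_results[offset:]


assert last_n_results([1, 2, 3, 4, 5], 3) == [3, 4, 5]
assert last_n_results([1, 2], 3) == [1, 2]  # fewer results than requested
assert last_n_results([], 3) == []
```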

if not results:
aggregate_result = {}
else:
aggregate_result = results[0].copy()
emtwo:
I see why you copied the first result only here because you later replace the data attribute with the newly computed results above. Perhaps this can be less confusing to read with just a brief comment to explain this?
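A brief sketch of the copy-then-replace pattern being discussed, with illustrative dict fields rather than the PR's actual schema:

```python
def synthesize_aggregate(results):
    """Merge a list of result dicts into one synthetic result.

    The first result is copied to carry the shared metadata (query text,
    columns, ...); only its 'data' is then replaced with the concatenated
    rows of every result in the window.
    """
    if not results:
        return {}
    aggregate = results[0].copy()  # metadata template, not the final data
    aggregate['data'] = {
        'columns': results[0]['data']['columns'],
        'rows': [row for r in results for row in r['data']['rows']],
    }
    return aggregate


r1 = {'query': 'q', 'data': {'columns': ['n'], 'rows': [{'n': 1}]}}
r2 = {'query': 'q', 'data': {'columns': ['n'], 'rows': [{'n': 2}]}}
agg = synthesize_aggregate([r1, r2])
assert agg['data']['rows'] == [{'n': 1}, {'n': 2}]
assert agg['query'] == 'q'
assert synthesize_aggregate([]) == {}
```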

redash/models.py Outdated
def delete_stale_resultsets(cls):
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
# Multiple queries with the same text may request multiple result sets
emtwo:
I assume when you say queries with the same text you mean same SQL?

washort (Author):
Yes. (Though of course this is used for non-SQL data sources as well.)

redash/models.py Outdated
# be kept. We start with the one that keeps the most, and delete both
# the unneeded bridge rows and result sets.
first_query = queries.first()
if first_query is not None and queries[0].schedule_resultset_size:
@emtwo emtwo Mar 22, 2018:
New to sqlalchemy here, how is queries.first() different from queries[0]? Can we use the first_query variable throughout instead?

washort (Author):
From the SQLAlchemy docs (http://docs.sqlalchemy.org/en/rel_1_1/orm/query.html#sqlalchemy.orm.query.Query.first): "Return the first result of this Query or None if the result doesn’t contain any row."

You're right that it's a bit odd to then use queries[0] here; fixed.

redash/models.py Outdated
@classmethod
def delete_stale_resultsets(cls):
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
emtwo:
What if schedule_resultset_size was set by a user and we captured a bunch of results. Then it was unset by the user. Would this mean we wouldn't look at this stale data here?

washort (Author):
Correct. Those are handled before this is called, in cleanup_query_results.

@washort washort force-pushed the incremental-jobs-35 branch from ddd9e8f to 49e6c5d on March 23, 2018 18:47
washort (Author) commented Mar 23, 2018

Fixed a few of the things you mentioned.

redash/models.py Outdated
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
# Multiple queries with the same text may request multiple result sets
# be kept. We start with the one that keeps the most, and delete both
emtwo:
Do we start with the one that keeps the most because we want to limit how much deleting is done in cleanup_query_results()? If so, maybe add a comment on that? I was confused at first about why we only look at the first query; the deleting limit in cleanup_query_results() is my best guess at why.

washort (Author):
We start with the one that keeps the most because if we start with any others it'd delete results that need to be kept.

redash/models.py Outdated
n_to_delete = resultset_count - first_query.schedule_resultset_size
r_ids = [r.result_id for r in resultsets][:n_to_delete]
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
emtwo:
Is there a reason why these deletes don't contribute to the delete_count but the QueryResultSet deletes do? It seems that in cleanup_query_results() originally only QueryResult deletes were being counted so maybe these need to be counted too?

washort (Author):
No idea how I made that mistake. Fixed.

redash/models.py Outdated
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
# Delete unneeded bridge rows for the remaining queries.
for q in queries[1:]:
emtwo:
I'm not sure I follow what's happening in this loop. My understanding is that we look at the query with the largest allowable number of query results and delete its stale results and bridge rows, like you said. That is what happens in the code above.

Since other queries may have pointed to the stale results that were deleted, the bridge rows should be deleted for those too? That's what I would have expected this loop to be doing. But it seems to be checking whether the other queries have more stale results, similar to the check that was done for first_query?

I think in general there is some code in this loop that is similar to what is done for first_query, and I'm wondering if factoring some of it out into a separate function might make it clearer, or if it's possible to include the operations on first_query in this loop?

washort (Author):
By this point, when the loop starts, there are no stale result sets left. (I'll add a comment pointing this out.)

So all that has to be deleted are bridge rows for queries that have requested fewer resultsets be kept than the first one.
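The two-phase cleanup described above can be modeled without a database. In this sketch (hypothetical helper, with plain lists standing in for the tables), deleting for the query that keeps the most results first means no stale results remain afterwards, so the other queries only need their surplus bridge rows pruned:

```python
def delete_stale_resultsets(result_ids, keep_sizes):
    """Simulate cleanup for queries sharing one query text.

    result_ids: stored result ids, oldest first.
    keep_sizes: schedule_resultset_size per query.
    Returns (surviving result ids, bridge rows kept per query).
    """
    keep_sizes = sorted(keep_sizes, reverse=True)
    # Phase 1: the query keeping the most results decides which stored
    # results are stale; those results (and their bridge rows) go away.
    n_to_delete = max(len(result_ids) - keep_sizes[0], 0)
    surviving = result_ids[n_to_delete:]
    # Phase 2: no stale results remain, so each remaining query only
    # sheds bridge rows pointing beyond its own keep-count.
    bridges = [surviving[len(surviving) - min(k, len(surviving)):]
               for k in keep_sizes]
    return surviving, bridges


surviving, bridges = delete_stale_resultsets([1, 2, 3, 4, 5], [3, 2])
assert surviving == [3, 4, 5]           # largest keeper keeps the last 3
assert bridges == [[3, 4, 5], [4, 5]]   # smaller keeper bridges only last 2
```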

jezdez pushed a commit that referenced this pull request May 15, 2019
washort pushed a commit that referenced this pull request Jun 10, 2019
jezdez pushed a commit that referenced this pull request Jun 13, 2019
washort pushed a commit that referenced this pull request Jun 27, 2019
washort pushed a commit that referenced this pull request Jun 28, 2019
emtwo pushed a commit that referenced this pull request Jul 15, 2019
emtwo pushed a commit that referenced this pull request Jul 17, 2019
jezdez pushed a commit that referenced this pull request Aug 12, 2019
jezdez pushed a commit that referenced this pull request Aug 14, 2019
jezdez pushed a commit that referenced this pull request Aug 19, 2019
washort pushed a commit that referenced this pull request Sep 16, 2019
emtwo pushed a commit that referenced this pull request Nov 5, 2019
jezdez pushed a commit that referenced this pull request Jan 16, 2020

Successfully merging this pull request may close these issues.

re:dash should support incremental scheduled jobs
4 participants