
Aggregate query results (re #35) #339

Merged (7 commits, Mar 27, 2018)
Conversation

@washort washort commented Feb 20, 2018

This provides UI for keeping multiple query results and backend support for storing and aggregating them (and deleting them when no longer needed).

Still needs frontend work for choosing to display aggregated results.

fixes #35

@rafrombrc rafrombrc added this to the 13 milestone Feb 21, 2018
@washort washort force-pushed the incremental-jobs-35 branch 2 times, most recently from 8654b03 to ba2f39a on February 23, 2018 16:36
washort (Author) commented Feb 23, 2018

Added frontend support.

@washort washort requested a review from jezdez March 1, 2018 22:14
@jezdez jezdez force-pushed the master branch 2 times, most recently from 4ae2fe6 to 80d9ab6 on March 5, 2018 21:35
@washort washort force-pushed the incremental-jobs-35 branch 2 times, most recently from ea9c225 to e2e66c9 on March 6, 2018 18:35
jezdez previously requested changes Mar 6, 2018

@jezdez jezdez left a comment:

This goes in the right direction and needs some wording and consistency fixes. I've left some questions where I didn't know what the code is intended to do. I think leaving some code comments and docstrings for the new code would be useful.

@@ -19,4 +19,7 @@ <h4 class="modal-title">Refresh Schedule</h4>
Stop scheduling at date/time (format yyyy-MM-ddTHH:mm:ss, like 2016-12-28T14:57:00):
<schedule-until query="$ctrl.query" save-query="$ctrl.saveQuery"></schedule-until>
</label>
<label>
Number of result sets to keep <schedule-keep-results query="$ctrl.query" save-query="$ctrl.saveQuery"></schedule-keep-results>
jezdez:
I think this needs to be more descriptive: "Number of query results to keep (leave empty to not keep more than the recent result)" or something like this?

washort (Author):
Kind of a small space. I've added a check so that entering 1 or leaving it blank does the same thing.

query: '=',
saveQuery: '=',
},
template: '<input type="number" class="form-control" ng-model="query.schedule_keep_results" ng-change="saveQuery()">',
jezdez:
This is the directive used in schedule-dialog.html above?

washort (Author):
yes

const ScheduleForm = {
controller() {
this.query = this.resolve.query;
this.saveQuery = this.resolve.saveQuery;

this.isIncremental = false;
jezdez:
Is this needed?

@@ -54,6 +54,7 @@ function addPointToSeries(point, seriesCollection, seriesName) {

function QueryResultService($resource, $timeout, $q) {
const QueryResultResource = $resource('api/query_results/:id', { id: '@id' }, { post: { method: 'POST' } });
const QueryAggregateResultResource = $resource('api/queries/:id/aggregate_results', { id: '@id' });
jezdez:
I think we should keep the terminology straight: let's use QueryResultSetResource instead of QueryAggregateResultResource to stay consistent with the Python API, and 'api/queries/:id/resultset' for the URL. I'm worried the term "aggregate" could mislead people into thinking some kind of database aggregation is involved, when it's really just a list of query results.

@@ -421,6 +422,15 @@ function QueryResultService($resource, $timeout, $q) {
return queryResult;
}

static getAggregate(queryId) {
jezdez:
s/getAggregate/getResultSet/g

class QueryResultSet(db.Model):
query_id = Column(db.Integer, db.ForeignKey("queries.id"),
primary_key=True)
query_rel = db.relationship(Query)
jezdez:
I think this should just be query, no need for the extra suffix.

washort (Author):
The suffix is to avoid collision with db.Model.query.
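The shadowing concern can be shown without SQLAlchemy at all. The sketch below uses a stand-in `Model` class (hypothetical, for illustration only) that exposes a `query` class attribute the way Flask-SQLAlchemy's `db.Model` does:

```python
# Minimal sketch (no SQLAlchemy required): Flask-SQLAlchemy attaches a
# `query` class attribute to every model, so naming the relationship
# `query` on QueryResultSet would shadow that attribute.

class Model:
    """Stand-in for flask_sqlalchemy's db.Model."""
    query = "<query interface for this model>"  # normally a BaseQuery


class QueryResultSetBad(Model):
    # Shadows Model.query: the model loses its query interface.
    query = "<relationship to Query>"


class QueryResultSetGood(Model):
    # The _rel suffix avoids the collision; Model.query stays usable.
    query_rel = "<relationship to Query>"


assert QueryResultSetBad.query == "<relationship to Query>"
assert QueryResultSetGood.query == "<query interface for this model>"
assert QueryResultSetGood.query_rel == "<relationship to Query>"
```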

redash/models.py Outdated
queries = Query.query.filter(Query.schedule_keep_results != None).order_by(Query.schedule_keep_results.desc())
if queries.first() and queries[0].schedule_keep_results:
resultsets = QueryResultSet.query.filter(QueryResultSet.query_rel == queries[0]).order_by(QueryResultSet.result_id)
c = resultsets.count()
jezdez:
Please use a longer variable name.

redash/models.py Outdated
n_to_delete = c - queries[0].schedule_keep_results
r_ids = [r.result_id for r in resultsets][:n_to_delete]
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
print "one", delete_count
jezdez:
delete

redash/models.py Outdated
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
print "one", delete_count
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
for q in queries[1:]:
jezdez:
I don't understand why there is another loop here?

washort (Author):
comments added

q = self.factory.create_query(query_text=qtxt, schedule_keep_results=3)
qr0 = self.factory.create_query_result(
query_text=qtxt,
data = json.dumps({'columns': ['name', 'color'],
jezdez:
..data=json.dumps..

@washort washort force-pushed the incremental-jobs-35 branch from 4ec4ac6 to 11be740 on March 8, 2018 06:09
washort (Author) commented Mar 8, 2018

I've addressed the issues you mentioned. Before merging this we'll need to roll the staging db back by one migration.

@washort washort dismissed jezdez’s stale review March 12, 2018 14:20

changes made

@washort washort force-pushed the incremental-jobs-35 branch from 11be740 to 16f730a on March 12, 2018 14:32
@emtwo emtwo self-assigned this Mar 21, 2018
@emtwo emtwo left a comment:

Should we make the dropdown for number of query results to keep disabled when a refresh schedule isn't set?

I'm also wondering if maybe we can put some constraint on that value because there is no maximum for it at the moment.
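One possible way to add such a bound, sketched here with a hypothetical `MAX_RESULTSET_SIZE` constant and `normalize_resultset_size` helper that are not part of the PR (this also folds in the earlier point that 1 and blank should behave the same):

```python
MAX_RESULTSET_SIZE = 100  # hypothetical upper bound, not in the PR


def normalize_resultset_size(value):
    """Clamp a user-supplied 'results to keep' value.

    None or anything <= 1 means 'keep only the latest result';
    larger values are capped at MAX_RESULTSET_SIZE.
    """
    if value is None or value <= 1:
        return None
    return min(value, MAX_RESULTSET_SIZE)


assert normalize_resultset_size(None) is None
assert normalize_resultset_size(1) is None       # blank and 1 are equivalent
assert normalize_resultset_size(5) == 5
assert normalize_resultset_size(10**6) == MAX_RESULTSET_SIZE
```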

@@ -200,7 +200,7 @@ function preparePieData(seriesList, options) {
labels,
type: 'pie',
hole: 0.4,
marker: { colors: ColorPaletteArray },
emtwo:
huh. Did this change make it into this PR by accident?

washort (Author):
... that revision shouldn't be in this branch, you're right. Rebase gets things right most of the time, and I wasn't vigilant the last time I pushed this branch.


def downgrade():
op.drop_column(u'queries', 'schedule_keep_results')
emtwo:
Was schedule_keep_results an unused field before? I don't see any other references to it.

washort (Author):
This migration went into master before it should have -- the field was misnamed. We're going to roll back the misnamed field on stage before this PR gets merged.

# Synthesize a result set from the last N results.
total = len(query.query_results)
offset = max(total - query.schedule_resultset_size, 0)
results = [qr.to_dict() for qr in query.query_results[offset:offset + total]]
emtwo:
Wouldn't offset + total potentially index the query.query_results list outside of its length if offset > 0 since total = len(query.query_results)? Would [offset:] make more sense?

washort (Author):
Good catch.
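The off-by-slice concern can be checked with plain lists. This sketch (illustrative names, not the PR's actual code) shows why `[offset:]` is the safe form:

```python
def last_n_results(query_results, resultset_size):
    """Return the last `resultset_size` results, as the endpoint should."""
    total = len(query_results)
    offset = max(total - resultset_size, 0)
    # Use [offset:] rather than [offset:offset + total]; the latter's end
    # index exceeds the list length whenever offset > 0 (harmless for
    # Python lists, but it signals the wrong intent).
    return query_results[offset:]


assert last_n_results([1, 2, 3, 4, 5], 3) == [3, 4, 5]
assert last_n_results([1, 2], 3) == [1, 2]  # fewer results than requested
assert last_n_results([], 3) == []
```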

if not results:
aggregate_result = {}
else:
aggregate_result = results[0].copy()
emtwo:
I see why you copied the first result only here because you later replace the data attribute with the newly computed results above. Perhaps this can be less confusing to read with just a brief comment to explain this?
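A brief sketch of the copy-then-replace pattern being discussed, with illustrative dict fields rather than the PR's actual schema:

```python
def synthesize_aggregate(results):
    """Merge a list of result dicts into one synthetic result.

    The first result is copied to carry the shared metadata (query text,
    columns, ...); only its 'data' is then replaced with the concatenated
    rows of every result in the window.
    """
    if not results:
        return {}
    aggregate = results[0].copy()  # metadata template, not the final data
    aggregate['data'] = {
        'columns': results[0]['data']['columns'],
        'rows': [row for r in results for row in r['data']['rows']],
    }
    return aggregate


r1 = {'query': 'q', 'data': {'columns': ['n'], 'rows': [{'n': 1}]}}
r2 = {'query': 'q', 'data': {'columns': ['n'], 'rows': [{'n': 2}]}}
agg = synthesize_aggregate([r1, r2])
assert agg['data']['rows'] == [{'n': 1}, {'n': 2}]
assert agg['query'] == 'q'
assert synthesize_aggregate([]) == {}
```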

redash/models.py Outdated
def delete_stale_resultsets(cls):
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
# Multiple queries with the same text may request multiple result sets
emtwo:
I assume when you say queries with the same text you mean same SQL?

washort (Author):
Yes. (Though of course this is used for non-SQL data sources as well.)

redash/models.py Outdated
# be kept. We start with the one that keeps the most, and delete both
# the unneeded bridge rows and result sets.
first_query = queries.first()
if first_query is not None and queries[0].schedule_resultset_size:
@emtwo emtwo Mar 22, 2018:
New to sqlalchemy here, how is queries.first() different from queries[0]? Can we use the first_query variable throughout instead?

washort (Author):
From the SQLAlchemy docs (http://docs.sqlalchemy.org/en/rel_1_1/orm/query.html#sqlalchemy.orm.query.Query.first): "Return the first result of this Query or None if the result doesn’t contain any row."

You're right that it's a bit odd to then use queries[0] here; fixed.

redash/models.py Outdated
@classmethod
def delete_stale_resultsets(cls):
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
emtwo:
What if schedule_resultset_size was set by a user and we captured a bunch of results. Then it was unset by the user. Would this mean we wouldn't look at this stale data here?

washort (Author):
Correct. Those are handled before this is called, in cleanup_query_results.

@washort washort force-pushed the incremental-jobs-35 branch from ddd9e8f to 49e6c5d on March 23, 2018 18:47
washort (Author) commented Mar 23, 2018

Fixed a few of the things you mentioned.

redash/models.py Outdated
delete_count = 0
queries = Query.query.filter(Query.schedule_resultset_size != None).order_by(Query.schedule_resultset_size.desc())
# Multiple queries with the same text may request multiple result sets
# be kept. We start with the one that keeps the most, and delete both
emtwo:
Do we start with the one that keeps the most because we want to limit how much deleting is done in cleanup_query_results()? If so, maybe add a comment on that? I was confused at first about why we only look at the first query; the deleting limit in cleanup_query_results() is my best guess at why.

washort (Author):
We start with the one that keeps the most because if we start with any others it'd delete results that need to be kept.

redash/models.py Outdated
n_to_delete = resultset_count - first_query.schedule_resultset_size
r_ids = [r.result_id for r in resultsets][:n_to_delete]
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
emtwo:
Is there a reason why these deletes don't contribute to the delete_count but the QueryResultSet deletes do? It seems that in cleanup_query_results() originally only QueryResult deletes were being counted so maybe these need to be counted too?

washort (Author):
No idea how I made that mistake. Fixed.

redash/models.py Outdated
delete_count = QueryResultSet.query.filter(QueryResultSet.result_id.in_(r_ids)).delete(synchronize_session=False)
QueryResult.query.filter(QueryResult.id.in_(r_ids)).delete(synchronize_session=False)
# Delete unneeded bridge rows for the remaining queries.
for q in queries[1:]:
emtwo:
I'm not sure I follow what's happening in this loop. My understanding is that we look at the query with the largest allowable number of query results and delete its stale results and bridge rows, like you said. That is what happens in the code above.

Since other queries may have pointed to the stale results that were deleted, the bridge rows should be deleted for those too? That's what I would have expected this loop to be doing. But it seems to be checking whether the other queries have more stale results, similar to the check that was done for first_query?

I think in general there is some code in this loop that is similar to what is done for first_query, and I'm wondering if factoring some of it out into a separate function might make it clearer, or if it's possible to include the operations on first_query in this loop?

washort (Author):
By this point, when the loop starts, there are no stale result sets left. (I'll add a comment pointing this out.)

So all that has to be deleted are bridge rows for queries that have requested fewer resultsets be kept than the first one.
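The two-phase cleanup described above can be modeled without a database. In this sketch (hypothetical helper, with plain lists standing in for the tables), deleting for the query that keeps the most results first means no stale results remain afterwards, so the other queries only need their surplus bridge rows pruned:

```python
def delete_stale_resultsets(result_ids, keep_sizes):
    """Simulate cleanup for queries sharing one query text.

    result_ids: stored result ids, oldest first.
    keep_sizes: schedule_resultset_size per query.
    Returns (surviving result ids, bridge rows kept per query).
    """
    keep_sizes = sorted(keep_sizes, reverse=True)
    # Phase 1: the query keeping the most results decides which stored
    # results are stale; those results (and their bridge rows) go away.
    n_to_delete = max(len(result_ids) - keep_sizes[0], 0)
    surviving = result_ids[n_to_delete:]
    # Phase 2: no stale results remain, so each remaining query only
    # sheds bridge rows pointing beyond its own keep-count.
    bridges = [surviving[len(surviving) - min(k, len(surviving)):]
               for k in keep_sizes]
    return surviving, bridges


surviving, bridges = delete_stale_resultsets([1, 2, 3, 4, 5], [3, 2])
assert surviving == [3, 4, 5]           # largest keeper keeps the last 3
assert bridges == [[3, 4, 5], [4, 5]]   # smaller keeper bridges only last 2
```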

jezdez pushed a commit that referenced this pull request May 15, 2019
washort pushed a commit that referenced this pull request Jun 10, 2019
jezdez pushed a commit that referenced this pull request Jun 13, 2019
washort pushed a commit that referenced this pull request Jun 27, 2019
washort pushed a commit that referenced this pull request Jun 28, 2019
emtwo pushed a commit that referenced this pull request Jul 15, 2019
emtwo pushed a commit that referenced this pull request Jul 17, 2019
jezdez pushed a commit that referenced this pull request Aug 12, 2019
jezdez pushed a commit that referenced this pull request Aug 14, 2019
jezdez pushed a commit that referenced this pull request Aug 19, 2019
washort pushed a commit that referenced this pull request Sep 16, 2019
emtwo pushed a commit that referenced this pull request Nov 5, 2019
jezdez pushed a commit that referenced this pull request Jan 16, 2020

Successfully merging this pull request may close these issues.

re:dash should support incremental scheduled jobs
4 participants