Fix the invalid query to use an indexed key #638

Merged
merged 1 commit into e-mission:master from improve_scalability_by_99_pct on Mar 5, 2019

Conversation

shankari
Contributor

@shankari shankari commented Mar 1, 2019


This one-line fix improves performance by 99% on large databases :) It is also
an object lesson in the law of unintended consequences, so here's a
summary to use as a test case.

Some background:
- the timeseries interface allows users to specify a set of keys that they are querying
- the underlying mongodb implementation splits the keys into the
  `timeseries_db` which stores raw data, and `analysis_timeseries_db` which
  stores processed data
- these were originally in the same database, so when we split them, to make
  sure that we didn't inadvertently lose any data, we query both collections and
  merge the results
- in response to a query, instead of returning the full results in memory,
  mongodb returns a cursor that you can iterate over
- To ensure that we didn't have to read the entire result set into memory every
  time, we chained the cursors returned from the two queries (sketched more fully
  right after this list)
    e-mission@5367a01

    ```
    return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
    ```
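
For concreteness, here is a minimal sketch of that pattern. It is not the actual timeseries implementation: the helper names, the key-to-collection split, and the assumption that `timeseries_db` / `analysis_timeseries_db` are pymongo collections are all illustrative.

```python
import itertools

def _key_query(key_list):
    # If key_list is empty, no key clause is added, so the query matches every
    # entry in that collection -- exactly the failure mode in bug (1) below.
    return {"metadata.key": {"$in": key_list}} if key_list else {}

def find_entries(timeseries_db, analysis_timeseries_db, user_id, key_list):
    # Illustrative split only: raw keys go to the raw collection, keys under
    # "analysis/" go to the analysis collection.
    raw_keys = [k for k in key_list if not k.startswith("analysis/")]
    analysis_keys = [k for k in key_list if k.startswith("analysis/")]

    orig_ts_db_result = timeseries_db.find(
        {"user_id": user_id, **_key_query(raw_keys)})
    analysis_ts_db_result = analysis_timeseries_db.find(
        {"user_id": user_id, **_key_query(analysis_keys)})

    # Both find() calls return cursors, so chaining them keeps memory usage
    # flat no matter how large the result sets are.
    return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
```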

So far so good. But then we ran into a series of bugs whose fixes built
on each other.

1. If the entries are only in one database, the other database is queried with
   an empty array for the key, which returns all values
   (Decide on proper behavior of TimeSeries.find_entries(): e-mission/e-mission-docs#168)
    - so we added a check: if no keys are queried, we return an empty
      iterator that can be chained
      (e-mission@b7f835a)
2. But then, the empty iterator is not a cursor, so we can't display the counts
   returned from each database
   (Extend usercache load/store to the multi-user timelines as well: e-mission#599)
    - we fixed this by performing an invalid query so that we get an empty cursor
      (e-mission@14aa503); see the sketch after this list
    - This was purely a nice-to-have, and the PR even says that the changes to
      enable it can be reverted if needed.
    - But the changes were correct and the tests passed, so we retained them
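
To make the workaround and the fix concrete, here is a sketch. `INVALID_QUERY` and its old value `{"1": "2"}` come from the discussion above; the replacement value shown is illustrative, since the point is only that the queried field (`metadata.key`) is indexed and the value can never match a real key.

```python
# Before this PR: "1" is not an indexed field, so MongoDB scans the whole
# collection just to conclude that nothing matches.
INVALID_QUERY = {"1": "2"}

# After this PR: metadata.key is indexed, so MongoDB can answer "no matches"
# from the index alone. (Value is illustrative; anything that never occurs as
# a real metadata.key gives the same empty result.)
INVALID_QUERY = {"metadata.key": "invalid"}

def get_empty_cursor(collection):
    # Unlike iter([]), this is a real pymongo Cursor, so callers can still ask
    # it for counts and chain it with the other collection's cursor.
    return collection.find(INVALID_QUERY)
```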

However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an
index in the database on the key "1". So as the database size grew, mongodb was
taking 45 seconds to iterate over the records and determine that there were no "1"s.

Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves
performance from 2 mins 6 secs to 150 ms.
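
This is not part of the PR, but a standard way to see the difference is to compare the query plans with `explain()`. The connection and collection names below are illustrative.

```python
import pprint
import pymongo

# Connection and collection names are illustrative; point this at the actual
# timeseries collection.
client = pymongo.MongoClient("localhost", 27017)
timeseries_db = client.Stage_database.Stage_timeseries

# The unindexed filter should plan a COLLSCAN over the whole collection; the
# indexed filter should show an IXSCAN on the metadata.key index in its plan.
slow_plan = timeseries_db.find({"1": "2"}).explain()
fast_plan = timeseries_db.find({"metadata.key": "invalid"}).explain()

pprint.pprint(slow_plan["queryPlanner"]["winningPlan"])
pprint.pprint(fast_plan["queryPlanner"]["winningPlan"])
```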

e-mission/e-mission-docs#261 (comment)
to
e-mission/e-mission-docs#261 (comment)
@shankari
Contributor Author

shankari commented Mar 1, 2019

@jf87 do you want to do the review?

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/e-mission-server-prb/1083/
Test FAILed.

@shankari
Contributor Author

shankari commented Mar 1, 2019

jenkins, test this please

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/e-mission-server-prb/1084/
Test PASSed.

@shankari
Contributor Author

shankari commented Mar 1, 2019

@jf87 will wait for your review until tonight Pacific time and will merge after that

@jf87

jf87 commented Mar 5, 2019

@shankari Again, sorry for the delay. And again thanks for investigating this.
Your explanations and code change make sense. I just applied it to my running server and things seem to be much snappier :-)
I can try and think if there is a less hacky way than the invalid query when I get more time to look at this, but for now I am super happy.

@shankari
Contributor Author

shankari commented Mar 5, 2019

> I can try and think if there is a less hacky way than the invalid query when I get more time to look at this...

I think the key is the decision for the timeseries interface below:

> the timeseries interface allows users to specify a set of keys that they are querying

Now that I have used the timeseries interface for a bit, I don't think that I have used the 'set of keys' property that much. If we changed the interface and allowed users to specify one key at a time, we would not need to merge anything. And of course, if users did want to get a 'set of keys', they could perform n queries, one for each key.
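
A hypothetical sketch of that single-key interface (function names and the key-to-collection mapping are made up for illustration):

```python
import itertools

def _collection_for_key(key, timeseries_db, analysis_timeseries_db):
    # Illustrative mapping only; the real key -> collection mapping lives in
    # the timeseries implementation.
    return analysis_timeseries_db if key.startswith("analysis/") else timeseries_db

def find_entries_for_key(user_id, key, timeseries_db, analysis_timeseries_db):
    # Each key lives in exactly one collection, so there is nothing to merge.
    coll = _collection_for_key(key, timeseries_db, analysis_timeseries_db)
    return coll.find({"user_id": user_id, "metadata.key": key})

def find_entries_for_keys(user_id, key_list, timeseries_db, analysis_timeseries_db):
    # Callers that do want a set of keys can chain n single-key cursors.
    return itertools.chain.from_iterable(
        find_entries_for_key(user_id, k, timeseries_db, analysis_timeseries_db)
        for k in key_list)
```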

shankari merged commit 7df0bb7 into e-mission:master on Mar 5, 2019
shankari deleted the improve_scalability_by_99_pct branch on December 2, 2020
jf87 pushed a commit to jf87/e-mission-server that referenced this pull request on Jun 21, 2021: Fix the invalid query to use an indexed key