Fix the invalid query to use an indexed key #638

Merged
merged 1 commit into e-mission:master from improve_scalability_by_99_pct on Mar 5, 2019

Conversation

shankari
Contributor

@shankari shankari commented Mar 1, 2019


This one-line fix improves performance by 99% on large databases :) It is also
an object lesson in the law of unintended consequences, so here's a
summary to use as a test case.

Some background:
- the timeseries interface allows users to specify a set of keys that they are querying
- the underlying mongodb implementation splits the keys into the
  `timeseries_db` which stores raw data, and `analysis_timeseries_db` which
  stores processed data
- these were originally in the same database, so when we split them, to make
  sure that we didn't inadvertently lose any data, we query both collections and
  merge the results
- in response to a query, instead of returning the full results in memory,
  mongodb returns a cursor that you can iterate over
- To ensure that we didn't have to read the entire result set into memory every
  time, we chained the cursors returned from the two queries (sketched more fully
  right after this list)
    e-mission@5367a01

    ```
    return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
    ```
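
For concreteness, here is a minimal sketch of that pattern. It is not the actual timeseries implementation: the helper names, the key-to-collection split, and the assumption that `timeseries_db` / `analysis_timeseries_db` are pymongo collections are all illustrative.

```python
import itertools

def _key_query(key_list):
    # If key_list is empty, no key clause is added, so the query matches every
    # entry in that collection -- exactly the failure mode in bug (1) below.
    return {"metadata.key": {"$in": key_list}} if key_list else {}

def find_entries(timeseries_db, analysis_timeseries_db, user_id, key_list):
    # Illustrative split only: raw keys go to the raw collection, keys under
    # "analysis/" go to the analysis collection.
    raw_keys = [k for k in key_list if not k.startswith("analysis/")]
    analysis_keys = [k for k in key_list if k.startswith("analysis/")]

    orig_ts_db_result = timeseries_db.find(
        {"user_id": user_id, **_key_query(raw_keys)})
    analysis_ts_db_result = analysis_timeseries_db.find(
        {"user_id": user_id, **_key_query(analysis_keys)})

    # Both find() calls return cursors, so chaining them keeps memory usage
    # flat no matter how large the result sets are.
    return itertools.chain(orig_ts_db_result, analysis_ts_db_result)
```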

So far so good. But then we ran into a series of bugs whose fixes built
on each other.

1. If the entries are only in one database, the other database is queried with
   an empty array for the key, which returns all values
   (Decide on proper behavior of TimeSeries.find_entries(): e-mission/e-mission-docs#168)
    - so we added a check: if no keys are queried, we return an empty
      iterator that can be chained
      (e-mission@b7f835a)
2. But then, the empty iterator is not a cursor, so we can't display the counts
   returned from each database
   (Extend usercache load/store to the multi-user timelines as well: e-mission#599)
    - we fixed this by performing an invalid query so that we get an empty cursor
      (e-mission@14aa503); see the sketch after this list
    - This was purely a nice-to-have, and the PR even says that the changes to
      enable it can be reverted if needed.
    - But the changes were correct and the tests passed, so we retained them
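
To make the workaround and the fix concrete, here is a sketch. `INVALID_QUERY` and its old value `{"1": "2"}` come from the discussion above; the replacement value shown is illustrative, since the point is only that the queried field (`metadata.key`) is indexed and the value can never match a real key.

```python
# Before this PR: "1" is not an indexed field, so MongoDB scans the whole
# collection just to conclude that nothing matches.
INVALID_QUERY = {"1": "2"}

# After this PR: metadata.key is indexed, so MongoDB can answer "no matches"
# from the index alone. (Value is illustrative; anything that never occurs as
# a real metadata.key gives the same empty result.)
INVALID_QUERY = {"metadata.key": "invalid"}

def get_empty_cursor(collection):
    # Unlike iter([]), this is a real pymongo Cursor, so callers can still ask
    # it for counts and chain it with the other collection's cursor.
    return collection.find(INVALID_QUERY)
```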

However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an
index in the database on the key "1". So as the database size grew, mongodb was
taking 45 seconds to iterate over the records and determine that there were no "1"s.

Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves
performance from 2 mins 6 secs to 150 ms.
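
This is not part of the PR, but a standard way to see the difference is to compare the query plans with `explain()`. The connection and collection names below are illustrative.

```python
import pprint
import pymongo

# Connection and collection names are illustrative; point this at the actual
# timeseries collection.
client = pymongo.MongoClient("localhost", 27017)
timeseries_db = client.Stage_database.Stage_timeseries

# The unindexed filter should plan a COLLSCAN over the whole collection; the
# indexed filter should show an IXSCAN on the metadata.key index in its plan.
slow_plan = timeseries_db.find({"1": "2"}).explain()
fast_plan = timeseries_db.find({"metadata.key": "invalid"}).explain()

pprint.pprint(slow_plan["queryPlanner"]["winningPlan"])
pprint.pprint(fast_plan["queryPlanner"]["winningPlan"])
```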

e-mission/e-mission-docs#261 (comment)
to
e-mission/e-mission-docs#261 (comment)
@shankari
Contributor Author

shankari commented Mar 1, 2019

@jf87 do you want to do the review?

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/e-mission-server-prb/1083/
Test FAILed.

@shankari
Contributor Author

shankari commented Mar 1, 2019

jenkins, test this please

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/e-mission-server-prb/1084/
Test PASSed.

@shankari
Contributor Author

shankari commented Mar 1, 2019

@jf87 will wait for your review until tonight Pacific time and will merge after that

@jf87

jf87 commented Mar 5, 2019

@shankari Again, sorry for the delay. And again thanks for investigating this.
Your explanations and code change make sense. I just applied it to my running server and things seem to be much snappier :-)
I can try and think if there is a less hacky way than the invalid query when I get more time to look at this, but for now I am super happy.

@shankari
Contributor Author

shankari commented Mar 5, 2019

> I can try and think if there is a less hacky way than the invalid query when I get more time to look at this...

I think the key is the decision for the timeseries interface below:

> the timeseries interface allows users to specify a set of keys that they are querying

Now that I have used the timeseries interface for a bit, I don't think that I have used the 'set of keys' property that much. If we changed the interface and allowed users to specify one key at a time, we would not need to merge anything. And of course, if users did want to get a 'set of keys', they could perform n queries, one for each key.
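
A hypothetical sketch of that single-key interface (function names and the key-to-collection mapping are made up for illustration):

```python
import itertools

def _collection_for_key(key, timeseries_db, analysis_timeseries_db):
    # Illustrative mapping only; the real key -> collection mapping lives in
    # the timeseries implementation.
    return analysis_timeseries_db if key.startswith("analysis/") else timeseries_db

def find_entries_for_key(user_id, key, timeseries_db, analysis_timeseries_db):
    # Each key lives in exactly one collection, so there is nothing to merge.
    coll = _collection_for_key(key, timeseries_db, analysis_timeseries_db)
    return coll.find({"user_id": user_id, "metadata.key": key})

def find_entries_for_keys(user_id, key_list, timeseries_db, analysis_timeseries_db):
    # Callers that do want a set of keys can chain n single-key cursors.
    return itertools.chain.from_iterable(
        find_entries_for_key(user_id, k, timeseries_db, analysis_timeseries_db)
        for k in key_list)
```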

shankari merged commit 7df0bb7 into e-mission:master on Mar 5, 2019
shankari deleted the improve_scalability_by_99_pct branch on December 2, 2020
jf87 pushed a commit to jf87/e-mission-server that referenced this pull request on Jun 21, 2021: Fix the invalid query to use an indexed key