-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the invalid query to use an indexed key #638
Fix the invalid query to use an indexed key #638
Conversation
This one line fix improves performance by 99% on large databases :) And it is also an object lesson in the law of unintended consequences, so here's a summary to use as a test case. Some background: - the timeseries interface allows users to specify a set of keys that they are querying - the underlying mongodb implementation splits the keys into the `timeseries_db` which stores raw data, and `analysis_timeseries_db` which stores processed data - these were originally in the same database. so when we split them, to make sure that we don't lose any data inadvertently, we query both collections and merge the results - in response to a query, instead of returning the full results in memory, mongodb returns a cursor that you can iterate over - To ensure that we didn't have to read the entire results into memory every time, we chained the values returned from the two queries e-mission@5367a01 ``` return itertools.chain(orig_ts_db_result, analysis_ts_db_result) ``` so far so good. but then we ran into a series of bugs that we fixed by building on each other. 1. If the entries are only in one database, the other database is queried with an empty array for the key, which returns all values (e-mission/e-mission-docs#168) - so we added a check - if there are no keys queried, we return an empty iterator that can be chained e-mission@b7f835a 1. But then, the empty iterator is not a cursor, so we can't display counts returned from each database (e-mission#599) - we fix this by performing an invalid query so that we get an empty cursor (e-mission@14aa503) - This is purely a nice to have, and the PR even says that the changes to enable it can be reverted if needed. - But the changes were correct, and the tests passed so we retained them However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an index in the database on the key "1". So as the database size grew, mongodb was taking 45 seconds to iterate over record and determine that there were no "1"s. Switching from "1" -> "metadata.key", which *is* indexed, dramatically improves performance from 2 mins 6 secs to 150 ms. e-mission/e-mission-docs#261 (comment) to e-mission/e-mission-docs#261 (comment)
@jf87 do you want to do the review? |
Merged build finished. Test FAILed. |
Test FAILed. |
jenkins, test this please |
Merged build finished. Test PASSed. |
Test PASSed. |
@jf87 will wait for your review for until tonight Pacific time and will merge after that |
@shankari Again, sorry for the delay. And again thanks for investigating this. |
I think the key is the decision for the timeseries interface below
Now that I have used the timeseries interface for a bit, I don't think that I have used the 'set of keys' property that much. If we changed the interface and allowed users to specify one key at a time, we would not need to merge anything. And of course, if users did want to get a 'set of keys', they could perform |
…_99_pct Fix the invalid query to use an indexed key
This one line fix improves performance by 99% on large databases :) And it is
also an object lesson in the law of unintended consequences, so here's a
summary to use as a test case.
Some background:
the timeseries interface allows users to specify a set of keys that they are querying
the underlying mongodb implementation splits the keys into the
timeseries_db
which stores raw data, andanalysis_timeseries_db
whichstores processed data
these were originally in the same database. so when we split them, to make
sure that we don't lose any data inadvertently, we query both collections and
merge the results
in response to a query, instead of returning the full results in memory,
mongodb returns a cursor that you can iterate over
To ensure that we didn't have to read the entire results into memory every
time, we chained the values returned from the two queries
5367a01
so far so good. but then we ran into a series of bugs that we fixed by building
on each other.
an empty array for the key, which returns all values
(Decide on proper behavior of TimeSeries.find_entries() e-mission-docs#168)
iterator that can be chained
b7f835a
returned from each database (Extend usercache load/store to the multi-user timelines as well #599)
enable it can be reverted if needed.
However, the INVALID_QUERY that we used was {"1": "2"}, and we do not have an
index in the database on the key "1". So as the database size grew, mongodb was
taking 45 seconds to iterate over the records and determine that there were no "1"s.
Switching from "1" -> "metadata.key", which is indexed, dramatically improves
performance from 2 mins 6 secs to 150 ms.
e-mission/e-mission-docs#261 (comment)
to
e-mission/e-mission-docs#261 (comment)