DM-47375: Run query_all_datasets in a single request for RemoteButler #1114
Conversation
Codecov Report

@@           Coverage Diff            @@
##             main    #1114    +/-   ##
==========================================
+ Coverage   89.42%   89.44%   +0.02%
==========================================
  Files         363      366       +3
  Lines       48444    48604     +160
  Branches     5879     5890      +11
==========================================
+ Hits        43319    43475     +156
- Misses       3716     3717       +1
- Partials     1409     1412       +3
Looks good.
I'm a little worried some of this will need refactoring when we add planned query interfaces like counts of datasets by type, but that may reflect my own tendency to prototype everything before implementing anything, rather than being a concrete concern.

I'm glad you kept the restoration of dimension records in `query_all_datasets` private, because I suspect what `transferDatasets` really wants is some sort of normalized batch of all the dimension records relevant to a bunch of datasets, not actual expanded data IDs. That's also what QG generation really wants, and eventually I hope we can provide it rather than creating duplicate copies of many of those dimension records by denormalizing them into the result rows.
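As a rough sketch of that distinction (the types and helper below are hypothetical illustrations, not the actual Butler interfaces): a denormalized result repeats each dimension record in every row that uses it, while a normalized batch stores each record once and lets the rows refer to it by key.

```python
from collections import defaultdict
from typing import Any

# Hypothetical stand-ins for illustration only.
DatasetRow = dict[str, Any]       # one result row: dataset ID plus data ID values
DimensionRecord = dict[str, Any]  # one record, e.g. a single visit or detector row


def normalize_records(
    rows: list[DatasetRow],
    records_for_row,  # callable yielding (element, key, record) triples for a row
) -> dict[str, dict[Any, DimensionRecord]]:
    """Collect one shared batch of dimension records for a set of result rows,
    instead of expanding (duplicating) the records into every row."""
    batch: dict[str, dict[Any, DimensionRecord]] = defaultdict(dict)
    for row in rows:
        for element, key, record in records_for_row(row):
            # Stored once per (element, key), no matter how many rows use it.
            batch[element][key] = record
    return dict(batch)
```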
if warn_limit and limit is not None and datasets_found >= limit:
    # We asked for one too many so must remove that from
    # the list.
    refs = refs[0:-1]
This looks like an O(N) way to do a deletion from a list, while `del refs[-1]` would be O(1). Is there some reason the original list shouldn't be modified?
That line is unchanged from the previous revision. The full history of it is that Tim had it as `.pop()` originally, which would be O(1). But when I pulled out `query_all_datasets`, that introduced a bug, because `query_all_datasets` is a coroutine and this function was mutating the list before `query_all_datasets` was done with it. Since this list is at most a few hundred thousand items and is sandwiched between network I/O and console I/O, I thought it was better to prevent similar bugs in the future by avoiding the mutation than to worry about a millisecond or so in a CLI command that takes 2 seconds to even start up.
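A toy illustration of the hazard (not the actual Butler code): a generator that is still iterating over a list sees in-place mutations, whereas rebinding the name to a sliced copy leaves the list the generator holds untouched.

```python
def produce(refs: list[int]):
    # Stand-in for a lazy producer still reading `refs` after it has started.
    for ref in refs:
        yield ref


refs = [1, 2, 3]
pages = produce(refs)
next(pages)           # the generator has started but is not finished

refs.pop()            # in-place mutation: the generator now sees [1, 2]
print(list(pages))    # [2] -- the last element was silently dropped

refs = [1, 2, 3]
pages = produce(refs)
next(pages)
refs = refs[0:-1]     # rebind to a copy: the generator still sees [1, 2, 3]
print(list(pages))    # [2, 3]
```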
""" | ||
|
||
|
||
class _StreamQueryDriverExecute(StreamingQuery): |
I think you might need to repeat the `[_TContext]` generic parameterization here to avoid having it implicitly interpreted as `StreamingQuery[Any]`, but I'm not sure.
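A minimal sketch of the concern, using a stand-in class rather than the real `StreamingQuery`: a subclass that inherits from a generic base without a type argument is treated by type checkers as deriving from `StreamingQuery[Any]`.

```python
from typing import Generic, TypeVar

_TContext = TypeVar("_TContext")


class StreamingQuery(Generic[_TContext]):
    def execute(self, context: _TContext) -> None: ...


# Treated as StreamingQuery[Any]: `context` is no longer type-checked.
class _ExecuteWithoutParameter(StreamingQuery):
    pass


# Repeating the parameter keeps the subclass generic in _TContext.
class _ExecuteWithParameter(StreamingQuery[_TContext]):
    pass
```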
@@ -27,6 +27,8 @@

from __future__ import annotations

from lsst.daf.butler.remote_butler.server.handlers._utils import set_default_data_id
This import should move down into the main section (I bet VSCode added it for you).
I'm also curious why this and many other imports here are absolute rather than relative.
I think at the time they were added, there was a theory that the Butler server code was supposed to move into a separate repository at some point. Made them all relative since I don't think we want to do that anymore.
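For reference, the two forms look like this (assuming the importing module lives alongside `_utils` in the same handlers package):

```python
# Absolute: spells out the full package path.
from lsst.daf.butler.remote_butler.server.handlers._utils import set_default_data_id

# Relative: resolved against the importing module's own package, so it keeps
# working if the server code is ever relocated or the package is renamed.
from ._utils import set_default_data_id
```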
Switch from the query_datasets convenience method to the advanced query system in query_all_datasets. This lets us get the results one page at a time, which will be needed to prevent memory exhaustion when running these queries on the server.
It turns out that the query-datasets CLI was not actually using dimension records, and it will simplify the implementation to not support this.
The backend for querying multiple dataset types will not support "order by", so restrict the CLI to match the implementation.
The upcoming implementation of query_all_datasets will not support order_by, so remove it. This requires modifying the query-datasets CLI to use the single-dataset-type query_datasets when order-by needs to be supported.
In preparation for implementing query_all_datasets on the server, make the streaming response and timeout logic from the existing query handler re-usable.
After the refactor in the previous commit, this is somewhat independent of the query routes.
This will be shared by the RemoteButler query_all_datasets implementation in an upcoming commit.
This will be used in an upcoming commit to prevent excessive duplication of function parameters between implementations of query_all_datasets.
query_all_datasets can potentially involve hundreds or thousands of separate dataset queries. We don't want clients slamming the server with that many HTTP requests, so add a server-side endpoint that can handle these queries in a single request.
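A rough sketch of the shape such an endpoint can take, streaming newline-delimited JSON from a FastAPI route; the path, request model, and page contents here are illustrative stand-ins, not the actual Butler server API.

```python
import json
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class QueryAllDatasetsRequest(BaseModel):
    # Illustrative request body: one HTTP call carries the whole search,
    # instead of one call per dataset type.
    collections: list[str]
    dataset_types: list[str]


async def _result_pages(request: QueryAllDatasetsRequest) -> AsyncIterator[str]:
    # Placeholder for the real query execution: each page is emitted as a
    # single JSON line so the client can consume results incrementally.
    for dataset_type in request.dataset_types:
        page = {"dataset_type": dataset_type, "datasets": []}
        yield json.dumps(page) + "\n"


@app.post("/query/all_datasets")
def query_all_datasets(request: QueryAllDatasetsRequest) -> StreamingResponse:
    return StreamingResponse(_result_pages(request), media_type="application/x-ndjson")
```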
It turns out the QueryDatasets class is shared by multiple CLI scripts, some of which need dimension records included. So add back `with_dimension_records` to the internal implementation of query_all_datasets.
Added a server-side endpoint to handle `query_all_datasets` in a single request. `query_all_datasets` can potentially involve hundreds or thousands of separate dataset queries, and we don't want clients slamming the server with that many HTTP requests.

The new endpoint streams results in the same manner as the existing query endpoints used by `QueryDriver`, but it is separate from the `Query`/`QueryDriver` framework.

This is not yet used in the CLI tools and `Butler._query_all_datasets` is still private -- we need to deploy an updated server with this change before we can release the client side.

`--order-by` in the CLI tools is now restricted to queries for a single dataset type -- future implementations of `query_all_datasets` may not support it.

Checklist
doc/changes
configs/old_dimensions