
DM-47375: Run query_all_datasets in a single request for RemoteButler #1114

Merged: 12 commits merged into main from tickets/DM-47375 on Nov 14, 2024

Conversation

@dhirving (Contributor) commented Nov 4, 2024

Added a server-side endpoint to handle query_all_datasets in a single request. query_all_datasets can potentially involve hundreds or thousands of separate dataset queries, and we don't want clients slamming the server with that many HTTP requests.

The new endpoint streams results in the same manner as the existing query endpoints used by QueryDriver, but it is separate from the Query/QueryDriver framework.
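
For context, a minimal sketch of the streaming pattern described above, assuming a FastAPI server and newline-delimited JSON pages; the route path, handler names, and page source below are illustrative, not the PR's actual code:

from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def _result_pages() -> AsyncIterator[str]:
    # Stand-in page source: the real server would pull each page from one
    # of the many per-dataset-type queries.
    for page_number in range(3):
        yield f'{{"page": {page_number}, "datasets": []}}\n'

@app.get("/query_all_datasets")  # illustrative route, not the PR's actual path
async def query_all_datasets_endpoint() -> StreamingResponse:
    # Stream newline-delimited JSON pages as they are produced, so the client
    # can start consuming results before the full set of queries has finished.
    return StreamingResponse(_result_pages(), media_type="application/x-ndjson")
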

This is not yet used in the CLI tools and Butler._query_all_datasets is still private -- we need to deploy an updated server with this change before we can release the client side.

--order-by in the CLI tools is now restricted to queries for a single dataset type -- future implementations of query_all_datasets may not support it.
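
For illustration, the restriction amounts to a guard along these lines in the CLI layer (hypothetical helper; the PR's actual wiring differs):

def _check_order_by(dataset_types: list[str], order_by: tuple[str, ...]) -> None:
    # Hypothetical guard: ordering across heterogeneous dataset types is not
    # supported, so --order-by requires exactly one dataset type.
    if order_by and len(dataset_types) != 1:
        raise ValueError("--order-by requires a query for exactly one dataset type.")
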

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

@dhirving changed the base branch from main to tickets/DM-45873 on November 4, 2024 at 23:15
codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 98.88060% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.44%. Comparing base (3672820) to head (3ffbb4b).
Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
python/lsst/daf/butler/_query_all_datasets.py 92.85% 1 Missing and 1 partial ⚠️
...on/lsst/daf/butler/remote_butler/_query_results.py 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1114      +/-   ##
==========================================
+ Coverage   89.42%   89.44%   +0.02%     
==========================================
  Files         363      366       +3     
  Lines       48444    48604     +160     
  Branches     5879     5890      +11     
==========================================
+ Hits        43319    43475     +156     
- Misses       3716     3717       +1     
- Partials     1409     1412       +3     


@dhirving force-pushed the tickets/DM-47375 branch 3 times, most recently from 03351be to 79c520d, on November 5, 2024 at 22:54
Base automatically changed from tickets/DM-45873 to main November 8, 2024 23:31
@dhirving marked this pull request as ready for review on November 12, 2024 at 22:22
@TallJimbo (Member) left a comment

Looks good.

I'm a little worried that some of this will need refactoring when we add planned query interfaces like counts of datasets by type, but that may reflect my own tendency to prototype everything before implementing anything, rather than being a concrete concern.

I'm glad you kept the restoration of dimension records in query_all_datasets private, because I suspect what transferDatasets really wants is some sort of normalized batch of all dimension records relevant for a bunch of datasets, not actual expanded data IDs; that's also what QG generation really wants, and eventually I hope we can provide it rather than creating duplicate copies of a lot of those dimension records by denormalizing them into the result rows.

if warn_limit and limit is not None and datasets_found >= limit:
# We asked for one too many so must remove that from
# the list.
refs = refs[0:-1]

Member:

This looks like an O(N) way to do a deletion from a list, while del refs[-1] would be O(1). Is there some reason the original list shouldn't be modified?

@dhirving (Contributor, Author):

That line is unchanged from the previous revision. The full history is that Tim had it as .pop() originally, which would be O(1). But when I pulled out query_all_datasets, that introduced a bug: query_all_datasets is a coroutine, and this function was mutating the list before query_all_datasets was done with it. Since this list is at most a few hundred thousand items and is sandwiched between network I/O and console I/O, I thought it was better to prevent similar bugs in the future by avoiding the mutation than to worry about ~1ms in a CLI command that takes 2 seconds to even start up.
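
As a minimal illustration of that bug class (hypothetical names; the real interaction is between the CLI code and the query_all_datasets coroutine):

from collections.abc import Iterator

def paged_results(refs: list[str]) -> Iterator[list[str]]:
    # Stands in for the query_all_datasets coroutine: it reads from `refs`
    # across multiple resumptions.
    yield refs[:2]
    yield refs[2:]

refs = ["a", "b", "c", "d"]
pages = paged_results(refs)
print(next(pages))  # ['a', 'b']
refs.pop()          # mutating the shared list mid-iteration...
print(next(pages))  # ...silently changes the second page to ['c']

# Rebinding a copy instead (refs = refs[0:-1]) leaves the generator's view of
# the original list intact, at the cost of an O(N) copy.
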

"""


class _StreamQueryDriverExecute(StreamingQuery):

Member:

I think you might need to repeat the [_TContext] generic parameterization here to avoid having it implicitly interpreted as StreamingQuery[Any], but I'm not sure.
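
For reference, the typing behavior being described, using stand-in classes rather than the real StreamingQuery:

from typing import Generic, TypeVar

_TContext = TypeVar("_TContext")

class StreamingQuery(Generic[_TContext]):  # stand-in for the real base class
    async def execute(self, context: _TContext) -> None: ...

class ImplicitlyAny(StreamingQuery):
    # Without a parameter list, type checkers treat this as StreamingQuery[Any].
    pass

class StillGeneric(StreamingQuery[_TContext]):
    # Repeating the parameterization keeps the subclass generic in _TContext.
    pass
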

@@ -27,6 +27,8 @@

from __future__ import annotations

from lsst.daf.butler.remote_butler.server.handlers._utils import set_default_data_id

Member:

This import should move down into the main section (I bet VSCode added it for you).

I'm also curious why this and many other imports here are absolute rather than relative.

@dhirving (Contributor, Author):

I think at the time they were added, there was a theory that the Butler server code was supposed to move into a separate repository at some point. Made them all relative since I don't think we want to do that anymore.
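
Concretely, for the import flagged above, the two spellings are equivalent only from a module inside the same handlers package:

# Inside a module in python/lsst/daf/butler/remote_butler/server/handlers/:
from lsst.daf.butler.remote_butler.server.handlers._utils import set_default_data_id  # absolute
from ._utils import set_default_data_id  # relative, within the same package
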

Commit messages:

  • Switch from the query_datasets convenience method to the advanced query system in query_all_datasets. This lets us get the results one page at a time, which will be needed to prevent memory exhaustion when running these queries on the server.
  • It turns out that the query-datasets CLI was not actually using dimension records, and it will simplify the implementation to not support this.
  • The backend for querying multiple dataset types will not support "order by", so restrict the CLI to match the implementation.
  • The upcoming implementation of query_all_datasets will not support order_by, so remove it. This requires modifying the query-datasets CLI to use the single dataset type query_datasets when order by needs to be supported.
  • In preparation for implementing query_all_datasets on the server, make the streaming response and timeout logic from the existing query handler re-usable.
  • After the refactor in the previous commit, this is somewhat independent of the query routes.
  • This will be shared by the RemoteButler query_all_datasets implementation in an upcoming commit.
  • This will be used in an upcoming commit to prevent excessive duplication of function parameters between implementations of query_all_datasets.
  • query_all_datasets can potentially involve hundreds or thousands of separate dataset queries. We don't want clients slamming the server with that many HTTP requests, so add a server-side endpoint that can handle these queries in a single request.
  • It turns out the QueryDatasets class is shared by multiple CLI scripts, some of which need dimension records included. So add back `with_dimension_records` to the internal implementation of query_all_datasets.
@dhirving merged commit 53005fb into main on Nov 14, 2024
19 checks passed
@dhirving deleted the tickets/DM-47375 branch on November 14, 2024 at 20:08