
DM-47375: Run query_all_datasets in a single request for RemoteButler #1114

Merged: 12 commits merged into main from tickets/DM-47375 on Nov 14, 2024

Conversation

@dhirving (Contributor) commented Nov 4, 2024

Added a server-side endpoint to handle query_all_datasets in a single request. query_all_datasets can potentially involve hundreds or thousands of separate dataset queries, and we don't want clients slamming the server with that many HTTP requests.

The new endpoint streams results in the same manner as the existing query endpoints used by QueryDriver, but it is separate from the Query/QueryDriver framework.
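
For context, a minimal sketch of the streaming pattern described above, assuming a FastAPI server and newline-delimited JSON pages; the route path, handler names, and page source below are illustrative, not the PR's actual code:

from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def _result_pages() -> AsyncIterator[str]:
    # Stand-in page source: the real server would pull each page from one
    # of the many per-dataset-type queries.
    for page_number in range(3):
        yield f'{{"page": {page_number}, "datasets": []}}\n'

@app.get("/query_all_datasets")  # illustrative route, not the PR's actual path
async def query_all_datasets_endpoint() -> StreamingResponse:
    # Stream newline-delimited JSON pages as they are produced, so the client
    # can start consuming results before the full set of queries has finished.
    return StreamingResponse(_result_pages(), media_type="application/x-ndjson")
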

This is not yet used in the CLI tools and Butler._query_all_datasets is still private -- we need to deploy an updated server with this change before we can release the client side.

--order-by in the CLI tools is now restricted to queries for a single dataset type -- future implementations of query_all_datasets may not support it.
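
For illustration, the restriction amounts to a guard along these lines in the CLI layer (hypothetical helper; the PR's actual wiring differs):

def _check_order_by(dataset_types: list[str], order_by: tuple[str, ...]) -> None:
    # Hypothetical guard: ordering across heterogeneous dataset types is not
    # supported, so --order-by requires exactly one dataset type.
    if order_by and len(dataset_types) != 1:
        raise ValueError("--order-by requires a query for exactly one dataset type.")
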

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

@dhirving changed the base branch from main to tickets/DM-45873 on November 4, 2024 at 23:15
codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 98.88060% with 3 lines in your changes missing coverage. Please review.

Project coverage is 89.44%. Comparing base (3672820) to head (3ffbb4b).
Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
python/lsst/daf/butler/_query_all_datasets.py 92.85% 1 Missing and 1 partial ⚠️
...on/lsst/daf/butler/remote_butler/_query_results.py 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1114      +/-   ##
==========================================
+ Coverage   89.42%   89.44%   +0.02%     
==========================================
  Files         363      366       +3     
  Lines       48444    48604     +160     
  Branches     5879     5890      +11     
==========================================
+ Hits        43319    43475     +156     
- Misses       3716     3717       +1     
- Partials     1409     1412       +3     


@dhirving force-pushed the tickets/DM-47375 branch 3 times, most recently from 03351be to 79c520d, on November 5, 2024 at 22:54
Base automatically changed from tickets/DM-45873 to main November 8, 2024 23:31
@dhirving marked this pull request as ready for review on November 12, 2024 at 22:22
@TallJimbo (Member) left a comment

Looks good.

I'm a little worried that some of this will need refactoring when we add planned query interfaces like counts of datasets by type, but that may reflect my own tendency to prototype everything before implementing anything, rather than being a concrete concern.

I'm glad you kept the restoration of dimension records in query_all_datasets private, because I suspect what transferDatasets really wants is some sort of normalized batch of all dimension records relevant for a bunch of datasets, not actual expanded data IDs; that's also what QG generation really wants, and eventually I hope we can provide it rather than creating duplicate copies of a lot of those dimension records by denormalizing them into the result rows.

if warn_limit and limit is not None and datasets_found >= limit:
# We asked for one too many so must remove that from
# the list.
refs = refs[0:-1]

Member:

This looks like an O(N) way to do a deletion from a list, while del refs[-1] would be O(1). Is there some reason the original list shouldn't be modified?

@dhirving (Contributor, Author):

That line is unchanged from the previous revision. The full history is that Tim had it as .pop() originally, which would be O(1). But when I pulled out query_all_datasets, that introduced a bug: query_all_datasets is a coroutine, and this function was mutating the list before query_all_datasets was done with it. Since this list is at most a few hundred thousand items and is sandwiched between network I/O and console I/O, I thought it was better to prevent similar bugs in the future by avoiding the mutation than to worry about ~1ms in a CLI command that takes 2 seconds to even start up.
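
As a minimal illustration of that bug class (hypothetical names; the real interaction is between the CLI code and the query_all_datasets coroutine):

from collections.abc import Iterator

def paged_results(refs: list[str]) -> Iterator[list[str]]:
    # Stands in for the query_all_datasets coroutine: it reads from `refs`
    # across multiple resumptions.
    yield refs[:2]
    yield refs[2:]

refs = ["a", "b", "c", "d"]
pages = paged_results(refs)
print(next(pages))  # ['a', 'b']
refs.pop()          # mutating the shared list mid-iteration...
print(next(pages))  # ...silently changes the second page to ['c']

# Rebinding a copy instead (refs = refs[0:-1]) leaves the generator's view of
# the original list intact, at the cost of an O(N) copy.
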

"""


class _StreamQueryDriverExecute(StreamingQuery):

Member:

I think you might need to repeat the [_TContext] generic parameterization here to avoid having it implicitly interpreted as StreamingQuery[Any], but I'm not sure.
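
For reference, the typing behavior being described, using stand-in classes rather than the real StreamingQuery:

from typing import Generic, TypeVar

_TContext = TypeVar("_TContext")

class StreamingQuery(Generic[_TContext]):  # stand-in for the real base class
    async def execute(self, context: _TContext) -> None: ...

class ImplicitlyAny(StreamingQuery):
    # Without a parameter list, type checkers treat this as StreamingQuery[Any].
    pass

class StillGeneric(StreamingQuery[_TContext]):
    # Repeating the parameterization keeps the subclass generic in _TContext.
    pass
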

@@ -27,6 +27,8 @@

from __future__ import annotations

from lsst.daf.butler.remote_butler.server.handlers._utils import set_default_data_id

Member:

This import should move down into the main section (I bet VSCode added it for you).

I'm also curious why this and many other imports here are absolute rather than relative.

@dhirving (Contributor, Author):

I think at the time they were added, there was a theory that the Butler server code was supposed to move into a separate repository at some point. Made them all relative since I don't think we want to do that anymore.
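
Concretely, for the import flagged above, the two spellings are equivalent only from a module inside the same handlers package:

# Inside a module in python/lsst/daf/butler/remote_butler/server/handlers/:
from lsst.daf.butler.remote_butler.server.handlers._utils import set_default_data_id  # absolute
from ._utils import set_default_data_id  # relative, within the same package
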

Commit messages:

  • Switch from the query_datasets convenience method to the advanced query system in query_all_datasets. This lets us get the results one page at a time, which will be needed to prevent memory exhaustion when running these queries on the server.
  • It turns out that the query-datasets CLI was not actually using dimension records, and it will simplify the implementation to not support this.
  • The backend for querying multiple dataset types will not support "order by", so restrict the CLI to match the implementation.
  • The upcoming implementation of query_all_datasets will not support order_by, so remove it. This requires modifying the query-datasets CLI to use the single dataset type query_datasets when order by needs to be supported.
  • In preparation for implementing query_all_datasets on the server, make the streaming response and timeout logic from the existing query handler re-usable.
  • After the refactor in the previous commit, this is somewhat independent of the query routes.
  • This will be shared by the RemoteButler query_all_datasets implementation in an upcoming commit.
  • This will be used in an upcoming commit to prevent excessive duplication of function parameters between implementations of query_all_datasets.
  • query_all_datasets can potentially involve hundreds or thousands of separate dataset queries. We don't want clients slamming the server with that many HTTP requests, so add a server-side endpoint that can handle these queries in a single request.
  • It turns out the QueryDatasets class is shared by multiple CLI scripts, some of which need dimension records included. So add back `with_dimension_records` to the internal implementation of query_all_datasets.
@dhirving merged commit 53005fb into main on Nov 14, 2024
19 checks passed
@dhirving deleted the tickets/DM-47375 branch on November 14, 2024 at 20:08