Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacement functionality for get_all_items #237

Closed
TomAugspurger opened this issue Jun 8, 2022 · 18 comments
Closed

Replacement functionality for get_all_items #237

TomAugspurger opened this issue Jun 8, 2022 · 18 comments

Comments

@TomAugspurger
Copy link
Collaborator

(Followup to the deprecations from #206 which I haven't tested yet, so apologies if I missed something)

Over in https://github.com/microsoft/planetarycomputerexamples we use search.get_all_items() to get a single ItemCollection, which is then handed off to stackstac or odc.stac. Our use case requires having a concrete sequence of items ahead of time, not an iterator / iterable. What's the recommended replacement API? Something like

import pystac
import pystac_client

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

time_range = "2020-12-01/2020-12-31"
bbox = [-122.2751, 47.5469, -121.9613, 47.7458]

search = catalog.search(collections=["landsat-c2-l2"], bbox=bbox, datetime=time_range)

pystac.ItemCollection(search.get_items(), clone_items=False)  # the replacement?

I have a couple concerns with this replacement, if it's the recommended alternative:

  1. Previously, we had some surprisingly slow performance related to copying items and setting the right owner. It's not immediately clear to me whether the replacement does the right thing everywhere.
  2. I worry about how easy it would be forget clone_items=False and now our users will get slower results until they profile things.

Given how common this use case is (at least in my corner of the world), I think it makes sense to keep a method on the search object that returns a concrete ItemCollection.


As an aside, I don't actually see a warning when running the old code on the Planetary Computer Hub using pystac 0.4.0, since pystac-client is emitting a DeprecationWarning. It's changed over the years, but DeprecationWarning isn't always visible to the end user. I don't know if pystac-client has a formal policy, but pandas uses either a FutureWarning for deprecations that will affect user-code, and DeprecationWarning for code that will likely affect library code (which is then upgraded to a FutureWarning in a future release).

@gadomski
Copy link
Member

If a user wants a List[pystac.Item], they can

items = list(item_search.items())

If they want un-PySTAC'd dictionaries:

items = list(item_search.items_as_dicts())

After discussion with yourself, @lossyrob, @matthewhanson, and @pjhartzell, it seems like the consensus is that pystac-client fulfills the role of "user convenience library", i.e. its API should solve the 80% use-case for the average user, rather than being a strict implementation (PySTAC is the "strict implementation" library). If that's the case, then we should:

  1. Un-deprecate ItemSearch.get_all_items and ItemSearch.get_all_items_as_dict
    • Question: should we keep those methods deprecated in favor of ItemSearch.all_items and ItemSearch.all_items_as_dict to bring them in line with the iterator versions?
  2. Decide on the behavior when an items or items_as_dicts iterator gets to max_items. Should it:
    • error?
    • warn?
    • something else?
  3. Bring Client.get_all_items in line with ItemSearch.get_all_items (right now Client.get_all_items returns an iterator, which is different than ItemSearch.get_all_items)

I don't know if pystac-client has a formal policy, but pandas uses either a FutureWarning for deprecations that will affect user-code, and DeprecationWarning for code that will likely affect library code (which is then upgraded to a FutureWarning in a future release).

There isn't a formal policy. I don't know if I agree that FutureWarning is appropriate in this case. From https://docs.python.org/3/library/exceptions.html#FutureWarning:

when those warnings are intended for end users of applications that are written in Python

To me, the changes implemented in #206 are API changes, not application changes. To me, an "application change" would be a change to CLI behavior (e.g.). The docs for DeprecationWarning state:

when those warnings are intended for other Python developers

While notebooks blur the line between "Python developers" and "application users", you're still writing Python code in a notebook, so I'd say those users are "Python developers." Very happy to be argued out of this position, especially from the "what people actually see via the default warning filters" side of things.

@TomAugspurger
Copy link
Collaborator Author

TomAugspurger commented Jul 8, 2022

Decide on the behavior when an items or items_as_dicts iterator gets to max_items. Should it:

xref #258, but one thing I didn't appreciate about the 0.4.0 release was that the default max_items behavior changed. Regardless of the behavior when max_items is reached, I think the default should be changed back to None.

And if users are explicitly saying max_items=100, then it's perfectly reasonable to (silently) return the 100 items even if there are (maybe) additional items in the result set. IMO there's no need for a warning or error if users are being explicit.

TomAugspurger pushed a commit to microsoft/planetary-computer-containers that referenced this issue Jul 14, 2022
@TomAugspurger
Copy link
Collaborator Author

@gadomski, @philvarner do either of you have thoughts on the max_items change (leaving aside the other deprecations)?

I'm happy to submit a PR updating the behavior of max_items to match my expected behavior in #237 (comment).

@gadomski
Copy link
Member

@gadomski, @philvarner do either of you have thoughts on the max_items change (leaving aside the other deprecations)?

I'm still of two minds about which I prefer, but since you have a concrete use-case I'd say a PR would be good, and we can use that PR's discussion to come to a decision (I expect I'll be 👍 because you've got a use-case :-) ).

I'll also open PR(s) for my other two proposed changes, and we can use those for discussion on those (since they're a little separate):

Un-deprecate ItemSearch.get_all_items and ItemSearch.get_all_items_as_dict

and

Bring Client.get_all_items in line with ItemSearch.get_all_items (right now Client.get_all_items returns an iterator, which is different than ItemSearch.get_all_items)

TomAugspurger pushed a commit to TomAugspurger/pystac-client that referenced this issue Jul 25, 2022
xref stac-utils#237 (comment)

Doesn't close that issue, which also deals with the replacement for the
deprecated methods. This just reverts the change to the default
`max_items` to return all matching items by default.
@philvarner
Copy link
Collaborator

I think what happened here is we (mostly "I") made two distinct changes that didn't work well together.

The overall goal was to change the default behavior seen in get_all_items so that, by default, it wouldn't, for example, try to retrieve millions of items and put them all into memory before returning a value to the caller.

I would propose two changes:

  1. set max_items back to None, but it only applies to the deprecated ItemCollection or List returning methods. Keep those methods deprecated.
  2. items() (and any other iterator methods) do not have max_items applied to them, and they just keep iterating. If a user only wants a certain number of items, they should (pythonically) islice the iterator rather than using our proprietary parameters (e.g., max_items)

@TomAugspurger
Copy link
Collaborator Author

TomAugspurger commented Jul 26, 2022

Thanks for the context @philvarner!

The overall goal was to change the default behavior seen in get_all_items so that, by default, it wouldn't, for example, try to retrieve millions of items and put them all into memory before returning a value to the caller.

I don't think I agree with this goal: it goes against what I would expect (and presumably what @duckontheweb or @matthewhanson expected when they initially wrote this). If I execute a query I expect to get back all the matching items. Having the ability to easily limit the result set is great, but is a surprising default (to me).

If we're worried about the UX of returning a potentially large result set, then there are perhaps other ways to indicate that to the user (logging progress or using tqdm come to mind).

If we want to take inspiration from other, perhaps similar contexts, then APIs like https://peps.python.org/pep-0249/ and libraries like https://www.psycopg.org/docs/ come to mind.

In that SQL context, the APIs provide separate methods fetchone, fetchmany, and fetchall (fetchall being similar to get_all_items()). The cursor objects are also iterable, returning all the objects matched by the query. And the maxinum number of results is specified within the query itself using SQL LIMIT clause (I suppose the STAC API doesn't have something similar, so this must be done client side).

So the analogy to pystac-client would be:

  1. max_items instead of embedding a limit in the query
  2. get_all_items instead of fetchall
  3. The default behavior of iterating over a search result (search.items()) would return all matching items.

they should (pythonically) islice the iterator rather than using our proprietary parameters

It's worth emphasizing that this library will often be used interactively, and that should inform the design. Doing catalog.search(..., max_items=100) is easier than importing itertools and calling islice on the result.

I'd personally find max_items only applying to methods like get_all_items() but not .items() surprising.

@philvarner
Copy link
Collaborator

I think fetchall is not a good design. A call to catalog.search().get_all_items() to PC would likely never complete -- it would block indefinitely until either the server returned an error or memory was exhausted. I think having a method with a default behavior like this not an ideal design choice, where there are better alternatives.

I forgot max_items was in search(...), so I could see instead moving it to the items method(s), which is probably where it should have been anyway, which would get rid of the need for islice:

list_of_items = list(catalog.search(...).items(max_items=100))

get_all_items is the same as list(catalog.search(...).items()), so I don't think there's a need for both if max_items is reverted to default to None.

@TomAugspurger
Copy link
Collaborator Author

TomAugspurger commented Jul 26, 2022

so I could see instead moving it to the items method(s)

Just to confirm, you're proposing adding a new parameter to .items() and friends with a default of max_items=None, implying that all the matching items are returned? As long as the default is to return all matching items, then I don't have a preference where the user specifies it.

get_all_items is the same as list(catalog.search(...).items()), so I don't think there's a need for both if max_items is reverted to default to None.

It's not quite the same. get_all_items returns a pystac.ItemCollection rather than a plain list. Some downstream users (microsoft/planetary-computer-sdk-for-python#38, I think places in stackstac) handle ItemCollections differently than list[pystac.Item].

@philvarner
Copy link
Collaborator

  1. Correct.
  2. Ah, I forgot get_all_items returned an IC rather than List[Item]. In that case, pystac.ItemCollection(catalog.search(...).items()) might work? How do they handle them differently?

@TomAugspurger
Copy link
Collaborator Author

TomAugspurger commented Jul 26, 2022

Actually, did the return type of get_all_items change in the most recent pystac-client too? At least in prior versions it returned an ItemCollection. Now I'm not sure just from looking at the source code.

pystac.ItemCollection(catalog.search(...).items())

Close, but to avoid #49 you'd also need to pass clone_items=False. I'm worried that users will miss that and suffer poorer performance than what we have today (at least in the last release) from get_all_items() which we know does the right thing.

In the case of planetary_computer we're able to sign objects with known types (ItemCollection, Item, etc.). We could add support for signing sequences of things if necessary.

@philvarner
Copy link
Collaborator

get_all_items still returns an ItemCollection, doesn't it? Maybe I'm looking at the wrong code too.

So something like:

catalog.search(...).item_collection(max_items=None) # required keyword arg

which does apply clone_items=False. get_all_items would still be deprecated in favor of this method, that's more clearly named and explicit in it's danger.

@TomAugspurger
Copy link
Collaborator Author

TomAugspurger commented Jul 27, 2022

Yeah, I was confused. ItemSearch.get_all_items() does return an ItemCollection in the latest pystac-client. I was looking at pystac_client.Client.get_all_items().

get_all_items would still be deprecated in favor of this method [.item_collection()], that's more clearly named

Makes sense. I can take care of that PR.


I really don't think that max_items should be a required keyword. I think the previous behavior of (by default) returning all the items was the right one. I suppose we have different values on what the right thing to do here is, but I'll try to re-state the case.

If I make a .search(collections=..., datetime=..., intersects=...) I really, really expect that search to return everything matching the conditions (regardless of how I consume it). That just matches all of my mental models for what's going on here, and I'd be very surprised to (silently!) drop items matching my query. It's so surprising I'll update essentially every use of pystac-client in the Planetary Computer docs to always set max_items if / when we update pystac-client (I should probably update them today, for users reading our docs and using the latest pystac-client).


Last thing regarding the API Client.search(..., max_items=10) vs. Client.search(...).items(max_items=10) I don't have a strong preference. I don't imagine people are repeatedly using .items() or using the various .get_* methods from the same search object, so it's the same amount of typing either way. My question would be whether it's worth the churn to go through a deprecation cycle for a new API. The proposed new .items(max_items=....) doesn't seems too much better to me.

That said, perhaps the fact that other parameters to client.search(...) map to the STAC API, while max_items doesn't is sufficient motivation to do the deprecation?

TomAugspurger pushed a commit to TomAugspurger/pystac-client that referenced this issue Jul 27, 2022
Adds an ItemSearch.item_collection method as a replacement for
the deprecated `get_all_items()`.

A few other notes:

1. I haven't implemented any of the `max_items` changes being worked
   out in stac-utils#237. That
   can be done before / after this PR once things are ironed out
2. I changed the `get_all_items` method to use Sphinx's builtin
   `deprecated` directive.
3. I updated the tests (a bit)

I'd like to enable pytest's `-W error` flag to fail on warnings. But
it looks like the `cli` is using the (deprecated) `get_all_items_as_dict`
method. I haven't touched that in this PR, but I don't think the CLI
should be using a deprecated function.
TomAugspurger pushed a commit to TomAugspurger/pystac-client that referenced this issue Jul 27, 2022
Adds an ItemSearch.item_collection method as a replacement for
the deprecated `get_all_items()`.

A few other notes:

1. I haven't implemented any of the `max_items` changes being worked
   out in stac-utils#237. That
   can be done before / after this PR once things are ironed out
2. I changed the `get_all_items` method to use Sphinx's builtin
   `deprecated` directive.
3. I updated the tests (a bit)

I'd like to enable pytest's `-W error` flag to fail on warnings. But
it looks like the `cli` is using the (deprecated) `get_all_items_as_dict`
method. I haven't touched that in this PR, but I don't think the CLI
should be using a deprecated function.
@lossyrob
Copy link
Member

I really don't think that max_items should be a required keyword. I think the previous behavior of (by default) returning all the items was the right one. I suppose we have different values on what the right thing to do here is, but I'll try to re-state the case.

Voicing strong support for this perspective. The previous behavior matches what users expect; making things unexpected to real use cases in order to protect users from themselves isn't good design pattern to me.

Unless there are others who want to voice strong support that we need required protections in this case, I suggest that we move forward with Tom's suggested implementation and revert to the original max_items behavior.

@duckontheweb
Copy link
Collaborator

I’m also in favor of reverting to the old behavior of not limiting the number of Items returned. I got bitten by this recently when I was aggregating response geometries and it took me a long time to figure out why they were not adding up. I think the possibility of using up a bunch of resources by blindly calling this on a large response set it just one of the dangers of working with big data and is best managed by the user.

@philvarner
Copy link
Collaborator

giphy

@philvarner
Copy link
Collaborator

(1) use of ItemCollection in odcstac and stackstac

Earlier Tom said:

Over in https://github.com/microsoft/planetarycomputerexamples we use search.get_all_items() to get a single ItemCollection, which is then handed off to stackstac or odc.stac. Our use case requires having a concrete sequence of items ahead of time, not an iterator / iterable. What's the recommended replacement API? Something like

The odc.stac.load method takes an Interable[Item] (https://github.com/opendatacube/odc-stac/blob/1db2b1a7a10fb1a1fbd1bfee0a44b50fee22e911/odc/stac/_load.py#L203):

def load(
    items: Iterable[pystac.item.Item],

An ItemCollection is also an Interable[Item] because it implements __iter__ to return Items

I checked and there aren't any uses of ItemCollection in odc_stac.

In stackstac, the stack() method accepts a lot of types, but not Sequence[pystac.Item]:

def stack(
    items: Union[ItemCollectionIsh, ItemIsh],
...

ItemSequence = Sequence[ItemDict]

ItemIsh = Union[SatstacItem, PystacItem, ItemDict]
ItemCollectionIsh = Union[
    SatstacItemCollection, PystacCatalog, PystacItemCollection, ItemSequence
]

I've filed an issue requesting support for that:

(2) ItemCollection mangles Items in default constructor

Filed an issue to change the default to cloning rather than not, since correct behavior is better than speed by default, and this has bitten nearly everyone who's used it.

(3) moving max_items out of search() method

One other issue we have with the method is that the search() method has a bunch of parameters that configure things that aren't actually part of the search query. For example, max_items is a behavior on getting the results of the search rather than the query. We should move those out into the accessor methods rather than having them in search

philvarner pushed a commit that referenced this issue Jul 28, 2022
Adds an ItemSearch.item_collection method as a replacement for
the deprecated `get_all_items()`.

A few other notes:

1. I haven't implemented any of the `max_items` changes being worked
   out in #237. That
   can be done before / after this PR once things are ironed out
2. I changed the `get_all_items` method to use Sphinx's builtin
   `deprecated` directive.
3. I updated the tests (a bit)

I'd like to enable pytest's `-W error` flag to fail on warnings. But
it looks like the `cli` is using the (deprecated) `get_all_items_as_dict`
method. I haven't touched that in this PR, but I don't think the CLI
should be using a deprecated function.
@philvarner
Copy link
Collaborator

Filed an issue to default max_items back to None

#278

gadomski pushed a commit that referenced this issue Aug 1, 2022
* Restore default max_items behavior

xref #237 (comment)

Doesn't close that issue, which also deals with the replacement for the
deprecated methods. This just reverts the change to the default
`max_items` to return all matching items by default.

* vcr

* alias
@TomAugspurger
Copy link
Collaborator Author

I think everything here is addressed on main. Thanks everyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants