Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slowness in Item.to_dict(), seemingly from links #546

Closed
TomAugspurger opened this issue Jul 9, 2021 · 20 comments
Closed

Slowness in Item.to_dict(), seemingly from links #546

TomAugspurger opened this issue Jul 9, 2021 · 20 comments
Assignees
Labels
bug Things which are broken
Milestone

Comments

@TomAugspurger
Copy link
Collaborator

I'm trying to diagnose a slowdown I'm observing in pystac.to_dict() is slower. Right now I'm unsure if the slowdown is due to a change in pystac or in our STAC endpoint. Anyway, if I call clear_links() prior to to_dict() then things are faster (goes from about 1s to 200µs)

In [1]: import pystac

In [2]: item = pystac.Item.from_file("https://planetarycomputer-staging.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20160208T190542_R0
   ...: 13_T10TET")
   ...: %time _ = item.to_dict()
CPU times: user 25 ms, sys: 339 µs, total: 25.4 ms
Wall time: 987 ms

In [3]: item = pystac.Item.from_file("https://planetarycomputer-staging.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20160208T190542_R0
   ...: 13_T10TET")
   ...: item.clear_links()
   ...: %time _ = item.to_dict()
CPU times: user 167 µs, sys: 18 µs, total: 185 µs
Wall time: 192 µs

So a couple questions:

  1. Why is having those links present slowing things down? Glancing at the output of to_dict, I don't see why it would need to make an HTTP call or anything like that, right? Maybe through some chain calling link.get_href()?
  2. Have there been any recent changes that would have introduced this slowdown? We've also been making changes to that STAC endpoint that could have slowed things down. Brief testing shows that RC1 - RC3 all have similar performance on this same endpoint (about 1s to do to_dict()
  3. to_dict() has a parameter include_self_link. Could we add a parameter to exclude all links? For my application, I only need the assets and extension info, so making requests to fetch the link
@TomAugspurger
Copy link
Collaborator Author

All the time is seemingly spent on Item.get_root()

link = item.links[0]
%lprun -f pystac.link.Link.get_href link.get_href()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   112                                               def get_href(self) -> Optional[str]:
   121                                                   # get the self href
   122         1          8.0      8.0      0.0          if self.is_resolved():
   123                                                       href = cast(pystac.STACObject, self.target).get_self_href()
   124                                                   else:
   125         1         31.0     31.0      0.0              href = cast(Optional[str], self.target)
   126                                           
   127         1    1650663.0 1650663.0    100.0          if href and is_absolute_href(href) and self.owner and self.owner.get_root():

which seemingly does the (slow) I/O

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   187                                               def get_root(self) -> Optional["Catalog_Type"]:
   196         1         15.0     15.0      0.0          root_link = self.get_root_link()
   197         1          1.0      1.0      0.0          if root_link:
   198         1          2.0      2.0      0.0              if not root_link.is_resolved():
   199         1    1651180.0 1651180.0    100.0                  root_link.resolve_stac_object()

I notice that the (slow) conditional above is followed by another conditional at

pystac/pystac/link.py

Lines 134 to 137 in 12eff70

if root and root.is_relative() and self.rel in rel_links:
owner_href = self.owner.get_self_href()
if owner_href is not None:
href = make_relative_href(href, owner_href)
, and only if that's true do we actually do anything to the output. In our case, https://planetarycomputer-staging.microsoft.com/api/stac/v1/ is not relative, so we don't actually do anything. I'm not sure how the item / link could know that without actually reading the Catalog though...

lossyrob added a commit that referenced this issue Jul 9, 2021
There was existing unused arguments for seting the root on
to_dict. This commit implements and tests that the root is set to the
parameter if present, and adds the parameter to ItemCollection, which
itself does not have a root but instead passes the root onto each of
the Items parsed during the to_dict call.

Related to #546
@lossyrob
Copy link
Member

lossyrob commented Jul 9, 2021

For this test:

import pystac
from pystac_client import Client
import planetary_computer as pc

from time import perf_counter

stac = Client.open("https://planetarycomputer-staging.microsoft.com/api/stac/v1")

area_of_interest = {
    "type": "Polygon",
    "coordinates": [
        [
            [-122.27508544921875, 47.54687159892238],
            [-121.96128845214844, 47.54687159892238],
            [-121.96128845214844, 47.745787772920934],
            [-122.27508544921875, 47.745787772920934],
            [-122.27508544921875, 47.54687159892238],
        ]
    ],
}

search = stac.search(
    intersects=area_of_interest,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    query={
        'eo:cloud_cover': { 'le': 25 }
    },
    limit=500,  # fetch items in batches of 500
)
t1 = perf_counter()
items = list(search.get_items())
print(f"Found {len(items)} items")
t2 = perf_counter()
signed_items = pystac.ItemCollection(
    [pc.sign(item) for item in items]
)
t3 = perf_counter()
dicts = [item.to_dict() for item in items]
t4 = perf_counter()

print(f"Search took {t2-t1} seconds")
print(f"Sign took {t3-t2} seconds")
print(f"to_dict took {t4-t2} seconds")

Without the above PRs:

Search took 1.8861020999975153 seconds
Sign took 2.654093600001943 seconds
to_dict took 123.35427039999922 seconds

With them:

Search took 2.9420552999981737 seconds
Sign took 1.4421263000003819 seconds
to_dict took 1.4572951000009198 seconds

@TomAugspurger
Copy link
Collaborator Author

Fixed by #549 + stac-utils/pystac-client#72 (I think).

@gjoseph92
Copy link

I'm still experiencing extreme to_dict() slowness for the same reason (get_root()) with pystac== 1.1.0 and pystac_client== 0.2.0.
Screen Shot 2021-08-30 at 1 43 50 PM

Is that expected? I thought both of those changes had been released now.

FYI @TomAugspurger I was going to release a new stackstac version today with your and @scottyhq's pystac updates, but I'm going to hold off until this is addressed because waiting multiple minutes for stackstac.stack is pretty poor UX.

@TomAugspurger
Copy link
Collaborator Author

You're correct, both of these changes should be available in the versions of pystac & pystac-client you posted.

With the https://planetarycomputer.microsoft.com/api/stac/v1 endpoint, things are better, but still not great:

%%time
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
items = catalog.search(
    intersects=dict(type="Point", coordinates=[-106, 35.7]),
    collections=["sentinel-2-l2a"],
    datetime="2019-01-01/2020-01-01"
).get_all_items()

CPU times: user 396 ms, sys: 19.6 ms, total: 415 ms
Wall time: 1.36 s

(that gives 281 items instead of 294, dunno why)

and

%%time
dict_items = [item.to_dict() for item in items]

CPU times: user 3.95 s, sys: 243 ms, total: 4.2 s
Wall time: 19.2 s

Most of the time seems to be spent in item.to_links via item.to_dict() (this is with an item from the planterycomputer endpoint)

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   288                                               def to_dict(self, include_self_link: bool = True) -> Dict[str, Any]:
   289         1          3.0      3.0      0.3          links = self.links
   290         1          2.0      2.0      0.2          if not include_self_link:
   291                                                       links = [x for x in links if x.rel != pystac.RelType.SELF]
   292                                           
   293         1        224.0    224.0     25.2          assets = {k: v.to_dict() for k, v in self.assets.items()}
   294                                           
   295         1          2.0      2.0      0.2          if self.datetime is not None:
   296         1         43.0     43.0      4.8              self.properties["datetime"] = datetime_to_str(self.datetime)
   297                                                   else:
   298                                                       self.properties["datetime"] = None
   299                                           
   300         1          4.0      4.0      0.4          d: Dict[str, Any] = {
   301         1          4.0      4.0      0.4              "type": "Feature",
   302         1         42.0     42.0      4.7              "stac_version": pystac.get_stac_version(),
   303         1          5.0      5.0      0.6              "id": self.id,
   304         1          1.0      1.0      0.1              "properties": self.properties,
   305         1          1.0      1.0      0.1              "geometry": self.geometry,
   306         1        531.0    531.0     59.7              "links": [link.to_dict() for link in links],
   307         1          3.0      3.0      0.3              "assets": assets,
   308                                                   }
   309                                           
   310         1          4.0      4.0      0.4          if self.bbox is not None:
   311         1          3.0      3.0      0.3              d["bbox"] = self.bbox
   312                                           
   313         1          4.0      4.0      0.4          if self.stac_extensions is not None:
   314         1          1.0      1.0      0.1              d["stac_extensions"] = self.stac_extensions
   315                                           
   316         1          2.0      2.0      0.2          if self.collection_id:
   317         1          1.0      1.0      0.1              d["collection"] = self.collection_id
   318                                           
   319         2          5.0      2.5      0.6          for key in self.extra_fields:
   320         1          4.0      4.0      0.4              d[key] = self.extra_fields[key]
   321                                           
   322         1          1.0      1.0      0.1          return d

@TomAugspurger
Copy link
Collaborator Author

It's definitely related to link resolution though. If you clear the links on the earth-search endpoint, things look just fine in to_dict()

items2 = []
for item in items:
    item = item.clone()
    item.clear_links()
    items2.append(item)
    
ic2 = pystac.ItemCollection(items2)
%%time dict_items = [item.to_dict() for item in a_items[:20]]

CPU times: user 13.4 ms, sys: 3.77 ms, total: 17.2 ms
Wall time: 16.6 ms

I wonder what's best here. At a minimum, there's probably some work to do in pystac to speed up to_dict on links if possible. But since stackstac doesn't use links at all (I think), should it clear the links on pystac.ItemCollection / List[pystac.Item] inputs (at the cost of cloning the items)?

@gjoseph92
Copy link

Yeah, I was thinking the same. Clearing links would be reasonable now, but I'm also a little bit worried about the Asset.to_dict() time from your profile.

I haven't looked yet but I'm curious

  1. What links is pystac resolving?
  2. Why does it need to be resolved? (I would have thought these collections used absolute references, at least for assets)
  3. Is it a different link every time, or the same thing (root collection?) that could be cached?

@TomAugspurger
Copy link
Collaborator Author

TomAugspurger commented Aug 30, 2021

All good questions, and I don't have a good sense for any of them, but I can try!

  1. At a minimum, I think links to the root collection and self are common (required?)
  2. I think it can be either relative or absolute. I think all of ours are absolute though (maybe that could be a bit set on Item / ItemCollection?)
  3. The root collection / catalog will be common across all items, so there's a potential for an optimization there (if you're willing to cache that somewhere)

I made a snakeviz profile at https://gistcdn.githack.com/TomAugspurger/66c6084c0898c1c713e54163fc7769d2/raw/2282976c21185c38731d162878aa033075263821/item_to_dict_static.html (LMK if you need help reading that. https://jiffyclub.github.io/snakeviz/#interpreting-results has an explanation)

import pystac, pystac_client
%load_ext snakeviz

url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20191230T174729_R098_T13SDV_20201003T112518"
item = pystac.read_file(url)

%snakeviz item.to_dict()

@duckontheweb do you or anyone else have time / interest to look into this?

@duckontheweb
Copy link
Contributor

@TomAugspurger @gjoseph92 Thanks for bringing this up. I can take a look sometime in the next couple of days.

I've noticed that issue with slowness in Link.to_dict before and I think caching that get_root call should make things much better. I agree with @gjoseph92 that the Asset.to_dict slowness is concerning because that method should just be assigning values to a dictionary.

One thing @lossyrob and I have discussed a bit is moving towards having all JSON-like classes (Asset, Link, etc.) have an underlying dictionary for all of their fields rather than storing them in instance attributes and then converting them to a dictionary. In that case, our from_dict and to_dict methods could be much faster (since they would either be returning or copying dicitonaries) at the expense of slightly slower attribute access (since we would be using property accessors instead of instance attributes). I may take a crack at implementing this for the Asset class to see if it helps with this performance issue.

@duckontheweb
Copy link
Contributor

Regarding the slow Link.to_dict method:

That method makes a call to Item.get_root here. If that Item's root link has not been resolved, then this results in a network request to resolve that link to a Catalog. Since pystac_client does not pass a root argument to ItemCollection in the get_all_items method none of the Items have resolved root links, so a new network request is made for each one.

I think the best solution here is to pass root=self.client when instantiating ItemCollection as is done in get_item_collections.

cc: @matthewhanson

@lossyrob
Copy link
Member

lossyrob commented Sep 2, 2021

... I think caching that get_root call should make things much better.

Caching the get_root call is tough, because the mechanism for caching in PySTAC is centered around a root. The root has a resolution cache, which can resolve links via calls like this.

However, when no root is set, there's no caching structure in place and reading the same href over and over will be very expensive.

As you said @duckontheweb, a good solution to this particular issue is to ensure the root is set on the Items via the ItemCollection.

However, this seems to keep popping up as an inefficiency, and is hard to diagnose. Perhaps we should rethink keeping caching to a per-root basis, and instead make caching based on a singleton? This may only make sense for the HREF caching - the caching by object IDs that happens off of the root is necessary for proper object resolution (e.g. in cases where links point to the same object read at different times, but should represent the same target object in memory). There's probably some cache invalidation headaches that a global HREF cache would cause, but it might be worth removing these types of performance issues in a way that doesn't necessitate users of the library to always remember to pass in root.

@TomAugspurger
Copy link
Collaborator Author

I think the best solution here is to pass root=self.client when instantiating ItemCollection as is done in get_item_collections.

stac-utils/pystac-client#90 sets that.

gjoseph92 added a commit to gjoseph92/stackstac that referenced this issue Oct 28, 2021
Until stac-utils/pystac#546 is resolved, pystac is very painful to use, so I'd rather not have it be the lead example.

Partially Revert "Fix for Pystac ItemCollections (#69)"

This reverts commit 98809b4.
gjoseph92 added a commit to gjoseph92/stackstac that referenced this issue Oct 28, 2021
Until stac-utils/pystac#546 is resolved, pystac is very painful to use, so I'd rather not have it be the lead example.

Partially Revert "Fix for Pystac ItemCollections (#69)"

This reverts commit 98809b4.
@TomAugspurger
Copy link
Collaborator Author

Just noting that the slowness in #546 (comment) is fixed now, possibly by stac-utils/pystac-client#90.

#546 (comment) has some more thoughts on a way to speed things up even more. Do we want to keep this issue open to track that work? Or should we make a new one dedicated to it?

@gjoseph92
Copy link

Confirmed, it does seem better with pystac-client=0.3.0! Thanks @TomAugspurger.

I feel like this issue could be closed, and those ideas moved to a new one?

@duckontheweb
Copy link
Contributor

I'm +1 for closing this and opening a new issue. #617 starts to implement some of the ideas from #546 (comment), but needs some more work.

@gjoseph92
Copy link

FYI to_dict is still a bit slow because of stac-utils/pystac-client#115

gjoseph92 added a commit to gjoseph92/planetary-computer-sdk-for-python that referenced this issue Oct 28, 2021
Losing the root collection was causing `to_dict()` on signed `ItemCollections` to be extremely slow, like stac-utils/pystac#546. By preserving the root catalog stac-utils/pystac-client#72, it's much faster.
TomAugspurger pushed a commit to microsoft/planetary-computer-sdk-for-python that referenced this issue Oct 29, 2021
Losing the root collection was causing `to_dict()` on signed `ItemCollections` to be extremely slow, like stac-utils/pystac#546. By preserving the root catalog stac-utils/pystac-client#72, it's much faster.
@scottyhq
Copy link

scottyhq commented Nov 12, 2021

Wanted to point out that while things work well with pystac_client search returns, the from_file method is still slow. I don't really understand the root resolution issues. But this affects loading saved search results, example below:

%%time

import pystac_client #0.3.0
import pystac #1.1

URL = "https://earth-search.aws.element84.com/v0"
catalog = pystac_client.Client.open(URL)

results = catalog.search(
    intersects=dict(type="Point", coordinates=[-105.78, 35.79]),
    collections=["sentinel-s2-l2a-cogs"],
    datetime="2010-04-01/2021-12-31",
    limit=1000,
)

print(f"{results.matched()} items found") # 604 items
stac_items = results.get_all_items() 
stac_items.save_object('items.json')
# Wall time: 2.0s
%%time 
items = [item.to_dict() for item in stac_items]
# Wall time: 69.3 ms
%%time 
stac_items = pystac.ItemCollection.from_file('items.json')
items = [item.to_dict() for item in stac_items]
# !!!! Wall time: 1min 13s !!!

I feel like the from_file() can do the root setting automatically since root urls are in the search results? Or else that method also needs to have a root kwarg:

def from_file(

%%time
# Awkward workaround
import json
with open('items.json') as f:
    featureCol = json.load(f)
    
e84 = pystac_client.Client.open('https://earth-search.aws.element84.com/v0')
stac_items = pystac.ItemCollection.from_dict(featureCol, preserve_dict=False, root=e84)
items = [item.to_dict() for item in stac_items]
# Wall time: 654 ms

@lossyrob
Copy link
Member

With #663, and using the new flag:

stac_items = pystac.ItemCollection.from_file('items.json')
items = [item.to_dict(transform_hrefs=False) for item in stac_items]

this should be as fast as your workaround example. Also ItemCollection.to_dict() will not transform HREFs by default.

@duckontheweb duckontheweb modified the milestones: 1.2.0, 1.4.0 Nov 17, 2021
@duckontheweb
Copy link
Contributor

#663 is included in v1.2.0, but I am leaving this issue open and assigning to a future release milestone so we can continue to explore some of the options outlined in the above comment and in #546 (comment)

@pjhartzell
Copy link
Collaborator

pjhartzell commented Feb 1, 2023

Closing Summary

The issue was slowness in the pystac.Item.to_dict() method. The cause was identified as root link resolution. The problem was compounded when calling to_dict() on each Item in an ItemCollection, in which case the root link was resolved each time. Several fixes were implemented:

There was discussion of additional caching for further performance improvements, but the complexity of the solution seems high. A topic that continues to surface - using a dictionary for all fields in JSON-like classes (Asset, Link, etc.) - was also investigated (see #617) as a performance boost since the to_dict() methods would be returning or copying dictionaries rather than converting classes. However, this change is about more than just performance, e.g., it would enable flexible changes to Item fields, and is probably better addressed in a dedicated issue or RFC.

There are now mechanisms in place that address the original issue. It can be re-opened if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things which are broken
Projects
None yet
Development

No branches or pull requests

7 participants