Debug LS and S2 STAC iteration #50

Merged
merged 19 commits into from
Jan 24, 2022

Conversation

banesullivan
Contributor

Follow up to #44

@codecov-commenter

codecov-commenter commented Dec 14, 2021

Codecov Report

Merging #50 (1b13b45) into main (4b2f7b9) will decrease coverage by 0.06%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main      #50      +/-   ##
==========================================
- Coverage   74.03%   73.96%   -0.07%     
==========================================
  Files          34       34              
  Lines         697      699       +2     
==========================================
+ Hits          516      517       +1     
- Misses        181      182       +1     

Impacted Files | Coverage Δ
rgd-watch-client/rgd_watch_client/plugin.py | 91.66% <50.00%> (-3.79%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@banesullivan
Contributor Author

Update: we need to issue smaller queries because stac-server has some serious limitations.

The error we are hitting (the limit that Elasticsearch's index.max_result_window setting controls, 10,000 hits by default) occurs because stac-server is improperly using the Elasticsearch scroll API. This seems like a serious flaw in stac-server's implementation and a significant limitation of using stac-server.

Specifically, the Elasticsearch API docs include this note:

We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).

ref https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#scroll-search-results

This error is indeed caused by missing functionality (or a bug) in stac-server itself, and it has already been reported to stac-server here: stac-utils/stac-server#111

To work around this until stac-server resolves the issue, we can partition the data using a filter on the datetime (e.g. per day) with the item search endpoint: https://api.stacspec.org/v1.0.0-beta.4/item-search/#operation/getItemSearch
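
A minimal sketch of the per-day partitioning, for illustration only. The helper name `daily_datetime_windows` is hypothetical and is not part of the ingest scripts. It builds one closed `start/end` interval string per day, the interval form that the item search `datetime` parameter accepts.

```python
from datetime import date, timedelta
from typing import Iterator


def daily_datetime_windows(start: date, end: date) -> Iterator[str]:
    """Yield one 'start/end' datetime interval string per day in [start, end].

    Hypothetical helper: each window spans 00:00:00Z to 23:59:59Z of one day,
    so an item stamped in the final fraction of a second of a day would fall
    between windows. Widen the end bound if that matters for your data.
    """
    day = start
    while day <= end:
        iso = day.isoformat()
        yield f"{iso}T00:00:00Z/{iso}T23:59:59Z"
        day += timedelta(days=1)


# Three one-day windows, each small enough to stay well under the result limit
# as long as a single day holds fewer items than the server allows.
windows = list(daily_datetime_windows(date(2021, 12, 1), date(2021, 12, 3)))
```

Each string can then be passed as the `datetime` filter of a separate search, so no single query has to page deeply.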

@banesullivan banesullivan marked this pull request as ready for review December 17, 2021 23:39
@banesullivan
Contributor Author

I've updated the LS and S2 ingest scripts to use pystac-client. We'll see if it's able to iterate over the whole collection in a reasonable amount of time. If not, I will try to parallelize in a follow-up PR as we parallelize the POST requests to RGD.
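
For context, a rough sketch of what iterating a collection in date-partitioned searches could look like. This is not the code in this PR: `iter_items` is a hypothetical helper, and it assumes a client object with a pystac-client style `search(collections=..., datetime=...)` method whose result exposes `items()` (older pystac-client releases named this `get_items()`).

```python
from typing import Any, Iterable, Iterator


def iter_items(client: Any, collection: str, windows: Iterable[str]) -> Iterator[Any]:
    """Yield every item in `collection`, running one search per datetime window.

    `client` is assumed to behave like `pystac_client.Client`; `windows` is an
    iterable of 'start/end' datetime interval strings (e.g. one per day).
    """
    for window in windows:
        search = client.search(collections=[collection], datetime=window)
        # Assumption: recent pystac-client releases expose ItemSearch.items().
        yield from search.items()


# Usage sketch (not executed here; the API root below is an assumption
# inferred from the collection items URL used in the ingest script):
#
#   from pystac_client import Client
#   client = Client.open("https://landsatlook.usgs.gov/stac-server")
#   for item in iter_items(client, "landsat-c2l1", ["2021-12-01T00:00:00Z/2021-12-01T23:59:59Z"]):
#       ...
```

Passing the client in, rather than opening it inside the helper, keeps the loop easy to exercise against a stub without hitting the server.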

from watch_helpers import post_stac_items_from_server

host_url = 'https://landsatlook.usgs.gov/stac-server/collections/landsat-c2l1/items'
min_date = datetime(2013, 1, 1) # Arbitrarily chosen
Contributor Author

@matt-bernstein, any thoughts on a minimum ingest date for Landsat?

I'm waiting to see how long it takes to ingest, so assuming the last 8 years of data doesn't take more than an hour or so to ingest, there shouldn't be much of a limitation on our end

@banesullivan
Contributor Author

banesullivan commented Dec 18, 2021

stac-server's response times are proving to be slow. We're seeing ~5 seconds to retrieve a single day's items. If we want to iterate over the last 5 years (arbitrarily chosen but we'll def want more data than that), it would take over 2.5 hours (365 * 5 * 5 / 60 / 60 = 2.535) if done serially just to retrieve the items
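
The estimate above can be reproduced with a small back-of-the-envelope helper. The function name and the `workers` parameter are hypothetical. The `workers` division assumes ideal linear speedup, which the server may not deliver if it throttles concurrent requests.

```python
def retrieval_hours(years: int, seconds_per_day_query: float, workers: int = 1) -> float:
    """Estimate hours needed to fetch one query per day over `years` years.

    Assumes 365 days per year and, when workers > 1, perfectly linear
    speedup (an idealization, not a measured result).
    """
    total_seconds = 365 * years * seconds_per_day_query
    return total_seconds / workers / 3600


hours_5y = retrieval_hours(5, 5.0)   # the 5-year case from the comment above
hours_8y = retrieval_hours(8, 5.0)   # back to the 2013 min_date, roughly 8 years
```

With ~5 seconds per day, 5 years comes to about 2.53 hours serially and 8 years to about 4.06 hours, which is why retrieval is a candidate for parallelization.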

@banesullivan banesullivan merged commit 951d029 into main Jan 24, 2022
@banesullivan banesullivan deleted the test-ls-s2 branch January 24, 2022 06:41