
Best Practice: requestor pays #896

Closed

cholmes opened this issue Sep 10, 2020 · 22 comments

@cholmes (Contributor) commented Sep 10, 2020

As suggested on gitter by @matthewhanson, it'd be good to have a best practice on URLs that are 'requester pays'. We should capture these thoughts and put them in the best practices.

'For requester pays URLs I've been using the s3 URL, e.g., s3://syncarto-data-rp/stac/naip/catalog.json.
Since the http URL is useless on its own unless you sign it, just working with the s3 URLs directly (with the AWS CLI or boto3) is easier, I think. Plus you can use PySTAC to support s3 reads/writes.
If public, then I use the actual http URL.
This might be a good thing to add to best practices.'

'Even better might be to keep the STAC metadata in a different, and completely public, bucket that isn’t requester pays. Normally I like the data alongside the STAC Items, but I think it’s better if it’s public. That way you can use tools like STAC browser, PySTAC, without authentication for just the metadata.'
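A minimal sketch of what reading requester-pays STAC metadata could look like. The helper below is hypothetical (not from PySTAC or the spec); only the bucket name comes from the example above, and the boto3 call is shown in comments since it needs AWS credentials:

```python
from urllib.parse import urlparse

def s3_get_object_params(s3_url, requester_pays=True):
    """Split an s3:// URL into the kwargs boto3's S3 get_object expects."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {s3_url}")
    params = {"Bucket": parsed.netloc, "Key": parsed.path.lstrip("/")}
    if requester_pays:
        # Without this flag, a requester-pays bucket returns AccessDenied.
        params["RequestPayer"] = "requester"
    return params

# Usage with boto3 (not run here; requires AWS credentials):
#   import boto3, json
#   s3 = boto3.client("s3")
#   resp = s3.get_object(**s3_get_object_params(
#       "s3://syncarto-data-rp/stac/naip/catalog.json"))
#   catalog = json.loads(resp["Body"].read())
```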

cholmes added this to the 1.0.0-beta.3 milestone on Sep 10, 2020
@m-mohr (Collaborator) commented Sep 11, 2020

Is this just for S3? Would a person buying data at Planet also be "requester pays"? Or how exactly is that defined outside of S3?

@davidraleigh (Contributor) commented Sep 11, 2020

This is a field I have on the grpc STAC version of assets:
https://geo-grpc.github.io/api/#epl.protobuf.v1.Asset

It's also used in google cloud:
https://cloud.google.com/storage/docs/requester-pays

And I imagine it also exists in Azure

@jflasher commented

I think it'd definitely be good to have requester pays called out in the metadata as it presents a technical and financial difference in how you access the data. I have tried to create the signatures myself for use with straight HTTP requests, but always fall back on the available SDKs. Also, at least for AWS, there are two costs incurred with requester pays, the egress and a per-request fee. The per-request fee is generally very small compared to egress cost, but this is not always the case (specifically in cases of listing the bucket contents) and likely should be mentioned for completeness.

@philvarner (Collaborator) commented Sep 15, 2020

S3 and Google have RP, Azure apparently does not.

Overall, I think these concepts are cross-provider (e.g., not only S3) and useful enough to warrant an extension.

I like some of the fields in @davidraleigh 's link -- a few comments on them:

  • cloud_platform (Google cloud, Azure, etc) -- I think this is good to have. I use the "s3://" scheme in most of my asset hrefs, but I could just as well have used the virtual-host-style https uris. I assume this can be done with Google storage instead of "gs://", and Azure blob storage only uses https uris (e.g., https://myaccount.blob.core.windows.net/mycontainer/myblob)
  • bucket_region - I only know about S3, and there's no way to find out which region an object is in from just the "s3://" uri without making a request to us-east-1 and seeing where you get redirected.
  • bucket and object_path - duplicate what's in the href, but may be useful so you don't have to parse it?
  • requester_pays - would definitely be useful to have this. In most of our code we just set it on by default, since it has no effect if the bucket isn't RP. But if you make a request to an RP bucket without setting the RP flag, you get a generic AccessDenied message, and it's not apparent that you needed to set the flag. I could imagine an end-user client like QGIS wanting to pop up a warning stating "You're going to incur cost to download this, continue?" rather than just pulling a lot of data without the user really knowing that's happening (e.g., it's not as apparent as when using the awscli)
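A sketch of the client-side check that warning could hang off of. The requester_pays asset field here is hypothetical (it's the one proposed in this thread, not a published STAC field), and the href is made up:

```python
def needs_cost_warning(asset):
    """True when downloading this asset may bill the requester.

    'requester_pays' is a hypothetical asset field from the extension
    discussed in this thread; an absent field is treated as a public bucket.
    """
    return bool(asset.get("requester_pays", False))

# Example: a client like QGIS could prompt before downloading.
asset = {
    "href": "s3://example-rp-bucket/scene.tif",  # made-up asset
    "requester_pays": True,
}
if needs_cost_warning(asset):
    print("You're going to incur cost to download this, continue?")
```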

@matthewhanson (Collaborator) commented

A couple years ago we talked about "storage profiles" for STAC to describe some of these things, but nothing ever came of it.

I think a "cloud_storage" extension is warranted (or maybe just "cloud"). It can be set in Item properties, but could also be set per asset using the general Asset specific metadata rule:

Fields:

  • platform: aws, azure, google, etc
  • region: It would be useful to have region since, as @philvarner points out, you otherwise have to make a request to find it out (there's a get_region function in boto3).
  • requester_pays: true/false?
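Sketching what those fields might look like on an asset, using the general Asset-specific metadata rule. This is purely illustrative since no such extension exists yet; the href, region, and field names are the ones floated above, not a published schema:

```json
{
  "assets": {
    "image": {
      "href": "s3://example-data-rp/scenes/scene-001.tif",
      "type": "image/tiff; application=geotiff",
      "platform": "aws",
      "region": "us-west-2",
      "requester_pays": true
    }
  }
}
```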

I'd avoid putting in bucket and object path; converting between s3 and http URLs is easy enough, and it would be good to avoid duplication.
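For reference, a sketch of that round trip between s3:// and virtual-hosted-style HTTPS URLs. Simplified: it assumes the standard amazonaws.com endpoint and a known region, and ignores edge cases like dots in bucket names:

```python
from urllib.parse import urlparse

def s3_to_https(s3_url, region="us-east-1"):
    """Convert s3://bucket/key to a virtual-hosted-style HTTPS URL."""
    p = urlparse(s3_url)
    return f"https://{p.netloc}.s3.{region}.amazonaws.com{p.path}"

def https_to_s3(https_url):
    """Convert a virtual-hosted-style HTTPS URL back to s3://bucket/key."""
    p = urlparse(https_url)
    bucket = p.netloc.split(".s3.")[0]
    return f"s3://{bucket}{p.path}"
```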

@davidraleigh (Contributor) commented Sep 15, 2020

We use STAC a lot internally, so object_path and bucket are useful to those internal users who have access permissions to use them, but for customers there is an href that isn't constructed from bucket + object_path.

@matthewhanson (Collaborator) commented

@davidraleigh Ah, so this is really a case where you might have multiple URLs to the same assets.
We've run into this where we use s3 URLs, but for external users we have CloudFront URLs. We've been handling that just by translating the URLs in a service built on top of the normal STAC API.

I could see an "alternate_hrefs" array in assets for something like this, if we wanted it to be more general. This would also be able to represent actual data mirrors.
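An illustrative shape for that idea. Both alternate_hrefs and the URLs are hypothetical, just to show how one asset could carry an internal s3 URL plus a public mirror:

```json
{
  "assets": {
    "image": {
      "href": "s3://internal-bucket/scenes/scene-001.tif",
      "alternate_hrefs": [
        "https://d1234example.cloudfront.net/scenes/scene-001.tif"
      ]
    }
  }
}
```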

@davidraleigh (Contributor) commented

I'm stumped as to which is the clearest method. I love object_path and bucket, because I think of everything as having a bucket. But I could see something like alternate_hrefs not being too attached to the whole bucket/cloud-storage paradigm.

@cholmes (Contributor, Author) commented Jan 4, 2021

Two things here:

  • Use of non-HTTP URLs
  • Alternate URLs

We want to provide real recommendations for next release.

cholmes added the prio: must-have (required for release) label on Jan 4, 2021
@cholmes (Contributor, Author) commented Jan 19, 2021

@matthewhanson - I can take on the work of writing this up, but I need a clearer idea of what exactly to say. Others, please weigh in as well - I'm happy to try to write this up, but I don't have deep experience with STAC and cloud locations.

I noted a bit from our call. My questions:

  • For use of non-http URLs - do we want to call out explicitly in the main spec that you are allowed to use them? And then do we want to recommend that if you are using requester pays you should use those, and not http?
  • Alternate URLs - do we want to add a specific field for this? An extension?
  • Do we still want a requester pays best practice that says if you are doing requester pays, then an s3:// style URL should be used rather than http?
  • Do we want a cloud_storage extension, with platform, region, and requester pays?

@jflasher commented

In addition to the fields mentioned above, I think having something like storage_class would also be useful. I think we'll see datasets in the future that have a mix of warm and cold storage. You'd still want the metadata for the data in cold storage but it'd be beneficial to know that the data will not be immediately available.

Talking myself out of the above: data generally gets brought out of cold storage for some period of time and then returned, so its storage_class is not constant. If the STAC entry isn't updated when the data is brought out of cold storage, this field becomes less useful. The likely pattern without this field (or if it's not updated) is that you'd 1) request the object, 2) get a message that says it's not available, and then 3) follow some other step to bring it out of cold storage. An up-to-date field here likely just lets you skip step 1.
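On S3 that availability check could be derived from two fields of a head_object response: StorageClass, and the Restore header that appears once a restore is requested (formatted like `ongoing-request="false", expiry-date="..."`). A rough sketch, with the cold-class list deliberately incomplete:

```python
def is_immediately_available(storage_class=None, restore=None):
    """Rough availability check from S3 head_object response fields.

    storage_class: the object's StorageClass; None or 'STANDARD' means warm.
    restore: the raw Restore header string, present once a restore
             has been requested, e.g. 'ongoing-request="false", ...'.
    """
    cold_classes = {"GLACIER", "DEEP_ARCHIVE"}  # not exhaustive
    if storage_class not in cold_classes:
        return True
    # A cold object is readable only once a restore has completed.
    return restore is not None and 'ongoing-request="false"' in restore
```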

Also, I think it would definitely be good to include region. I presume we'd want to use the platform-specific region designations? That'll be less meaningful to someone using a different platform, but a) it's likely not of interest to them anyway, and b) it doesn't seem like STAC's role to somehow unify those designations.

@cholmes (Contributor, Author) commented Jan 20, 2021

Storage class does seem like a good option to have. Is there a generic / cross cloud way to refer to them? I'm not deep on the options and how they map across clouds. Perhaps we'd have a little table that maps the generic name to the names on each of the major services.

Region I agree we'd need platform specific designations.

If anyone has time to write up a PR, even a draft one, on the extension that'd be much appreciated, as I've got a backlog for 1.0-RC1 stuff. I guess as an extension this doesn't need to be done by RC1, but it'd be nice to have.

@jflasher commented

Thinking of this a little more, maybe the important thing isn't tracking the storage class itself? So maybe something like immediately_available: T/F or retrieval_needed: T/F. While storage_class seems useful, I feel like it may put some effort on the user to figure out what a given storage class means.

@davidraleigh (Contributor) commented

@cholmes what's the timeline for writing up a PR? I'm a little bogged down for the next week and a half, but I could put more thought into it after that.

I would like a bitmask enum that I can use on the Asset and StacItem level, that has provider storage level information. I could search for all data that's currently on nearline and prepare to move it to coldline (using GCP terms for a minute). We have STAC items in multiple cloud providers, so a bitmask would allow me to look at what's nearline in AWS and coldline in GCP. And then on the asset level itself I could use the enum to define the status of the item.
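A sketch of that bitmask idea using Python's IntFlag. Everything here is hypothetical (the names are just GCP/AWS terms reused generically, not part of any STAC spec), but it shows how one value can combine a provider bit with a tier bit and be filtered on either axis:

```python
from enum import IntFlag, auto

class StorageFlag(IntFlag):
    """Hypothetical provider/tier bitmask, not part of any STAC spec."""
    # provider bits
    AWS = auto()
    GCP = auto()
    AZURE = auto()
    # storage-tier bits (GCP-style names, used generically)
    STANDARD = auto()
    NEARLINE = auto()
    COLDLINE = auto()

# One asset's status: a provider bit OR'd with a tier bit.
asset_flags = StorageFlag.GCP | StorageFlag.NEARLINE

# Search-style checks: "what's nearline?" and "is it in AWS?"
print(bool(asset_flags & StorageFlag.NEARLINE))  # True
print(bool(asset_flags & StorageFlag.AWS))       # False
```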

@m-mohr (Collaborator) commented Jan 20, 2021

FYI: In STAC Index I have three classes of availability: public (accessible without any authentication), protected (authentication required for data access, but metadata accessible to all) and private (authentication required for everything and/or only accessible to some groups, e.g. you must sign a contract first, you must be living in a specific country (geo-fenced), or be working for a federal government).

I presume we'd want to use the platform-specific region designations?

Yes, I think so, too. You usually can't search for that anyway because it's hidden in assets.

@cholmes (Contributor, Author) commented Jan 20, 2021

So maybe something like immediately_available:T/F or retrieval_needed

I like the direction toward more generic. But once we get here I start thinking about the general use case of 'ordering' data - providers (like Planet) generate the GeoTIFFs on demand. It'd be great to cover that too, and retrieval_needed vs immediately_available seem like they'd work. It probably wouldn't be a 'cloud storage' extension, but perhaps a pair of extensions - one on 'asset availability' or something like that, and one on cloud storage, with regions and requester pays. The asset availability extension would hopefully cover @m-mohr's public/protected/private as well.

@cholmes what's the timeline for writing up a PR?

@davidraleigh - I doubt I'll get to it in the next two weeks, so if you could do it within that time frame that'd be great.

@m-mohr (Collaborator) commented Jan 21, 2021

By the way, there are related issues for accessing and ordering data: #836 and #891

@davidraleigh (Contributor) commented

@cholmes I can make an attempt at a pull request this week

@cholmes (Contributor, Author) commented Feb 8, 2021

@davidraleigh - awesome! Be warned, we are going to move most of the extensions out of the core repo soon; see #946. But feel free to make a PR here, it'll just probably be applied to another repo.

@cholmes (Contributor, Author) commented Feb 23, 2021

Circling back on this - we've got a lot of great energy on the cloud storage extension. But I don't think we need it for 1.0.0, as STAC works fine without it, and it'll be a nice addition to have as an extension (I'm not set on that, but I would want to hear a good argument).

But what do we want to actually say in the spec itself? Should we call out the use of s3:// style URLs in a best practice? And say that those are recommended when data is requester pays? And also recommend that people don't put their STAC metadata in requester-pays buckets?

@cholmes (Contributor, Author) commented Feb 25, 2021

For the main best practices:

  • put your metadata in non-requester-pays buckets
  • use the storage extension
  • if there is no publicly available HTTP URL, use a URL with the right protocol for the file (s3://, gs://)

cholmes self-assigned this on Feb 25, 2021
@cholmes (Contributor, Author) commented Mar 2, 2021

Closing this, though note if we do get a storage extension soon then we should link to it from this best practice.

cholmes closed this as completed on Mar 2, 2021

6 participants