Best Practice: requester pays #896
Comments
Is this just for S3? Would a person buying data at Planet also be "requester pays"? Or how exactly is that defined outside of S3?
This is a field I have on the gRPC STAC version of assets. It's also used in Google Cloud, and I imagine it also exists in Azure.
I think it'd definitely be good to have requester pays called out in the metadata, as it presents a technical and financial difference in how you access the data. I have tried to create the signatures myself for use with straight HTTP requests, but always fall back on the available SDKs. Also, at least for AWS, there are two costs incurred with requester pays: egress and a per-request fee. The per-request fee is generally very small compared to the egress cost, but this is not always the case (specifically when listing bucket contents), and it should likely be mentioned for completeness.
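To make that access difference concrete, here is a minimal sketch (not from the thread itself) of listing a requester-pays bucket with boto3; the bucket name is hypothetical, and each call like this is billed to the caller's account:

```python
import boto3

# Sketch only: list objects in a requester-pays S3 bucket.
# Without RequestPayer="requester" the request is rejected (403 AccessDenied).
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="example-requester-pays-bucket",  # hypothetical bucket name
    Prefix="stac/",
    RequestPayer="requester",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```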
S3 and Google have requester pays; Azure apparently does not. Overall, I think these concepts are cross-provider (i.e., not only S3) and useful enough to warrant an extension. I like some of the fields in @davidraleigh's link.
A couple of years ago we talked about "storage profiles" for STAC to describe some of these things, but nothing ever came of it. I think a "cloud_storage" extension is warranted (or maybe just "cloud"). It can be set in Item properties, but could also be set per asset using the general asset-specific metadata rule (see the sketch below for what that might look like).
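For illustration, per-asset fields in such an extension might look roughly like this; every "cloud:"-prefixed field name is hypothetical, since no names were settled in this thread, and the Item is truncated to the relevant parts:

```python
# Hypothetical cloud storage metadata on a (truncated) STAC Item.
# All "cloud:" field names are illustrative, not a published extension.
item = {
    "type": "Feature",
    "id": "example-item",
    "properties": {
        # Item-level defaults, inherited by assets...
        "cloud:platform": "AWS",      # assumption: provider identifier
        "cloud:region": "us-west-2",  # provider-specific region string
    },
    "assets": {
        "image": {
            "href": "s3://example-bucket/example-item/image.tif",
            "type": "image/tiff; application=geotiff",
            # ...overridden per asset where they differ.
            "cloud:requester_pays": True,
        }
    },
}
```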
I'd avoid putting in the bucket and object path; converting between s3 and HTTP URLs is easy enough, and it would be good to avoid duplication.
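For illustration, that conversion can be a one-liner; this sketch assumes virtual-hosted-style addressing and that the caller already knows the bucket's region:

```python
def s3_to_https(s3_url: str, region: str) -> str:
    """Convert s3://bucket/key to a virtual-hosted-style HTTPS URL.

    Illustrative only: assumes the default AWS endpoint layout.
    """
    bucket, _, key = s3_url[len("s3://"):].partition("/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

# s3_to_https("s3://example-bucket/stac/catalog.json", "us-west-2")
# -> "https://example-bucket.s3.us-west-2.amazonaws.com/stac/catalog.json"
```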
We use STAC a lot internally.
@davidraleigh Ah, so this is really a case where you might have multiple URLs to the same assets. I could see an "alternate_hrefs" array in assets for something like this, if we wanted it to be more general. This would also be able to represent actual data mirrors.
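A small sketch of how such an array might look on an asset; "alternate_hrefs" is the name floated in this comment, not a finalized field, and all URLs are made up:

```python
# Hypothetical asset with one canonical href plus mirrors of the same file.
asset = {
    "href": "s3://example-bucket/scene/image.tif",  # canonical location
    "type": "image/tiff; application=geotiff",
    "alternate_hrefs": [                            # proposed, not standardized
        "https://example-bucket.s3.us-west-2.amazonaws.com/scene/image.tif",
        "gs://example-mirror/scene/image.tif",      # mirror on another cloud
    ],
}
```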
I'm stumped as to which is the clearest method.
Two things here:
- We want to provide real recommendations for the next release.
@matthewhanson - I can take on the work of writing this up, but I need a clearer idea of what exactly to say. Others, please weigh in as well - I'm happy to try to write this up, but I don't have deep experience with STAC and cloud locations. I noted a bit from our call.
In addition to the fields mentioned above, I think having something like a storage-class field could be useful. Talking myself out of the above: data generally gets brought out of cold storage for some period of time and then returned, so its storage class would keep changing and the metadata would go stale. Also, I think it would definitely be good to include region. I presume we'd want to use the platform-specific region designations? That will be less meaningful to someone using a different platform, but a) it's likely not of interest to them anyway, and b) it doesn't seem like STAC's role to somehow unify those designations.
Storage class does seem like a good option to have. Is there a generic / cross-cloud way to refer to them? I'm not deep on the options and how they map across clouds. Perhaps we'd have a little table that maps the generic name to the names on each of the major services (see the sketch below). For region, I agree we'd need platform-specific designations. If anyone has time to write up a PR, even a draft one, on the extension, that'd be much appreciated, as I've got a backlog of 1.0-RC1 stuff. I guess as an extension this doesn't need to be done by RC1, but it'd be nice to have.
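As a rough illustration of such a mapping (not from the thread): the generic names in the first column are invented, the provider names are the classes the services offered around this time, and the mapping is approximate since Azure has only access tiers rather than distinct cold/archive classes:

| Generic name | AWS S3 | Google Cloud Storage | Azure Blob |
|---|---|---|---|
| online | STANDARD | STANDARD | Hot |
| infrequent | STANDARD_IA | NEARLINE | Cool |
| cold | GLACIER | COLDLINE | Archive |
| archive | DEEP_ARCHIVE | ARCHIVE | Archive |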
Thinking of this a little more, maybe it's important not to get to the point of tracking storage class? So maybe something like retrieval_needed vs immediately_available?
@cholmes what's the timeline for writing up a PR? I'm a little bogged down for the next week and a half, but I could put more thought into it after that. I would like a bitmask enum (sketched below) that I can use at the Asset and StacItem level and that carries provider storage-level information. I could search for all data that's currently on nearline and prepare to move it to coldline (using GCP terms for a minute). We have STAC items in multiple cloud providers, so a bitmask would allow me to look at what's nearline in AWS and coldline in GCP. And then on the asset level itself I could use the enum to define the status of the item.
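A minimal sketch of the bitmask idea, assuming invented flag names (the thread never defined them): one provider bit plus one tier bit combine into a single searchable integer.

```python
from enum import IntFlag, auto

class StorageFlag(IntFlag):
    # Providers (hypothetical names).
    AWS = auto()
    GCP = auto()
    AZURE = auto()
    # Tiers (generic names, not any provider's official class names).
    HOT = auto()
    NEARLINE = auto()
    COLDLINE = auto()
    ARCHIVE = auto()

# An asset currently on GCP nearline storage.
asset_state = StorageFlag.GCP | StorageFlag.NEARLINE

# Filter: find nearline-on-GCP assets to prepare for a move to coldline.
if StorageFlag.GCP in asset_state and StorageFlag.NEARLINE in asset_state:
    print("candidate for moving to coldline")
```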
FYI: In STAC Index I have three classes of availability: public (accessible without any authentication), protected (authentication required for data access, but metadata accessible to all), and private (authentication required for all, and/or only accessible to some groups, e.g. you must sign a contract first, be living in a specific country (geo-fenced), or be working for a federal government).
Yes, I think so, too. You usually can't search for that anyway because it's hidden in assets.
I like the direction of going more generic. But once we get here, I start thinking about the general use case of "ordering" data - providers (like Planet) generate the GeoTIFFs on demand. It'd be great to cover that too, and retrieval_needed vs immediately_available seem like they'd work. It probably wouldn't be a "cloud storage" extension, but perhaps a pair of extensions - one on "asset availability" or something like that, and one on "cloud storage" with regions and requester pays. The asset availability one would hopefully also cover @m-mohr's public/protected/private.
@davidraleigh - I doubt I'll get to it in the next two weeks, so if you could do it within that time frame, that'd be great.
@cholmes I can make an attempt at a pull request this week.
@davidraleigh - awesome! Be warned, we are going to move most of the extensions out of the core repo soon; see #946. But feel free to make a PR here - it'll just probably be applied to another repo.
Circling back on this - we've got a lot of great energy on the cloud storage extension. But I don't think we need that for 1.0.0, as STAC works fine without it, and it'll be a nice addition to have as an extension (I'm not set on that, but would want to hear a good argument). But what do we want to actually say in the spec itself? Should we call out the use of s3:// style URLs in a best practice? And say that those are recommended when data is requester pays? And also recommend that people don't put their STAC metadata in requester-pays buckets?
For main best practices:
Closing this, though note that if we do get a storage extension soon, we should link to it from this best practice.
As suggested on Gitter by @matthewhanson - it'd be good to have a best practice on URLs that are "requester pays". We should capture these thoughts and put them in the best practices.
'For requester-pays URLs I've been using the s3 URL, e.g., s3://syncarto-data-rp/stac/naip/catalog.json
Since the HTTP URL is useless on its own unless you sign it, working with the s3 URLs directly (with the AWS CLI or boto3) is easier, I think. Plus you can use PySTAC to support s3 reads/writes.
If public, then I use the actual HTTP URL.
This might be a good thing to add to the best practices.'
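As a concrete illustration of the s3-URL workflow described above (a sketch, not code from the thread), fetching that catalog with boto3 requires opting in to paying for the request:

```python
import json
import boto3

# Sketch: read a STAC catalog from a requester-pays bucket.
# Bucket and key are taken from the example URL above.
s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="syncarto-data-rp",
    Key="stac/naip/catalog.json",
    RequestPayer="requester",  # opt in to paying the request and egress costs
)
catalog = json.loads(resp["Body"].read())
print(catalog["id"])
```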
'Even better might be to keep the STAC metadata in a different, and completely public, bucket that isn't requester pays. Normally I like the data alongside the STAC Items, but I think it's better if the metadata is public. That way you can use tools like STAC Browser and PySTAC without authentication for just the metadata.'