This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Replace storage_option with AWS S3 bucket #49

Closed
18 tasks done
bendnorman opened this issue Sep 26, 2022 · 7 comments · Fixed by #70

@bendnorman
Member

bendnorman commented Sep 26, 2022

PUDL has been accepted to the Open Data Sponsorship Program on AWS, which covers the storage and egress fees of S3 buckets that contain our data. This is great news because our users won't have to set up a GCP account to deal with requester-pays billing.

Tasks:

  • Follow the onboarding steps for the AWS program.
  • Update the base URLs in the intake catalog.
  • Update the nightly build script to upload the outputs to S3.
  • Add AWS credential secrets to GitHub.
  • Share the program with OpenAddresses.
  • Update CATALOG_VERSION to v2022.11.30.
  • Copy the v2022.11.30 outputs to the AWS bucket.
  • Add install instructions to pudl-catalog.
  • Add a link to the readthedocs site on the pudl-catalog GitHub page.
  • Try interacting with the intake catalog using the S3 bucket.
  • Update the documentation explaining the AWS bucket. Remove the requester-pays documentation.
  • Figure out how to download objects from the S3 bucket for internal use. Add to the PUDL documentation.
  • Double-check that the logs are working as expected (interactions from intake are logged with IP, size, and file). See Log fields.
  • Get an MFA recovery code and make sure there is a second MFA method.
  • Give Zane awsopendata credentials.
  • Create a tutorial on how to use the intake catalogs in AWS. Can we just say: run this code in a Jupyter notebook running in EMR Studio? Maybe include Athena queries from the Parquet files? See example.
  • Add a yaml file to the AWS open data GitHub repository.
  • Add S3 integration tests to pudl-catalog.
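Two of the tasks above (interacting with the catalog via S3, and downloading objects for internal use) hinge on the Open Data bucket allowing anonymous reads. A minimal sketch of building a credential-free download URL; the bucket name and object key below are assumptions, not confirmed names:

```python
from urllib.parse import quote

# Hypothetical bucket name; substitute the real bucket once onboarding is done.
BUCKET = "pudl.catalyst.coop"


def public_s3_url(bucket: str, key: str) -> str:
    """Build a path-style HTTPS URL for an object in a public S3 bucket.

    Open Data Program buckets allow unsigned GETs, so any plain HTTP client
    (curl, urllib.request.urlretrieve, etc.) can fetch objects without AWS
    credentials or requester-pays headers. Path-style addressing is used here
    because virtual-hosted-style URLs hit TLS wildcard-certificate problems
    when the bucket name contains dots, as the example above does.
    """
    return f"https://s3.amazonaws.com/{bucket}/{quote(key)}"


print(public_s3_url(BUCKET, "pudl.sqlite"))
# https://s3.amazonaws.com/pudl.catalyst.coop/pudl.sqlite
```

From there, a download is just an ordinary HTTP GET against that URL; no AWS account required on the user's side.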

AWS Notes

AWS MFA Notes

@bendnorman bendnorman self-assigned this Sep 26, 2022
@zaneselvans
Member

Notes on the AWS Open Data Program terms and conditions:

(g) provide AWS with information reasonably requested concerning End User use of Program Content.

Do we have any idea what this would entail?

Your participation in the Program will terminate two (2) years from the Effective Date.

I think this argues for continuing to ensure that the data is also deposited in the intake.catalyst.coop bucket, and that the storage URL is configurable, so that current and past versions of the data will be available long term regardless of how AWS changes the program.

You may not issue any press release or other public statement with respect to your participation in the Program unless we approve in advance and in writing.

This seems kind of ridiculous. Do we need to get their permission to mention it in the README of the pudl-catalog repo or our documentation? Or to write a blog post? Or to tweet about it? I guess it doesn't particularly matter that much as long as we can mysteriously provide free access to all of the data outputs.

@zaneselvans
Member

Maybe we should have a quick board meeting item with notes tomorrow to give you the power to enter into The Agreement.

@bendnorman
Member Author

Just emailed them with your questions. I'll add this to tomorrow's board meeting agenda if they respond in time.

I think they only need to approve press releases, but they don't need to approve supporting social media posts as long as those follow the PR guidelines:
[screenshot of the PR guidelines]

@bendnorman
Member Author

AWS response:

On twitter and documentation:

Hi Ben, B.1 in https://assets.opendata.aws/aws-onboarding-handbook-for-data-providers-en-US.pdf should help answer the first question, but no need to ask us for permission to talk about your participation on Twitter or documentation.

Can we assume we'll get a renewal after two years?

For renewals, we do not post a percentage, but if we did, it’d be close to 100%. You can see a number of the datasets listed at https://registry.opendata.aws have been there for some time and are not disappearing because we are not renewing. Once data is made available through the program, it’s a bad experience to have it removed if users are depending on it. So while we retain the right not to renew, we generally renew.

What does "(g) provide AWS with information reasonably requested concerning End User use of Program Content." mean?

To your last question, from time to time we may make requests for use cases around the data usage to help us show the value of the data publicly. If you can share, great! If not, it’s not a problem. Even though the language is a bit vague, we actually cannot even accept any more detailed information (like detailed usage by customer, if you had it) without some sort of data sharing agreement in place.

@bendnorman bendnorman linked a pull request Dec 15, 2022 that will close this issue
@bendnorman
Member Author

bendnorman commented Dec 15, 2022

Something I didn't consider was the egress fees from the GCP VMs to the AWS bucket. Each time a nightly build succeeds we'll need to copy the outputs from the VM to the AWS bucket. We currently output about 11 GB of data. Premium network egress pricing is $0.12/GB and standard is $0.085/GB (the VM network is set to premium by default, but this can be changed). Assuming all of our nightly builds succeed, aws cp doesn't do any compression, and we use the standard network, we'll end up spending about $250 on egress in a year. I think this is a reasonable price to pay for our users to have free access to the data and for us to continue to use our GCP nightly builds. An alternative would be to migrate our nightly build infrastructure to AWS, but that doesn't feel worth it.
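The estimate above can be sanity-checked with quick arithmetic; the build cadence is the assumption that moves the number (roughly 260 weekday builds per year lands near the ~$250 figure, while a build every night comes out higher):

```python
# Back-of-envelope egress cost for copying nightly build outputs from a
# GCP VM to the AWS bucket, using the prices quoted above.
GB_PER_BUILD = 11
STANDARD_PER_GB = 0.085  # $/GB, GCP standard-tier egress
PREMIUM_PER_GB = 0.12    # $/GB, GCP premium-tier egress (the VM default)


def annual_egress_cost(price_per_gb: float, builds_per_year: int) -> float:
    """Dollars per year spent copying each successful build's outputs out of GCP."""
    return GB_PER_BUILD * price_per_gb * builds_per_year


# Weekday-only builds (~260/yr) on the standard tier land near the estimate;
# a build every night (365/yr) costs more, and the premium tier more still.
print(round(annual_egress_cost(STANDARD_PER_GB, 260)))  # 243
print(round(annual_egress_cost(STANDARD_PER_GB, 365)))  # 341
print(round(annual_egress_cost(PREMIUM_PER_GB, 365)))   # 482
```

Either way the total stays in the low hundreds of dollars per year, which supports the conclusion that switching the VM off the premium tier is worthwhile but migrating the builds to AWS is not.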

@zaneselvans
Member

@bendnorman were there still tasks under this issue that need to be completed?

@bendnorman
Member Author

This is done for now. I will hold off on adding PUDL to the AWS quarterly newsletter until the catalog is more usable.
