This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Add s3 bucket support #70

Merged 16 commits on Dec 20, 2022
29 changes: 20 additions & 9 deletions README.rst
@@ -68,16 +68,19 @@ See also:
PUDL Catalog Usage
------------------

-Accessing Public Cloud Data with Requester Pays
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Installation
+~~~~~~~~~~~~
+You can install the PUDL Catalog using conda:

-To control the cost of distributing potentially large quantities of public data, we are
-using "requester pays." This means that whoever is downloading the data pays the modest
-data egress fees. If you're not familiar with GCP or requester pays, we've written a
-short guide to setting up a GCP project to work with this type of data. See the
-`PUDL Catalog documentation <https://catalystcoop-pudl-catalog.readthedocs.io/en/latest/>`__.
-for details, and also the
-`GCP Documentation on accessing Requester Pays data <https://cloud.google.com/storage/docs/using-requester-pays#using>`__
+.. code:: text
+
+   conda install -c conda-forge catalystcoop.pudl
+
+or pip:
+
+.. code:: text
+
+   pip install catalystcoop.pudl-catalog

Import the Intake Catalogs
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -184,6 +187,11 @@ types:
'path': 'https://creativecommons.org/licenses/by/4.0'},
'catalog_dir': '/home/zane/code/catalyst/pudl-catalog/src/pudl_catalog/'}}

+.. note::
+
+   If the data has not been cached this method might take a while to finish depending
+   on your internet speed. The EPA CEMS parquet data is almost 5 GB.
+
Read some data from the catalog
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -390,3 +398,6 @@ Funding
This work is supported by a generous grant from the `Alfred P. Sloan Foundation
<https://sloan.org/>`__ and their `Energy & Environment Program
<https://sloan.org/programs/research/energy-and-environment>`__
+
+Storage and egress fees for this data are covered by `Amazon Web Services's
+Open Data Sponsorship Program <https://aws.amazon.com/opendata/open-data-sponsorship-program/>`__.
1 change: 0 additions & 1 deletion docs/index.rst
@@ -9,7 +9,6 @@ The PUDL Data Catalog
:hidden:
:maxdepth: 2

-Requester Pays <requester_pays>
Licensing <LICENSE>
Code of Conduct <code_of_conduct>
Release Notes <release_notes>
268 changes: 0 additions & 268 deletions docs/requester_pays.rst

This file was deleted.

4 changes: 3 additions & 1 deletion setup.py
@@ -46,6 +46,8 @@
zip_safe=False,
python_requires=">=3.8,<3.12",
install_requires=[
+"boto3==1.24.59",
+"s3fs>=2021.7,<2022.11.1",
"gcsfs>=2021.7,<2022.11.1",
"intake_parquet>=0.2.3,<0.3",
"intake_sqlite>=0.2.0",
@@ -71,7 +73,7 @@
"doc8>=0.9,<1.1", # Ensures clean documentation formatting
"flake8>=4,<7", # A framework for linting & static analysis
"flake8-builtins>=1.5,<3", # Avoid shadowing Python built-in names
-"flake8-colors>=0.1,<0.2", # Produce colorful error / warning output
+"flake8-colors>=0.1.9,<0.2", # Produce colorful error / warning output
"flake8-docstrings>=1.5,<2", # Ensure docstrings are formatted well
"flake8-rst-docstrings>=0.2,<0.4", # Allow use of ReST in docstrings
"flake8-use-fstring>=1,<2", # Highlight use of old-style string formatting
7 changes: 4 additions & 3 deletions src/pudl_catalog/__init__.py
@@ -13,10 +13,11 @@

INTAKE_BUCKET = "intake.catalyst.coop"
# Ideally we would set this version automatically using setuptools_scm...
-CATALOG_VERSION = "v0.1.1"
+CATALOG_VERSION = "dev"

BASE_URLS = {
"gs": f"gs://{INTAKE_BUCKET}/{CATALOG_VERSION}",
+"s3": f"s3://{INTAKE_BUCKET}/{CATALOG_VERSION}",
# HTTPS access doesn't really work well, so we're hiding it from users for now.
"https": f"https://storage.googleapis.com/{INTAKE_BUCKET}/{CATALOG_VERSION}",
}
@@ -25,9 +26,9 @@
if os.getenv("PUDL_INTAKE_PATH") is None:
logger.info(
"Environment variable PUDL_INTAKE_PATH is not set. "
-f"Defaulting to {BASE_URLS['gs']}"
+f"Defaulting to {BASE_URLS['s3']}"
)
-os.environ["PUDL_INTAKE_PATH"] = BASE_URLS["gs"]
+os.environ["PUDL_INTAKE_PATH"] = BASE_URLS["s3"]

if os.getenv("PUDL_INTAKE_CACHE") is None:
logger.info(
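The effect of the `src/pudl_catalog/__init__.py` change on the default catalog location can be sketched in isolation. This is a minimal illustration rather than the package's actual code: the constants are copied from the diff, while `resolve_intake_path` and its `environ` parameter are hypothetical names introduced here so the behavior can be exercised without touching the real process environment.

```python
import os

# Values copied from the diff of src/pudl_catalog/__init__.py.
INTAKE_BUCKET = "intake.catalyst.coop"
CATALOG_VERSION = "dev"

BASE_URLS = {
    "gs": f"gs://{INTAKE_BUCKET}/{CATALOG_VERSION}",
    "s3": f"s3://{INTAKE_BUCKET}/{CATALOG_VERSION}",
}


def resolve_intake_path(environ=None):
    """Return PUDL_INTAKE_PATH, defaulting to the S3 base URL when unset.

    Hypothetical helper (not part of pudl_catalog). It mirrors the
    `os.getenv(...) is None` check in the diff: the default is applied
    only when the variable is absent, and a user-supplied value wins.
    """
    env = os.environ if environ is None else environ
    # setdefault stores the S3 URL only if the key is missing.
    return env.setdefault("PUDL_INTAKE_PATH", BASE_URLS["s3"])


if __name__ == "__main__":
    # With an empty mapping the S3 default is chosen.
    print(resolve_intake_path({}))
    # An explicit setting, e.g. the GCS URL, is left untouched.
    print(resolve_intake_path({"PUDL_INTAKE_PATH": BASE_URLS["gs"]}))
```

Under this sketch, anyone who wants to keep reading from the Google Cloud Storage bucket would set `PUDL_INTAKE_PATH` to the `gs://` URL before importing the package. Whether that bucket stays available after this change is not stated in the diff.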