
Update nightly build script to distribute parquet #3399

Merged: 3 commits merged into main on Feb 15, 2024

Conversation

zaneselvans (Member)

Overview

Distribute parquet outputs to a private bucket using the same path conventions that we're using in the public buckets for the other outputs.

Closes #3362

Testing

I wish there were an easy way to test this besides running the nightly builds and having them fail repeatedly.

@zaneselvans added the output (Exporting data from PUDL into other platforms or interchange formats.), release (Tasks directly related to data and software releases.), parquet (Issues related to the Apache Parquet file format which we use for long tables.), and nightly-builds (Anything having to do with nightly builds or continuous deployment.) labels on Feb 13, 2024
@zaneselvans self-assigned this on Feb 13, 2024
Comment on lines +82 to +84
# Only attempt to update outputs if we have an argument
# This avoids accidentally blowing away the whole bucket if it's not set.
if [[ -n "$1" ]]; then
zaneselvans (Member, Author):

I think if we called this function without an argument it would delete all of our previous releases, which would be bad, so I added this small margin of safety. Maybe we should automatically make the versioned releases read-only as soon as we create them so this can't happen.

stable should always point at a versioned release, so replacing it if need be wouldn't be hard. nightly is ephemeral, so it's not a disaster if it accidentally gets wiped out. And the build outputs for recent nightly builds are still lying around, so we could repopulate it manually.
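To make the failure mode explicit rather than silent, the guard could fail fast with an error instead of quietly skipping the upload. A minimal sketch under that idea (`upload_outputs` and the bucket path are hypothetical names for illustration, not the real script's):

```shell
# Hypothetical sketch: refuse to run, loudly, when the destination path
# argument is missing, instead of silently doing nothing.
function upload_outputs() {
    if [[ -z "$1" ]]; then
        echo "ERROR: no distribution path given; refusing to touch the bucket" >&2
        return 1
    fi
    echo "Would upload to gs://example-dist-bucket/$1"
}
```

Failing with a nonzero status also lets the calling script surface the mistake in the build logs rather than proceeding as if the upload happened.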

bendnorman (Member):

Thanks for catching this!

It should be possible to make the versioned releases read-only. I created an issue for it.

zaneselvans (Member, Author):

Great. Hopefully this is easy? I'd like to add this to the process / script for our next release.

Comment on lines 260 to 262
# Distribute Parquet outputs to a private bucket
distribute_parquet 2>&1 | tee -a "$LOGFILE"
DISTRIBUTE_PARQUET=${PIPESTATUS[0]}
zaneselvans (Member, Author):

This has to happen before we clean up the outputs, because the cleanup removes all the parquet/*.parquet files there.
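For readers skimming the snippet above: `${PIPESTATUS[0]}` is what lets the script record `distribute_parquet`'s exit status rather than `tee`'s, since `$?` after a pipeline reports only the last command. A minimal bash illustration:

```shell
# In a pipeline, $? reports the LAST command's status (tee's, usually 0).
# Bash's PIPESTATUS array preserves the exit status of every stage.
false | tee -a /dev/null
echo "first stage exited with: ${PIPESTATUS[0]}"   # prints 1, false's status
```

Note that `PIPESTATUS` must be read immediately after the pipeline; running any other command resets it.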

Comment on lines +102 to 124
function distribute_parquet() {
PARQUET_BUCKET="gs://parquet.catalyst.coop"
# Only attempt to update outputs if we have a real value of BUILD_REF
# This avoids accidentally blowing away the whole bucket if it's not set.
echo "Copying outputs to parquet distribution bucket"
if [[ -n "$BUILD_REF" ]]; then
if [[ "$GITHUB_ACTION_TRIGGER" == "schedule" ]]; then
# If running nightly builds, copy outputs to the "nightly" bucket path
DIST_PATH="nightly"
else
# Otherwise we want to copy them to a directory named after the tag/ref
DIST_PATH="$BUILD_REF"
fi
echo "Copying outputs to $PARQUET_BUCKET/$DIST_PATH" && \
gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/$DIST_PATH"

# If running a tagged release, ALSO update the stable distribution bucket path:
if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
echo "Copying outputs to $PARQUET_BUCKET/stable" && \
gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/stable"
fi
fi
}
zaneselvans (Member, Author):

First I tried moving the wholesale removal of the parquet/ directory from the output cleanup into the deployment function, but this didn't work because we need to deploy twice for a versioned release (once to the versioned path, and once to stable).

Then I tried more selectively deploying outputs in the existing deployment function, instead of just copying everything in the directory, but this didn't work because the S3 CLI (incredibly) can't understand wildcards and there was no simple way to specify the set of parquet vs. non-parquet files for separate distribution.

Finally I made this terrible cut-and-paste copy of the deployment function which gets called before the output cleanup.

Is there a simple better way? Short of re-writing the entire thing in Python and bailing on these CLIs?
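One possible shape for reducing the duplication, sketched here under the assumption that the destination path is the only thing that varies between calls (`copy_parquet_to` is a hypothetical name; `GCP_BILLING_PROJECT` and `PUDL_OUTPUT` are assumed from the existing script):

```shell
# Hypothetical helper: one parquet copy routine, parameterized on destination,
# so nightly / versioned / stable paths all share the same guarded code path.
function copy_parquet_to() {
    local dest="$1"
    if [[ -z "$dest" ]]; then
        echo "no destination given; refusing to copy" >&2
        return 1
    fi
    echo "Copying parquet outputs to $dest"
    gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$dest"
}

# Callers then pick the paths, and the copy logic lives in one place:
# copy_parquet_to "gs://parquet.catalyst.coop/nightly"
# copy_parquet_to "gs://parquet.catalyst.coop/$BUILD_REF"
# copy_parquet_to "gs://parquet.catalyst.coop/stable"
```

This keeps the two-destination case (versioned plus stable) as two cheap calls instead of a second copy of the function.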

bendnorman (Member):

Which deployment function are you referring to? Also, the s3 CLI not understanding wildcards is incredibly frustrating.

zaneselvans (Member, Author), Feb 14, 2024:

I was referring to copy_outputs_to_distribution_bucket() which calls upload_to_dist_path() twice with different arguments if we're doing a versioned (aka stable) release.

I know, I was in disbelief on the S3 thing. They have pages of documentation on how to include/exclude files, so why not just support the Unix wildcards? Anyway, it's yet another indication that we should probably be doing this work from within Python, with libraries.
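For what it's worth, the `aws s3` CLI's sanctioned route is its `--exclude`/`--include` filters rather than shell globs. A sketch of selecting only the parquet files that way (function name and bucket path are illustrative, not our real layout):

```shell
# Hypothetical: copy only parquet files using the aws s3 CLI's filter flags.
# Shell-style wildcards in the source path are NOT expanded by aws s3;
# instead you exclude everything, then re-include the pattern you want.
function copy_parquet_via_s3() {
    aws s3 cp "$PUDL_OUTPUT/" "$1" \
        --recursive \
        --exclude "*" \
        --include "parquet/*.parquet"
}
```

The filters are evaluated in order against paths relative to the source directory, which is why the blanket `--exclude "*"` has to come before the `--include`.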

@zaneselvans marked this pull request as ready for review on February 13, 2024 at 23:15
bendnorman (Member) left a review:

Looks good! I gave the deploy-pudl-vm-service-account the Storage Admin role on the gs://parquet.catalyst.coop bucket so it can create and delete files.
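For the record, a grant like that can be expressed with `gsutil iam ch`. A sketch only, since the service account's GCP project isn't named in this thread and is left as a placeholder argument:

```shell
# Hypothetical sketch: grant Storage Admin on the parquet bucket to the
# build service account. The project ID argument ($1) is a placeholder.
function grant_parquet_bucket_admin() {
    gsutil iam ch \
        "serviceAccount:deploy-pudl-vm-service-account@${1}.iam.gserviceaccount.com:roles/storage.admin" \
        "gs://parquet.catalyst.coop"
}
```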

@zaneselvans added this pull request to the merge queue on Feb 14, 2024
Merged via the queue into main with commit 9d40b68 Feb 15, 2024
12 checks passed
@zaneselvans deleted the deploy-parquet branch on February 15, 2024 at 00:08
cmgosnell pushed a commit that referenced this pull request Feb 15, 2024
* Update nightly build script to distribute parquet

* Fix logging cut-and-paste error

* Name parquet distribution success variable like all the others
github-merge-queue bot pushed a commit that referenced this pull request Feb 15, 2024
* first draft of all eia860m extraction

* first draft of transform process: runs through existing 860 transform does not do changelog yet

* simplify replaces in tranform and add changelog dropdupes

* first pass of adding full transform for eia860 and schema

* Fix bad monthly expand_timeseries

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* clean up settings and add alembic migration

* fix the settings grabbing in eia860 settings with new eia860m setup

* Convert 860m table into db table

* make a new 860m settings class, dont pass in report_date for 860, & use the right table name

* remove FK relationships to the changelog table and make expand_timeseries have a dec unit test

* change eia86m io manager to our cool new db + parquet manager

* add docs and fix b4by missp3lls and change tbl name

* add migration and update fast 860m month post new 860m integration

* alembic migrations

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Fix the working partitions in settings and helpers

* Fix settings partitions and be better about selecting 860m only columns

* Update nightly build script to distribute parquet (#3399)

* Update nightly build script to distribute parquet

* Fix logging cut-and-paste error

* Name parquet distribution success variable like all the others

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zane Selvans <zane.selvans@catalyst.coop>
@zaneselvans linked an issue on Feb 16, 2024 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

- Update experimental parquet outputs in nightly builds
- Output PUDL as Parquet as well as SQLite
2 participants