
Update nightly build script to distribute parquet #3399

Merged: 3 commits merged into main on Feb 15, 2024

Conversation

zaneselvans (Member)

Overview

Distribute parquet outputs to a private bucket using the same path conventions that we're using in the public buckets for the other outputs.

Closes #3362

Testing

I wish there were an easy way to test this besides running the nightly builds and having them fail repeatedly.

@zaneselvans added the output (Exporting data from PUDL into other platforms or interchange formats.), release (Tasks directly related to data and software releases.), parquet (Issues related to the Apache Parquet file format which we use for long tables.), and nightly-builds (Anything having to do with nightly builds or continuous deployment.) labels on Feb 13, 2024
@zaneselvans self-assigned this on Feb 13, 2024
Comment on lines +82 to +84
# Only attempt to update outputs if we have an argument
# This avoids accidentally blowing away the whole bucket if it's not set.
if [[ -n "$1" ]]; then
zaneselvans (Member, Author):

I think if we called this function without an argument it would delete all of our previous releases, which would be bad, so I added this small margin of safety. Maybe we should automatically make the versioned releases read-only as soon as we create them so this can't happen.

stable should always point at a versioned release, so replacing it if need be wouldn't be hard. nightly is ephemeral, so it's not a disaster if it accidentally gets wiped out. And the build outputs for recent nightly builds are still lying around, so we could repopulate it manually.
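To make the failure mode explicit rather than silent, the guard could fail fast with an error instead of quietly skipping the upload. A minimal sketch under that idea (`upload_outputs` and the bucket path are hypothetical names for illustration, not the real script's):

```shell
# Hypothetical sketch: refuse to run, loudly, when the destination path
# argument is missing, instead of silently doing nothing.
function upload_outputs() {
    if [[ -z "$1" ]]; then
        echo "ERROR: no distribution path given; refusing to touch the bucket" >&2
        return 1
    fi
    echo "Would upload to gs://example-dist-bucket/$1"
}
```

Failing with a nonzero status also lets the calling script surface the mistake in the build logs rather than proceeding as if the upload happened.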

bendnorman (Member):

Thanks for catching this!

It should be possible to make the versioned releases read-only. I created an issue for it.

zaneselvans (Member, Author):

Great. Hopefully this is easy? I'd like to add this to the process / script for our next release.

Comment on lines 260 to 262
# Distribute Parquet outputs to a private bucket
distribute_parquet 2>&1 | tee -a "$LOGFILE"
DISTRIBUTE_PARQUET=${PIPESTATUS[0]}
zaneselvans (Member, Author):

This has to happen before we clean up the outputs, because the cleanup removes all the parquet/*.parquet files there.
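For readers skimming the snippet above: `${PIPESTATUS[0]}` is what lets the script record `distribute_parquet`'s exit status rather than `tee`'s, since `$?` after a pipeline reports only the last command. A minimal bash illustration:

```shell
# In a pipeline, $? reports the LAST command's status (tee's, usually 0).
# Bash's PIPESTATUS array preserves the exit status of every stage.
false | tee -a /dev/null
echo "first stage exited with: ${PIPESTATUS[0]}"   # prints 1, false's status
```

Note that `PIPESTATUS` must be read immediately after the pipeline; running any other command resets it.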

Comment on lines +102 to 124
function distribute_parquet() {
PARQUET_BUCKET="gs://parquet.catalyst.coop"
# Only attempt to update outputs if we have a real value of BUILD_REF
# This avoids accidentally blowing away the whole bucket if it's not set.
echo "Copying outputs to parquet distribution bucket"
if [[ -n "$BUILD_REF" ]]; then
if [[ "$GITHUB_ACTION_TRIGGER" == "schedule" ]]; then
# If running nightly builds, copy outputs to the "nightly" bucket path
DIST_PATH="nightly"
else
# Otherwise we want to copy them to a directory named after the tag/ref
DIST_PATH="$BUILD_REF"
fi
echo "Copying outputs to $PARQUET_BUCKET/$DIST_PATH" && \
gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/$DIST_PATH"

# If running a tagged release, ALSO update the stable distribution bucket path:
if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
echo "Copying outputs to $PARQUET_BUCKET/stable" && \
gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/stable"
fi
fi
}
zaneselvans (Member, Author):

First I tried moving the wholesale removal of the parquet/ directory from the output cleanup into the deployment function, but this didn't work because we need to deploy twice for a versioned release (once to the versioned path, and once to stable).

Then I tried more selectively deploying outputs in the existing deployment function, instead of just copying everything in the directory, but this didn't work because the S3 CLI (incredibly) can't understand wildcards and there was no simple way to specify the set of parquet vs. non-parquet files for separate distribution.

Finally I made this terrible cut-and-paste copy of the deployment function which gets called before the output cleanup.

Is there a simple better way? Short of re-writing the entire thing in Python and bailing on these CLIs?
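One possible shape for reducing the duplication, sketched here under the assumption that the destination path is the only thing that varies between calls (`copy_parquet_to` is a hypothetical name; `GCP_BILLING_PROJECT` and `PUDL_OUTPUT` are assumed from the existing script):

```shell
# Hypothetical helper: one parquet copy routine, parameterized on destination,
# so nightly / versioned / stable paths all share the same guarded code path.
function copy_parquet_to() {
    local dest="$1"
    if [[ -z "$dest" ]]; then
        echo "no destination given; refusing to copy" >&2
        return 1
    fi
    echo "Copying parquet outputs to $dest"
    gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$dest"
}

# Callers then pick the paths, and the copy logic lives in one place:
# copy_parquet_to "gs://parquet.catalyst.coop/nightly"
# copy_parquet_to "gs://parquet.catalyst.coop/$BUILD_REF"
# copy_parquet_to "gs://parquet.catalyst.coop/stable"
```

This keeps the two-destination case (versioned plus stable) as two cheap calls instead of a second copy of the function.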

bendnorman (Member):

Which deployment function are you referring to? Also, the s3 CLI not understanding wildcards is incredibly frustrating.

zaneselvans (Member, Author), Feb 14, 2024:

I was referring to copy_outputs_to_distribution_bucket() which calls upload_to_dist_path() twice with different arguments if we're doing a versioned (aka stable) release.

I know, I was in disbelief on the S3 thing. They have pages of documentation on how to include/exclude files, so why not just support the Unix wildcards? Anyway, it's yet another indication that we should probably be doing this work from within Python, with libraries.
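For what it's worth, the `aws s3` CLI's sanctioned route is its `--exclude`/`--include` filters rather than shell globs. A sketch of selecting only the parquet files that way (function name and bucket path are illustrative, not our real layout):

```shell
# Hypothetical: copy only parquet files using the aws s3 CLI's filter flags.
# Shell-style wildcards in the source path are NOT expanded by aws s3;
# instead you exclude everything, then re-include the pattern you want.
function copy_parquet_via_s3() {
    aws s3 cp "$PUDL_OUTPUT/" "$1" \
        --recursive \
        --exclude "*" \
        --include "parquet/*.parquet"
}
```

The filters are evaluated in order against paths relative to the source directory, which is why the blanket `--exclude "*"` has to come before the `--include`.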

@zaneselvans marked this pull request as ready for review on February 13, 2024 at 23:15
bendnorman (Member) left a review:

Looks good! I gave the deploy-pudl-vm-service-account the Storage Admin role on the gs://parquet.catalyst.coop bucket so it can create and delete files.
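For the record, a grant like that can be expressed with `gsutil iam ch`. A sketch only, since the service account's GCP project isn't named in this thread and is left as a placeholder argument:

```shell
# Hypothetical sketch: grant Storage Admin on the parquet bucket to the
# build service account. The project ID argument ($1) is a placeholder.
function grant_parquet_bucket_admin() {
    gsutil iam ch \
        "serviceAccount:deploy-pudl-vm-service-account@${1}.iam.gserviceaccount.com:roles/storage.admin" \
        "gs://parquet.catalyst.coop"
}
```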

@zaneselvans added this pull request to the merge queue on Feb 14, 2024
Merged via the queue into main with commit 9d40b68 Feb 15, 2024
12 checks passed
@zaneselvans deleted the deploy-parquet branch on February 15, 2024 at 00:08
cmgosnell pushed a commit that referenced this pull request Feb 15, 2024
* Update nightly build script to distribute parquet

* Fix logging cut-and-paste error

* Name parquet distribution success variable like all the others
github-merge-queue bot pushed a commit that referenced this pull request Feb 15, 2024
* first draft of all eia860m extraction

* first draft of transform process: runs through existing 860 transform does not do changelog yet

* simplify replaces in tranform and add changelog dropdupes

* first pass of adding full transform for eia860 and schema

* Fix bad monthly expand_timeseries

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* clean up settings and add alembic migration

* fix the settings grabbing in eia860 settings with new eia860m setup

* Convert 860m table into db table

* make a new 860m settings class, dont pass in report_date for 860, & use the right table name

* remove FK relationships to the changelog table and make expand_timeseries have a dec unit test

* change eia86m io manager to our cool new db + parquet manager

* add docs and fix b4by missp3lls and change tbl name

* add migration and update fast 860m month post new 860m integration

* alembic migrations

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

* Fix the working partitions in settings and helpers

* Fix settings partitions and be better about selecting 860m only columns

* Update nightly build script to distribute parquet (#3399)

* Update nightly build script to distribute parquet

* Fix logging cut-and-paste error

* Name parquet distribution success variable like all the others

* [pre-commit.ci] auto fixes from pre-commit.com hooks

For more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zane Selvans <zane.selvans@catalyst.coop>
@zaneselvans linked an issue on Feb 16, 2024 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

- Update experimental parquet outputs in nightly builds
- Output PUDL as Parquet as well as SQLite
2 participants