Update nightly build script to distribute parquet #3399
Conversation
```bash
# Only attempt to update outputs if we have an argument
# This avoids accidentally blowing away the whole bucket if it's not set.
if [[ -n "$1" ]]; then
```
I think if we called this function without an argument it would delete all of our previous releases, which would be bad, so I added this small margin of safety. Maybe we should automatically make the versioned releases read-only as soon as we create them so this can't happen. `stable` should always point at a versioned release, so replacing it if need be wouldn't be hard. `nightly` is ephemeral, so it's not a disaster if it accidentally gets wiped out, and the build outputs for recent `nightly` builds are still lying around, so we could repopulate it manually.
Thanks for catching this!
It should be possible to make the versioned releases read-only. I created an issue for it.
Great. Hopefully this is easy? I'd like to add this to the process / script for our next release.
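One possible approach, sketched here as an assumption rather than a settled plan: GCS temporary holds prevent held objects from being overwritten or deleted until the hold is released, so placing holds on everything under a versioned path would make it effectively read-only. The version tag below is illustrative, not a real release.

```bash
# A minimal sketch, not yet part of the release process: put a temporary hold
# on every object under a versioned release path so nothing there can be
# overwritten or deleted until the hold is explicitly released.
VERSION_PATH="gs://parquet.catalyst.coop/v2024.2.6"  # illustrative version tag
gsutil -m retention temp set "${VERSION_PATH}/**"
```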
docker/gcp_pudl_etl.sh (outdated diff):
```bash
# Distribute Parquet outputs to a private bucket
distribute_parquet 2>&1 | tee -a "$LOGFILE"
DISTRIBUTE_PARQUET=${PIPESTATUS[0]}
```
This has to happen before we clean up the outputs, because the cleanup removes all the `parquet/*.parquet` files there.
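In other words, the ordering in the deploy script has to look roughly like this; the cleanup function name below is hypothetical, since only the `distribute_parquet` call appears in the diff above:

```bash
# Distribution must run before cleanup, because cleanup deletes the
# parquet/*.parquet files that distribute_parquet copies out.
distribute_parquet 2>&1 | tee -a "$LOGFILE"
DISTRIBUTE_PARQUET=${PIPESTATUS[0]}
clean_up_outputs_for_distribution 2>&1 | tee -a "$LOGFILE"  # hypothetical name
```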
```bash
function distribute_parquet() {
    PARQUET_BUCKET="gs://parquet.catalyst.coop"
    # Only attempt to update outputs if we have a real value of BUILD_REF
    # This avoids accidentally blowing away the whole bucket if it's not set.
    echo "Copying outputs to parquet distribution bucket"
    if [[ -n "$BUILD_REF" ]]; then
        if [[ "$GITHUB_ACTION_TRIGGER" == "schedule" ]]; then
            # If running nightly builds, copy outputs to the "nightly" bucket path
            DIST_PATH="nightly"
        else
            # Otherwise we want to copy them to a directory named after the tag/ref
            DIST_PATH="$BUILD_REF"
        fi
        echo "Copying outputs to $PARQUET_BUCKET/$DIST_PATH" && \
        gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/$DIST_PATH"

        # If running a tagged release, ALSO update the stable distribution bucket path:
        if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
            echo "Copying outputs to $PARQUET_BUCKET/stable" && \
            gsutil -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/stable"
        fi
    fi
}
```
First I tried moving the wholesale removal of the `parquet/` directory from the output cleanup into the deployment function, but this didn't work because we need to deploy twice when we have a versioned release (once to the versioned path, and once to `stable`).

Then I tried deploying outputs more selectively from the existing deployment function, instead of just copying everything in the directory, but this didn't work because the S3 CLI (incredibly) doesn't understand wildcards, and there was no simple way to specify the set of parquet vs. non-parquet files for separate distribution.

Finally I made this terrible cut-and-paste copy of the deployment function, which gets called before the output cleanup.

Is there a simple better way, short of rewriting the entire thing in Python and bailing on these CLIs?
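For reference, the AWS CLI expects `--exclude`/`--include` filters on a recursive copy rather than shell-style wildcards in the paths. A minimal sketch of what selecting only the parquet files would look like with those flags; the destination bucket here is illustrative, not one of the project's:

```bash
# Copy only the parquet outputs: exclude everything, then re-include *.parquet.
# The destination bucket is illustrative, not an actual project bucket.
aws s3 cp "$PUDL_OUTPUT" "s3://example-dist-bucket/nightly" \
    --recursive --exclude "*" --include "parquet/*.parquet"
```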
Which deployment function are you referring to? Also, the S3 CLI not understanding wildcards is incredibly frustrating.
I was referring to `copy_outputs_to_distribution_bucket()`, which calls `upload_to_dist_path()` twice with different arguments if we're doing a versioned (aka `stable`) release.

I know, I was in disbelief on the S3 thing. They have pages of documentation on how to include/exclude files and, jfc guys, why didn't you just do the Unix thing? Anyway, it's yet another indication that we should probably be doing this work from within Python, with libraries.
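For readers following along, a rough reconstruction of the structure being described; the bodies of these functions aren't shown in this thread, so this is an assumption-laden sketch, not the actual script:

```bash
# Hypothetical reconstruction of the existing deployment function: upload once
# to the versioned or nightly path, and again to "stable" for tagged releases.
function copy_outputs_to_distribution_bucket() {
    upload_to_dist_path "$DIST_PATH"
    if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
        upload_to_dist_path "stable"
    fi
}
```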
Looks good! I gave the `deploy-pudl-vm-service-account` the Storage Admin role on the `gs://parquet.catalyst.coop` bucket so it can create and delete files.
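For the record, a sketch of how that grant can be made with `gsutil iam ch`; the service account's full email suffix is an assumption here, since only its short name appears above:

```bash
# Grant the build VM's service account Storage Admin on the distribution bucket.
# The @<project>.iam.gserviceaccount.com suffix is assumed, not shown in the thread.
gsutil iam ch \
    "serviceAccount:deploy-pudl-vm-service-account@<project>.iam.gserviceaccount.com:roles/storage.admin" \
    gs://parquet.catalyst.coop
```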
Commits:
* Update nightly build script to distribute parquet
* Fix logging cut-and-paste error
* Name parquet distribution success variable like all the others
Overview
Distribute parquet outputs to a private bucket using the same path conventions that we're using in the public buckets for the other outputs.
Closes #3362
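Concretely, the bucket layout follows these conventions; the version tag shown is illustrative:

```bash
# gs://parquet.catalyst.coop/nightly/    <- outputs from scheduled nightly builds
# gs://parquet.catalyst.coop/v2024.2.6/  <- one path per tagged release (illustrative tag)
# gs://parquet.catalyst.coop/stable/     <- copy of the most recent tagged release
```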
Testing
I wish there were an easy way to test this besides running the nightly builds and having them fail repeatedly.
To-do list