
[ETL-676] Create data loading procedure for each parquet datatype #132

Merged: 5 commits into main on Jul 23, 2024

Conversation

@philerooski (Contributor) commented on Jul 22, 2024

For each of our Parquet data types, we create a stored procedure that performs a COPY INTO operation from that data type's S3 location into a Snowflake table. Additionally, I've added an external stage for our development S3 bucket. We do not call these stored procedures as part of the deployment process; instead, we will call them from Tasks (future work).

  • snowflake/objects/database/recover/schema/parquet/procedure/copy_into_table_from_stage.sql
    This is the stored procedure logic. The procedure copies data from a specific stage (S3) location but can copy into any Snowflake table (specified by the target_table parameter); a rough sketch of the pattern follows this list.
  • snowflake/objects/database/recover/schema/parquet/procedure/deploy.sql
    This handles the logic of deploying the above procedure for each of our Parquet data types.
  • snowflake/objects/database/recover/schema/parquet/deploy.sql
    This is our (relative) top-level deployment script, which handles deployment of all the child objects under the Parquet schema. We've added logic that creates the procedures. Depending on whether this is a dev deployment or a prod/staging deployment (i.e., the branch is main), the procedures reference the development or production stage, respectively.
  • snowflake/objects/database/recover/schema/parquet/stage/*
    Everything under this directory is related to deploying the prod and dev stages.
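
As a rough illustration of the pattern (not the actual contents of copy_into_table_from_stage.sql; only the target_table parameter is named above, so the stage name, the datatype parameter, and the COPY options below are assumptions), the procedure could look something like this:

```sql
-- Hypothetical sketch only. The stage name (parquet_dev_stage), the datatype
-- parameter, and the COPY options are placeholders for illustration.
CREATE OR REPLACE PROCEDURE copy_into_table_from_stage(target_table VARCHAR, datatype VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    -- The table name has to be spliced into the statement dynamically,
    -- so the COPY INTO is built as a string and run with EXECUTE IMMEDIATE.
    LET copy_statement := 'COPY INTO ' || target_table ||
        ' FROM @parquet_dev_stage/' || datatype || '/' ||
        ' FILE_FORMAT = (TYPE = PARQUET)' ||
        ' MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE';
    EXECUTE IMMEDIATE copy_statement;
    RETURN 'Copied ' || datatype || ' into ' || target_table;
END;
$$;
```

A Task created later (the future work mentioned above) would then only need to CALL this procedure on a schedule, one call per data type.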

@philerooski requested a review from a team as a code owner on July 22, 2024 03:57
@thomasyu888 (Member) left a comment:

🔥 LGTM! The test seems to be failing for certain portions though.

@BryanFauble (Contributor) left a comment:

These changes look great!

Are there any instructions written up around how to run these items via the CLI? I know during codespaces you were covering a few ways that running these was a bit different from passing arguments normally. I'm thinking about this from the perspective of "If I need to make a change, what is the best way for me to test the changes?"

@philerooski (Contributor, Author) replied to @BryanFauble:

> I know during codespaces you were covering a few ways that running these was a bit different from passing arguments normally. I'm thinking about this from the perspective of "If I need to make a change, what is the best way for me to test the changes?"

This is one area where I feel like Snowflake starts tripping over itself. The only way to invoke a SQL script locally (afaik) is with snow sql -f, but that requires Jinja variables to be formatted like &{ this }, whereas in an actual deployment you would use EXECUTE IMMEDIATE FROM statements to invoke the SQL, which uses {{ this }} format. And we can't just snow sql -q "EXECUTE IMMEDIATE FROM ..." to run local SQL files because Snowflake requires the top-level SQL script in your EXECUTE IMMEDIATE FROM chain to be in a stage -- which means we need to push the code to an internal Snowflake stage or push it to Git and use an external stage.

I've been developing without variables as much as possible, then amending my git commits whenever I'm ready to start using variables. This is pretty inconvenient. It might be easier to replace {{ this }} with &{ this } when developing and then switch the formatting when I'm ready to push to Git. I'm still trying to figure out the best way to go about development.
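
To make the contrast concrete, here is a tiny sketch (the stage name, bucket URL, repository stage, and git_branch variable are invented for this example, not taken from the repo):

```sql
-- 1) Local development: run the file directly with `snow sql -f deploy.sql`,
--    supplying variables on the command line. Client-side templating expects
--    the &{ ... } format:
CREATE STAGE IF NOT EXISTS parquet_s3_stage
  URL = 's3://example-bucket/&{ git_branch }/parquet/';

-- 2) Actual deployment: the same script is invoked server-side, which only
--    understands the {{ ... }} format and requires the calling script to
--    already live on a stage (internal, or Git via a repository stage):
EXECUTE IMMEDIATE FROM @deployment_repo/branches/main/deploy.sql
  USING (git_branch => 'main');
-- ...and inside deploy.sql the same variable is written as {{ git_branch }}.
```

So the same file can't be run unmodified both ways, which is what forces the formatting swap described above.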

@philerooski (Contributor, Author) replied to @thomasyu888:

> The test seems to be failing for certain portions though.

Seems like it might have had something to do with Synapse being down for maintenance Sunday evening?

@BryanFauble (Contributor) replied:

@philerooski What I have been doing in the Terraform world is to just commit my changes and push to origin:
Sage-Bionetworks-Workflows/eks-stack@main...ibcdpe-935-vpc-updates

I then let the CI/CD pipeline run automatically and look at the results in the UI. I could run all of this locally, but the pipeline has been more than sufficient for me.

Perhaps we should follow the same approach here, where you push changes to origin and watch the GitHub Action that runs?

What this would prevent in your case is a mistake slipping in when you "switch the formatting when I'm ready to push to Git".

@philerooski (Contributor, Author) replied:
@BryanFauble That's the process I use now, letting the CI/CD do the deployment after I amend the commit and force push. A few (minor-ish) downsides:

  1. My deployment method doesn't take into account unchanged "stacks". Everything needs to be deployed from scratch, which is a lot slower than you would hope if all you're trying to do is update a single object. So I end up manually commenting out time-consuming steps like snowflake/objects/.../table/deploy.sql (see the sketch after this list).
  2. RECOVER's CI/CD includes a lot of other time-consuming stuff like AWS deployments and tests -- which aren't blocking the Snowflake deployment in the CI/CD, but also aren't related and don't need to run.
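
For illustration, point 1 in practice amounts to temporarily editing the parent deploy script along these lines (the paths below are abbreviated and the snippet is a sketch, not the file's actual contents):

```sql
-- Parent deploy.sql invoking child deployment scripts in order.
EXECUTE IMMEDIATE FROM './stage/deploy.sql';

-- Temporarily commented out while iterating, since recreating every table
-- is the slowest part of a from-scratch deployment:
-- EXECUTE IMMEDIATE FROM './table/deploy.sql';

EXECUTE IMMEDIATE FROM './procedure/deploy.sql';
```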

@philerooski merged commit 6a610b1 into main on Jul 23, 2024
18 checks passed
@philerooski deleted the etl-676 branch on July 23, 2024 19:09