Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-670] Adjust cleanup job to run on main directory of dev env #126

Merged
merged 3 commits into from
Jul 15, 2024

Conversation

BryanFauble
Copy link
Contributor

@BryanFauble BryanFauble commented Jul 13, 2024

Problem:

  1. We do want to run the cleanup on the main directory of the dev branch

Solution:

  1. Updating the workflow so the job is triggered and we are able to clean the main directory

Testing:

  1. Will be looking at the workflow run

Copy link
Contributor

@rxu17 rxu17 Jul 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually making me rethink the logic of integration-test-develop as well. integration-test-develop is just copying the recent 2 weeks of exports and should just be triggering the branch's specific namespaced jobs to run on that subset of exports but it shouldn't be doing it for the main namespace in the dev project when we merge to main because main will always run on all of the data in dev to mirror what it does in prod (which it's currently doing)

I am thinking now that these are the updates we could make and it makes a bit more sense:

  • integration-test-develop needs the if: github.ref_name != 'main' line at the highest level of the GH job
  • integration-test-develop-cleanup needs the if: github.ref_name != 'main' line at the highest level of the GH job
  • sceptre-deploy-staging will have the update:needs: sceptre-deploy-develop

I think this will give the logic we need. That the integration test and cleanup for when we are in dev environment to only run when we are working on the namespaced branch, and doesn't run when we merge to main.

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the latest comments on this PR, but I think we want the same behavior when there are changes on dev/main as any other dev/* namespace.

Copy link
Contributor

@philerooski philerooski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is what we want. We don't need to handle integration tests in dev/main any differently than we handle tests for our feature branch. We've already specified environment = develop on L265, so this is happening separately from our production pipeline and there's no harm in running the integration tests in the main namespace.

@BryanFauble
Copy link
Contributor Author

We've already specified environment = develop on L265, so this is happening separately from our production pipeline and there's no harm in running the integration tests in the main namespace.

@philerooski From @rxu17's comments on #121 (comment) - It sounded like we did not want to run cleanup for main.

I agree that we should be running integration tests in the main namespace, but is it acceptable that we would be cleaning up the directories in the main namespace?

@BryanFauble
Copy link
Contributor Author

I threw this mermaid diagram together for my sake to visualize this workflow:

flowchart LR;
  pre-commit-->upload-files-->sceptre-deploy-develop
  pre-commit-->nonglue-unit-tests-->sceptre-deploy-develop
  pre-commit-->pytest-docker-->glue-unit-tests-->sceptre-deploy-develop-->integration-test-develop
  sceptre-deploy-develop-->integration-test-develop-cleanup-->integration-test-develop
  integration-test-develop-->sceptre-deploy-staging[sceptre-deploy-staging\nif: github.ref_name == 'main']-->integration-test-staging
  sceptre-deploy-staging-->integration-test-staging-cleanup
  integration-test-staging-cleanup[integration-test-staging-cleanup\nif: github.ref_name == 'main']-->integration-test-staging
Loading

@philerooski
Copy link
Contributor

@BryanFauble Our dev data has a permanent home in the pilot-data prefix in the input bucket. Anything in the main prefix is ephemeral and any data in this prefix in any of the buckets ought to reflect HEAD of the main branch.

@rxu17
Copy link
Contributor

rxu17 commented Jul 15, 2024

I don't think this is what we want. We don't need to handle integration tests in dev/main any differently than we handle tests for our feature branch. We've already specified environment = develop on L265, so this is happening separately from our production pipeline and there's no harm in running the integration tests in the main namespace.

So the reason I thought we don't need to run integration tests for main namespace of develop is because we already run it for the feature_branch namespace of develop. But no nevermind, we do need integration tests for main namespace to run so that we are triggering the S3 to JSON job runs for main of develop after we merge a feature branch for main and we don't need to manually run that part.

In that case, then I think it's fine if we cleanup the main namespace of develop in both the input and intermediate data buckets. In our case, because all of our pilot data gets copied over (because we only have more than 2 weeks of exports for pilot data), we maintain the expected workflow for main namespace which is that the main namespaced jobs run on all of the input data

@BryanFauble BryanFauble changed the title [ETL-670] Adjust cleanup job to run always, but skip steps if main [ETL-670] Adjust cleanup job to run on main directory of dev env Jul 15, 2024
@BryanFauble
Copy link
Contributor Author

Thanks for the feedback @rxu17 @philerooski Take a look at the latest round of changes. We should always clean the directories now in develop

@@ -56,9 +54,6 @@ def main() -> None:
if not args.bucket_prefix or args.bucket_prefix[-1] != "/":
raise ValueError("Bucket prefix must be provided and end with a '/'")

if "main" in args.bucket_prefix:
Copy link
Contributor

@rxu17 rxu17 Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I forgot about us having to remove this if we want to clean up main. Hmm, should we add another safeguard?

if dev not in args.bucket:
   raise ValueError("Must be using buckets in the develop environment")

OR

dev_buckets = ["recover-dev-input-data", "recover-dev-intermediate-data", "recover-dev-processed-data", "recover-dev-cloudformation"]

if args.bucket not in dev_buckets:
   raise ValueError("Must be using buckets in the develop environment")

Overkill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will also be running this on the prod buckets too - But the staging directory of those buckets. I think controlling this in the workflow makes the most sense here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a safeguard for prod/main (we still want to be able to clean up prod/staging). I think this script should be agnostic to the object key, except for the aforementioned case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this guard condition look?

6e36626

Copy link
Contributor

@rxu17 rxu17 Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my brain is not functioning right now. Yes I second what Phil said.
Looks good to me!

@rxu17 rxu17 self-requested a review July 15, 2024 18:17
Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just one comment!

Copy link

@@ -56,8 +54,10 @@ def main() -> None:
if not args.bucket_prefix or args.bucket_prefix[-1] != "/":
raise ValueError("Bucket prefix must be provided and end with a '/'")

if "main" in args.bucket_prefix:
raise ValueError("Cannot delete objects in the main directory")
if "main" in args.bucket_prefix and "recover-dev" not in args.bucket:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@BryanFauble BryanFauble merged commit ae2cf3c into main Jul 15, 2024
17 checks passed
@BryanFauble BryanFauble deleted the etl-670-adjust-dev-cleanup branch July 15, 2024 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants