RFC 64: Asset garbage collection #379

Merged: 13 commits merged into aws:main on Oct 30, 2024

Conversation

@kaizencc (Contributor) commented Sep 3, 2021

This is a request for comments about Asset Garbage Collection. See #64 for
additional details.

APIs are signed off by @rix0rrr.


By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache-2.0 license

Review comments on text/0064-asset-garbage-collection.md (outdated):
-t, --type=[s3|ecr] filters for type of asset

Examples:
cdk gc
Contributor:

Can we make this zero-touch?

Could it be something we install into the account that runs periodically? I think that should be the end goal of the user experience.

We could complement that with a CLI experience, but only if one leads naturally into the other (in terms of development effort).

Contributor Author (@kaizencc):

I could be way off base, but a zero-touch solution would likely be a Lambda function that runs once a week (or on some other schedule), versus the CLI experience, which is invoked manually. But at the end of the day, the actual code that gets run (tracing the assets in the environment and then deleting the unused ones) should be exactly the same.

It seems like, if the CLI experience works, integrating it into a zero-touch solution would not require much additional work.

Contributor (@eladb), Sep 9, 2021:

Yes, +1 on @kaizen3031593's response. I agree that a scheduled task running GC is what users will eventually want, but I also think it makes sense to start by implementing cdk gc as the foundational building block that will later be used by an automated task.

After we have a solid manual cdk gc (which will take some time to stabilize), we (or the community) can vend a construct that runs this periodically, and maybe at some point we can include that in our bootstrapping stack.

@kaizen3031593 it's worth mentioning this "roadmap" in your RFC so that this vision is explicitly articulated.

Comment:
It would be interesting to be able to opt-in to this behavior as part of the bootstrap process. The CdkToolkit stack could maybe be the owner of the scheduled Lambda function that runs the GC process.
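
As a rough illustration (not part of the RFC), a scheduled-GC construct owned by the bootstrap/CdkToolkit stack could look roughly like the hedged sketch below; the construct name, handler asset path, runtime, and weekly schedule are all assumptions for illustration:

```ts
import { Construct } from 'constructs';
import {
  Duration,
  aws_events as events,
  aws_events_targets as targets,
  aws_lambda as lambda,
} from 'aws-cdk-lib';

export class ScheduledAssetGc extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Hypothetical handler: it would run the same trace-and-delete logic
    // as a manual `cdk gc` against the bootstrap bucket and ECR repo.
    const gcFunction = new lambda.Function(this, 'GcFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/asset-gc'),
      timeout: Duration.minutes(15),
    });

    // Run garbage collection once a week.
    new events.Rule(this, 'WeeklyGc', {
      schedule: events.Schedule.rate(Duration.days(7)),
      targets: [new targets.LambdaFunction(gcFunction)],
    });
  }
}
```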

Comment on lines 104 to 105 of the RFC:

> assets that are being referenced by these stacks. All assets that are not reached via
> tracing can be safely deleted.
Contributor:

Something to think about, which I don't think is optional: how will we leave a safety margin for rollbacks?

As an example:

  • A pipeline deploys revision R1 with some version of some Lambda code (r1.zip)
  • A pipeline deploys revision R2, changing the Lambda code (r2.zip)
  • Garbage collection gets run (deleting r1.zip)
  • For whatever reason, the app team wants to rollback to R1, but r1.zip is now gone

I guess the answer to what we need to do in GC depends on how rollback is going to work. If it's going to be a full cdk deploy cycle we could rely on the CDK CLI to re-upload the assets... but how do CD systems work?

  • In a CD system like Amazon uses internally the CFN deployment would be executed, but the asset deployment wouldn't (and so r1.zip would be gone without recovery)
  • CodePipeline doesn't currently support rollbacks, but it could in the future (and chances are it would work like Amazon's does)

All in all, you need to think about this case and think about what is reasonable to expect from a CI/CD system and what is reasonable for us to implement, and write a couple words about that.

For example, a reasonable safety precaution would be we only delete assets 30 days (or pick a number) after they stop being used. That would allow for safe interoperability even with pessimistic CI/CD scenarios.
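
As a hedged sketch of that precaution (not the proposed implementation), the deletion step could filter unreferenced S3 assets by age. Here the object's LastModified is used as a rough stand-in for "when it stopped being used" (a real implementation would need a better marker), and the helper name, bucket handling, and 30-day value are illustrative:

```ts
import { S3Client, ListObjectsV2Command, DeleteObjectsCommand } from '@aws-sdk/client-s3';

const GRACE_PERIOD_DAYS = 30;

async function deleteStaleAssets(bucket: string, unreferenced: Set<string>): Promise<void> {
  const s3 = new S3Client({});
  const cutoff = Date.now() - GRACE_PERIOD_DAYS * 24 * 60 * 60 * 1000;
  const toDelete: { Key: string }[] = [];

  let token: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({ Bucket: bucket, ContinuationToken: token }));
    for (const obj of page.Contents ?? []) {
      // Only delete assets that are both unreferenced and older than the grace period.
      if (obj.Key && unreferenced.has(obj.Key) && obj.LastModified && obj.LastModified.getTime() < cutoff) {
        toDelete.push({ Key: obj.Key });
      }
    }
    token = page.NextContinuationToken;
  } while (token);

  // DeleteObjects accepts at most 1000 keys per call.
  for (let i = 0; i < toDelete.length; i += 1000) {
    await s3.send(new DeleteObjectsCommand({ Bucket: bucket, Delete: { Objects: toDelete.slice(i, i + 1000) } }));
  }
}
```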

Contributor Author (@kaizencc):

This is a good idea and something I can integrate into the RFC. I'll start with 30 days; since I'm not all that familiar with how people use CloudFormation rollbacks, I don't have a sense of what (if any) time period would be more suitable.

Comment:

For this:

> This is a good idea and something I can integrate into the RFC. I'll start with 30 days, since I'm not all that familiar with how people use cloudformation rollbacks I don't have an idea of what (if any) is a more suitable time period.

I don't think this can entirely be solved with timestamps. As I understand it, there are two kinds of race conditions:

  1. You fetch the templates while some stacks have updates in progress. If you deleted assets by cross-referencing such templates, you might delete assets that would be needed if one or more of those in-progress stacks rolled back (because the CFn template you got would be the new one that caused the update, not the one being rolled back to). This cannot be remedied by comparing asset timestamps, as it's orthogonal to that: it's about the two template versions rather than two timestamps.
  2. You could also race with asset uploads. If an asset got uploaded but the changeset wasn't executed yet, and your workflow ran and referenced the about-to-be-updated template, then you would delete the asset that just got uploaded and the stack update would fail. This is better than the rollback failure of (1), as it's only an update failure, but it's still a failure.

I think (2) can be remedied by using a timestamp: leave alone all assets newer than X days, where X is the maximum likely time difference in someone's pipeline between the asset-upload stage and the changeset-execution stage.

For (1), assuming that analysis is correct, some atomicity is needed. If we could atomically ask for stack status and template in the same call, we could decide to run the workflow only if none of the stacks were in an in-progress state (since such a stack can roll back to a template you don't know about). Then combine it with the timestamp check of (2) so that it doesn't race with newer deployments. This will still race with CREATE_IN_PROGRESS, since such a stack might not even have existed during the scan and you could end up deleting the assets of a stack that is about to be created; again, the age check of (2) could remedy this. If there are multiple pipelines (and thus multiple stacks) deploying to the same account+region, you will need to configure X to be the worst-case maximum among all the pipelines reaching that stage (obvious, but worth stating).

Probably not going to happen, but if we had a per-stack asset bucket prefix (or an entire bucket) and ECR repo, then we could simplify all this in the pipeline case by having a post-deployment action that blindly cleans up the assets of the stack that just succeeded. We wouldn't have to bother about other stacks in the account, and it would only happen after a successful update of a stack (once the stack is "stable"), so there would be no need to worry about in-progress states. Combined with the "age" workaround, this might have been the best option. The drawback is that if several stacks/templates actually reference the same asset, this leads to asset duplication, but maybe that's OK?

@blimmer blimmer left a comment:

I think this is a great start. I like how you consider this a step-1/building block for GC, which many of my clients using CDK increasingly need to support.

@jogold commented Feb 1, 2022

See https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/README.md for a working construct that does asset garbage collection.

@blimmer commented Feb 4, 2022

@jogold - thank you for this construct. I tried it out on my personal CDK project and it worked great! I think this is an excellent proof-of-concept and would love to see it eventually integrated into the official CDK project.

@kaizencc (Contributor Author):
@jogold thanks for the POC for asset garbage collection! This RFC is something that we have on our radar for later this year and when the time comes I'd be happy to iterate on this and get it into the CDK.

@ustulation:

Hello @kaizencc / others, I've been following this for a long time and thought about pinging to find out whether this is now being worked on, since you mentioned it would happen later in the year and we are now nearing the end of the year.

@ustulation:

The problem I'm currently facing is that various compliance tools flag insecure images in ECR; their findings are pretty similar to what the automatic ECR scans report (especially the ones labelled High severity). They are all images from the past that are no longer in use, as newer versions were pulled and used, and so on. However, since they all linger around, the compliance check fails: it doesn't know that we aren't using the older ones.

This of course turns out to be a big problem for orgs that need to pass such compliance checks, so we end up cross-referencing the CFn templates and manually deleting the old images that are no longer in use.

@ustulation:

Assuming the problems mentioned here are real and not due to my lack of knowledge (please correct me otherwise; I might just be overthinking all this), here's an algorithm that could work in practice:

  1. List all the stacks in the given env (account+region).
  2. For each stack, get the template body, and immediately afterwards
  3. Get the stack status.
  4. If the status is one of {UPDATE_IN_PROGRESS, UPDATE_FAILED, UPDATE_ROLLBACK_IN_PROGRESS}, abort the garbage cleanup, as it's not safe; re-run the entire cycle at some later point, or whenever the user chooses.
  5. Once all stack templates are collected, extract all the hashes (the usual [a-f0-9]{64}, assuming they are always lower-cased).
  6. Check the cdk-staging bucket and ECR repo for all assets that contain any of the collected hashes. Leave them alone.
  7. For the rest, which aren't referenced by the templates, delete them if they are older than X.

In the absence of being able to atomically get both the template and the stack status, this should work in practice, though it has theoretical edge cases.

The most interesting part here is the transition from step 2 to step 3. The only time we have a problem is when we get the template that is currently being applied (an UPDATE_IN_PROGRESS). CFn can roll that back to a template we did not manage to collect, so we can't rely on the one we did collect and must abort. There can be a false positive for aborting when, at step 2, the stack was stable and it only went into an update between steps 2 and 3. In that case you could actually use the template you got and carry on, but since there is no way of knowing that, you just assume that whatever step 3 reports was also the state when step 2 happened. So we play it safe and abort.

UPDATE_ROLLBACK_IN_PROGRESS at step 3 is also not safe. The stack could already have been in UPDATE_ROLLBACK_IN_PROGRESS at step 2, in which case it would have been safe, as the rollback template is what we got and what CFn will eventually finish up with; but there's no way of knowing this. In the bad case, the stack was UPDATE_IN_PROGRESS at step 2 and moved to UPDATE_ROLLBACK_IN_PROGRESS by step 3: the template has changed, you now hold the template that is no longer good, and you could end up deleting resources pointed to by the template being rolled back to (timestamps don't help; those resources might be a year old, for example, and you simply must not delete them). Since the transition from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_IN_PROGRESS happens quickly, depending on network latency there is a realistic chance of hitting this race in practice between steps 2 and 3.

Same with UPDATE_FAILED. The UPDATE_IN_PROGRESS -> UPDATE_FAILED transition is fairly quick (even quicker than the one above). No use risking it, so abort.

Edge case:

The rest should be OK in practice. There is a theoretical chance that the latency between calls 2 and 3 was large enough that the stack went all the way from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE in the meantime. You would then interpret the template you got at step 2 as belonging to the UPDATE_ROLLBACK_COMPLETE state you got at step 3, and since that is a stable state (CFn will not apply further updates on its own without the user prompting it), you would "consider" that template when you actually shouldn't, since it was the template CFn discarded during the rollback. In practice, however, the time difference between steps 2 and 3 is milliseconds, or at worst a second or two. I've never seen a transition from UPDATE_IN_PROGRESS/UPDATE_FAILED (the actual state during step 2) to UPDATE_ROLLBACK_IN_PROGRESS to UPDATE_ROLLBACK_COMPLETE (the actual state during step 3) happen so fast that this would be a problem; I had to put a sleep between steps 2 and 3 to simulate it.


All other cases, such as missing a stack because it was CREATE_IN_PROGRESS after step 1, or assets that were just uploaded for a stack you already finished steps 2 and 3 for but which is about to be updated while you do step 7, can be solved by a timestamp comparison. While dealing with an environment, allow for the largest interval your pipeline needs between the asset-upload stage and the CFn changeset-execute stage in that env, and only delete assets older than that interval, measured from the time the GC workflow runs for that env. It will be larger for later stages in the pipeline (or you can just set a blanket expiry interval of X days plus however long it normally takes the pipeline to reach the last stage).
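
A minimal TypeScript sketch of steps 1-5 above (with steps 6-7 left as a note), assuming the AWS SDK for JavaScript v3; the function name, the unsafe-status set, and the omission of pagination and error handling are simplifications for illustration:

```ts
import {
  CloudFormationClient,
  ListStacksCommand,
  GetTemplateCommand,
  DescribeStacksCommand,
} from '@aws-sdk/client-cloudformation';

const UNSAFE_STATUSES = new Set(['UPDATE_IN_PROGRESS', 'UPDATE_FAILED', 'UPDATE_ROLLBACK_IN_PROGRESS']);

async function collectReferencedHashes(cfn: CloudFormationClient): Promise<Set<string>> {
  const hashes = new Set<string>();

  // Step 1: list all stacks in the environment (skipping deleted ones).
  const stacks = await cfn.send(new ListStacksCommand({}));
  for (const summary of stacks.StackSummaries ?? []) {
    if (summary.StackStatus === 'DELETE_COMPLETE') continue;

    // Steps 2-3: get the template body, immediately followed by the stack status.
    const template = await cfn.send(new GetTemplateCommand({ StackName: summary.StackName }));
    const described = await cfn.send(new DescribeStacksCommand({ StackName: summary.StackName }));
    const status = described.Stacks?.[0]?.StackStatus ?? '';

    // Step 4: abort if the stack could roll back to a template we never saw.
    if (UNSAFE_STATUSES.has(status)) {
      throw new Error(`Stack ${summary.StackName} is ${status}; aborting garbage collection`);
    }

    // Step 5: extract candidate asset hashes from the template body.
    for (const match of (template.TemplateBody ?? '').matchAll(/[a-f0-9]{64}/g)) {
      hashes.add(match[0]);
    }
  }
  return hashes;
}
// Steps 6-7 would then list the staging bucket / ECR repo and delete only
// the assets whose keys contain none of these hashes and that are older than X.
```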


This doesn't deal with the IMPORT_* family of statuses, because I've never used them personally, but I believe they should follow the same logic as UPDATE_* above.


Alternatives

  1. If there was a CFn API call to (atomically) get both the template and the status, you wouldn't have to handle some of the cases above explicitly. You would still need to either abort or poll-and-wait in the UPDATE_IN_PROGRESS/UPDATE_FAILED case, because you would have gotten the template that will soon be discarded by CFn and rolled back to one you don't have.
  2. If CFn had an API call that returned either one template (the one where CFn "stopped" after applying, i.e. reached a stable state) or two templates (the one CFn is currently applying plus the one it would roll back to if the update failed), then you wouldn't have to deal with this mess at all: just use both templates when available and blindly delete resources not referenced by any of the collected templates (from all stacks in the env). This would be awesome! You would still need the timestamp check for assets belonging to stacks you didn't know about during your scan, etc., but that's always going to be the case unless there's an (impossible) global lock on the account.
  3. Have a different bootstrap qualifier for each pipelined project. That way every stage can have an additional stack which depends on all the other stacks for the project in that stage, and whose sole job is to run the GC workflow after every deployment. It will only run once the upstream stacks have been updated/created successfully and are stable. Alternatively, have a post-deployment action per stage in the pipeline that does this. In either case, the pipeline locks deployments to the stage until all actions/updates/rollbacks complete, so you don't have to care about stack status. Further, because projects are separated by cdk bootstrap qualifier, each project has its own dedicated bucket/ECR repo per stage, and stacks from other projects don't interfere. This would be a clean solution too, I guess. Asset duplication would be a drawback, e.g. the source code of the log-retention lambda etc. is usually shared within the same env; now it would live in per-project asset/staging buckets, which may not be all that bad.

@ustulation:

I'm currently coding this algorithm for my project, with an additional check that retries N times (and then aborts) for every stack where the time difference between the start of step 2 and the end of step 3 above is greater than 2 seconds, to safeguard against the edge case listed above.
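
A hedged sketch of that safeguard, assuming the same AWS SDK v3 calls as in the earlier sketch; the attempt count and the 2-second threshold are illustrative values, not part of the RFC:

```ts
import { CloudFormationClient, GetTemplateCommand, DescribeStacksCommand } from '@aws-sdk/client-cloudformation';

const MAX_ATTEMPTS = 3;
const MAX_FETCH_GAP_MS = 2000;

async function fetchTemplateAndStatus(cfn: CloudFormationClient, stackName: string) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const start = Date.now();
    const template = (await cfn.send(new GetTemplateCommand({ StackName: stackName }))).TemplateBody ?? '';
    const status = (await cfn.send(new DescribeStacksCommand({ StackName: stackName }))).Stacks?.[0]?.StackStatus ?? '';

    // Only trust the template/status pairing if both calls completed close together.
    if (Date.now() - start <= MAX_FETCH_GAP_MS) {
      return { template, status };
    }
    // Otherwise the calls were too far apart to trust; retry this stack.
  }
  throw new Error(`Template and status for ${stackName} could not be fetched close enough together; aborting GC run`);
}
```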

rix0rrr previously approved these changes Oct 29, 2024
@mergify mergify bot dismissed rix0rrr’s stale review October 30, 2024 14:48

Pull request has been modified.

@kaizencc kaizencc merged commit dc8b49a into aws:main Oct 30, 2024
2 checks passed
@kaizencc kaizencc deleted the 0064 branch October 30, 2024 15:51
@kaizencc kaizencc mentioned this pull request Oct 30, 2024