RFC 64: Asset garbage collection #379
Conversation
-t, --type=[s3|ecr]      filters for type of asset

Examples:
cdk gc
Can we make this zero-touch?
Could it be something we install into the account that runs periodically? I think that should be the end goal of the user experience.
We could complement that with a CLI experience, but only if one leads naturally into the other (in terms of development effort).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be way off base, but a zero-touch solution would likely be a Lambda function that runs once a week (or however long), versus the CLI experience, which is triggered by a manual command. But at the end of the day, the actual code that gets run (tracing the assets in the environment and then deleting the unused ones) should be exactly the same.
It seems like if the CLI experience works then integrating it into a zero-touch solution would not require much additional work.
Yes, +1 on @kaizen3031593's response - I agree that eventually a scheduled task that runs GC is what users would want, but I also think it makes sense to start by implementing `cdk gc` as the foundational building block that will be used in an automated task.
After we have a solid manual `cdk gc` (which will take some time to stabilize), we (or the community) can vend a construct that runs this periodically, and maybe at some point we can include that in our bootstrapping stack.
@kaizen3031593 it's worth mentioning this "roadmap" in your RFC so that this vision is explicitly articulated.
It would be interesting to be able to opt in to this behavior as part of the bootstrap process. The `CdkToolkit` stack could maybe be the owner of the scheduled Lambda function that runs the GC process.
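For illustration, a minimal sketch of what such an opt-in construct could look like. The construct name `AssetGarbageCollector`, the weekly schedule, and the handler bundle location are assumptions, not part of the RFC:

```ts
import { Construct } from 'constructs';
import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';

// Hypothetical construct: schedules a Lambda that runs the same logic as `cdk gc`.
export class AssetGarbageCollector extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const handler = new lambda.Function(this, 'Handler', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('gc-handler'), // assumed location of the GC handler bundle
      timeout: Duration.minutes(15),
    });

    // A weekly schedule is an arbitrary placeholder.
    new events.Rule(this, 'Schedule', {
      schedule: events.Schedule.rate(Duration.days(7)),
      targets: [new targets.LambdaFunction(handler)],
    });
  }
}
```

The handler would run the same tracing/deletion logic as the manual `cdk gc`, so the CLI and the scheduled flavor stay in sync.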
assets that are being referenced by these stacks. All assets that are not reached via
tracing can be safely deleted.
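As a rough illustration of that tracing step (not the RFC's actual implementation), one could collect every asset hash that appears in the deployed templates and treat any bootstrap-bucket object whose key doesn't contain one of those hashes as a deletion candidate. The sketch below assumes the AWS SDK for JavaScript v3 and an illustrative 64-hex-character hash heuristic; pagination is omitted for brevity:

```ts
import {
  CloudFormationClient,
  ListStacksCommand,
  GetTemplateCommand,
} from '@aws-sdk/client-cloudformation';
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Collect asset hashes referenced by deployed templates, then return bucket keys
// that reference none of them. Pagination is omitted for brevity.
async function findUnreferencedAssets(bucketName: string): Promise<string[]> {
  const cfn = new CloudFormationClient({});
  const s3 = new S3Client({});

  const referenced = new Set<string>();
  const stacks = await cfn.send(new ListStacksCommand({}));
  for (const summary of stacks.StackSummaries ?? []) {
    if (summary.StackStatus === 'DELETE_COMPLETE') continue;
    const tpl = await cfn.send(new GetTemplateCommand({ StackName: summary.StackName }));
    // Asset object keys embed the source hash; matching 64 hex chars is an assumed heuristic.
    for (const hash of tpl.TemplateBody?.match(/[0-9a-f]{64}/g) ?? []) {
      referenced.add(hash);
    }
  }

  const candidates: string[] = [];
  const listing = await s3.send(new ListObjectsV2Command({ Bucket: bucketName }));
  for (const obj of listing.Contents ?? []) {
    const hash = obj.Key?.match(/[0-9a-f]{64}/)?.[0];
    if (hash && !referenced.has(hash)) {
      candidates.push(obj.Key!);
    }
  }
  return candidates;
}
```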
Something to think about which I think is not optional: how will we leave a safety margin for rollbacks?
As an example:
- A pipeline deploys revision R1 with some version of some Lambda code (`r1.zip`)
- A pipeline deploys revision R2, changing the Lambda code (`r2.zip`)
- Garbage collection gets run (deleting `r1.zip`)
- For whatever reason, the app team wants to roll back to R1, but `r1.zip` is now gone

I guess the answer to what we need to do in GC depends on how rollback is going to work. If it's going to be a full `cdk deploy` cycle we could rely on the CDK CLI to re-upload the assets... but how do CD systems work?

- In a CD system like the one Amazon uses internally, the CFN deployment would be executed, but the asset deployment wouldn't (and so `r1.zip` would be gone without recovery)
- CodePipeline doesn't currently support rollbacks, but it could in the future (and chances are it would work like Amazon's does)
All in all, you need to think about this case, decide what is reasonable to expect from a CI/CD system and what is reasonable for us to implement, and write a couple of words about that.
For example, a reasonable safety precaution would be we only delete assets 30 days (or pick a number) after they stop being used. That would allow for safe interoperability even with pessimistic CI/CD scenarios.
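A minimal sketch of that kind of age-based safety margin, assuming the AWS SDK for JavaScript v3. Note that here "age" is approximated by the object's upload time (`LastModified`); tracking "time since the asset stopped being used" would need extra bookkeeping (for example, tagging objects when they first become unreferenced):

```ts
import { S3Client, ListObjectsV2Command, DeleteObjectCommand } from '@aws-sdk/client-s3';

const RETENTION_DAYS = 30; // assumed default; the right value is an open question

// Delete only those unreferenced asset objects that are older than the retention
// window, leaving a safety margin for rollbacks.
async function deleteWithSafetyMargin(bucket: string, unreferencedKeys: Set<string>): Promise<void> {
  const s3 = new S3Client({});
  const cutoff = Date.now() - RETENTION_DAYS * 24 * 60 * 60 * 1000;

  const listing = await s3.send(new ListObjectsV2Command({ Bucket: bucket }));
  for (const obj of listing.Contents ?? []) {
    if (!obj.Key || !unreferencedKeys.has(obj.Key)) continue;
    if (obj.LastModified && obj.LastModified.getTime() < cutoff) {
      await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: obj.Key }));
    }
  }
}
```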
This is a good idea and something I can integrate into the RFC. I'll start with 30 days; since I'm not all that familiar with how people use CloudFormation rollbacks, I don't have an idea of what (if any) would be a more suitable time period.
> This is a good idea and something I can integrate into the RFC. I'll start with 30 days; since I'm not all that familiar with how people use CloudFormation rollbacks, I don't have an idea of what (if any) would be a more suitable time period.
I don't think this can entirely be solved with timestamps. As I understand it, there are two kinds of race conditions:

1. You fetch the templates when there were some in-progress stacks. If you deleted the assets by cross-referencing such templates, you might delete assets that would be needed if one or more of these in-progress stacks rolled back (because the CFn template you got would be the new one, which was the cause of the update). This cannot be remedied by comparing asset timestamps, as it's orthogonal to that - it's about the two template versions rather than two timestamps.
2. You could also race with asset uploads. If an asset got uploaded but the changeset wasn't executed yet, and your workflow ran and referenced the about-to-be-updated template, then you would delete the asset that just got uploaded and the stack update would fail. This is better than the rollback failure of (1), as it'll be an update failure - but still a failure.
I think (2) can be remedied by using a timestamp - leave alone all assets that are not older than X days, where X is the maximum likely time difference in someone's pipeline between the asset-upload stage and the changeset-execution stage.
For (1), assuming that is correct, some atomicity is needed. I guess if we could atomically ask for stack status + template in the same call, we could decide to run the workflow only if none of the stacks were in an in-progress state (as such a stack can roll back to a template which you don't know about). Then combine it with the timestamp check of (2) so that it doesn't race with newer deployments. This will still race with `create-in-progress`, as such a stack might not even have existed during the scan, and you would end up deleting the assets of an about-to-be-created stack. Again, the age check of (2) could remedy this. If there are multiple pipelines (thus multiple stacks) deploying to the same account+region, you will need to configure X to be the worst-case maximum among all the pipelines reaching that stage (obviously - but just stating).
Probably not going to happen, but if we had a per-stack asset bucket prefix (or an entire bucket) and ECR repository, then we could simplify all this in the case of pipeline deployments by having a post-deployment action that cleans all these assets blindly by referring to the just-succeeded stack. We wouldn't have to bother about other stacks in the account, and it would only happen after a successful update of a stack (once the stack is "stable") - so no need to worry about in-progress states. This, combined with the "age" workaround, might be the best option? The drawback is that if you have several stacks/templates that actually reference the same asset, this would lead to asset duplication - but maybe that's OK?
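A sketch of the "only collect when nothing is mid-deployment" guard described above, assuming the AWS SDK for JavaScript v3; the status check and the helper name are illustrative. Even when this passes, assets younger than the age threshold from (2) would still be left alone to avoid racing with not-yet-executed changesets:

```ts
import { CloudFormationClient, ListStacksCommand } from '@aws-sdk/client-cloudformation';

// Returns true only when no stack in the account/region is in an *_IN_PROGRESS state,
// so a collection run does not race with an ongoing deployment or rollback.
async function safeToCollect(): Promise<boolean> {
  const cfn = new CloudFormationClient({});
  let nextToken: string | undefined;
  do {
    const page = await cfn.send(new ListStacksCommand({ NextToken: nextToken }));
    for (const stack of page.StackSummaries ?? []) {
      if (stack.StackStatus?.endsWith('_IN_PROGRESS')) {
        return false; // an in-progress stack may still roll back to a template we haven't seen
      }
    }
    nextToken = page.NextToken;
  } while (nextToken);
  return true;
}
```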
I think this is a great start; I like how you consider this a step-1/building block for `gc`, which many of my clients using CDK are increasingly needing to support.
See https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/README.md for a working construct that does asset garbage collection.
@jogold - thank you for this construct. I tried it out on my personal CDK project and it worked great! I think this is an excellent proof-of-concept and would love to see it eventually integrated into the official CDK project.
@jogold thanks for the POC for asset garbage collection! This RFC is something that we have on our radar for later this year and when the time comes I'd be happy to iterate on this and get it into the CDK.
Hello @kaizencc / others, I've been following this for a long time and thought about pinging to find out if this is now being worked on, as you mentioned that it'll happen later in the year and we are now nearing the end of the year.
The problem I'm currently facing is that various compliance software highlights that there are insecure images in the ECR - it turns out they are pretty similar to what the automatic ECR scans provide (especially the ones labelled …). This of course turns out to be a big problem for orgs which need to stick to such compliance checks, so we end up cross-referencing the CFn template and manually deleting the old images that are no longer in use.
Assuming the problems mentioned here are real and not due to my lack of knowledge (please correct me otherwise, else I might just be overthinking all this), here's an algorithm that could work in practice:
…

In the absence of being able to atomically get both the template and the stack status, this should practically work, though it has theoretical edge cases. The most interesting here is the transition from …

Same with …

Edge case: …

The rest should be practically OK. There is a theoretical chance that the latency between calls …

All other cases, such as you missed collecting a stack because it was …

This doesn't deal with …

Alternatives: …
I'm currently coding this algorithm for my project with an additional check to retry …
This is a request for comments about Asset Garbage Collection. See #64 for
additional details.
APIs are signed off by @rix0rrr.
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache-2.0 license