-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1Pager: stateful resources #398
Changes from 7 commits
231b4e8
95d66f9
b0b5a0b
3d250bf
8e98f06
c5d463a
5d49c27
5059d26
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,258 @@ | ||||||
# 1Pager | Modeling Stateful Resources | ||||||
|
||||||
This document is a one pager meant to surface the idea of explicitly modeling | ||||||
stateful resources in the core framework. Its goal is to conduct a preliminary | ||||||
discussion and gather feedback before it potentially progresses into a full blown RFC. | ||||||
|
||||||
## Customer pain | ||||||
|
||||||
Customers are currently exposed to unintentional data loss when stateful resources | ||||||
are designated for either removal or replacement by CloudFormation. Data, similarly | ||||||
to security postures, is an area in which even a single rare mistake can | ||||||
cause catastrophic outages. | ||||||
|
||||||
The CDK can and should help reduce the risk of such failures. With respect to | ||||||
security, the CDK currently defaults to blocking deployments that contain | ||||||
changes in security postures, requiring a user confirmation: | ||||||
|
||||||
```console | ||||||
This deployment will make potentially sensitive changes according to your current security approval level (--require-approval broadening). | ||||||
Please confirm you intend to make the following modifications: | ||||||
... | ||||||
... | ||||||
Do you wish to deploy these changes (y/n)? | ||||||
``` | ||||||
|
||||||
However, no such mechanism exists for changes that might result in data loss, | ||||||
i.e removal or replacement of stateful resources, such as `S3` buckets, `DynamoDB` tables, etc... | ||||||
|
||||||
## Failure scenarios | ||||||
|
||||||
To understand how susceptible customers are to this, we outline a few | ||||||
scenarios where such data loss can occur. | ||||||
|
||||||
### Stateful resource without `DeletionPolicy/UpdateReplacePolicy` | ||||||
|
||||||
By default, CloudFormation will **delete** resources that are removed from the stack, | ||||||
or when a property that requires replacement is changed. | ||||||
|
||||||
> - [UpdateReplacePolicy](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatereplacepolicy.html) | ||||||
> - [DeletionPolicy](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-deletionpolicy.html) | ||||||
|
||||||
To retain stateful resources, authors must remember to configure those policies | ||||||
with the `Retain` value. Having to remember this makes it also easy to forget. | ||||||
This means that stateful resources might be shipped with the incorrect policy. | ||||||
|
||||||
### Policy Change | ||||||
|
||||||
CDK applications are often comprised out of many third-party resources. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really understand what this paragraph is trying to say. |
||||||
Even if a third-party resource is initially shipped with the correct policy, this may change. | ||||||
Whether or not the policy change was intentional is somewhat irrelevant, it can | ||||||
still be undesired and have dire implications on the consuming application. | ||||||
|
||||||
For that matter, even policy changes made by the application author itself | ||||||
can be unexpected or undesired. | ||||||
|
||||||
## Desired Customer Experience | ||||||
|
||||||
The experience described here lays under the following assumptions: | ||||||
|
||||||
1. The correct policy for stateful resources is always `Retain`. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would this be more appropriate? Presumably there are cases users want a stateful resource without "Retain", such as buckets holding cached data that can be easily re-generated, etc. -- otherwise, we shouldn't expose it in the L2 at all.
Suggested change
|
||||||
2. With data loss, its better to ask for permission than ask for forgiveness. | ||||||
|
||||||
As mentioned before, we should provide an experience similar to the one we have | ||||||
with respect to security posture changes. We propose that when deploying or | ||||||
destroying a stack, in case it causes the removal of stateful resources, the user will see: | ||||||
|
||||||
```console | ||||||
The following stateful resources will be removed: | ||||||
|
||||||
- MyBucket (Removed from stack) | ||||||
- MyDynamoTable (Changing property `TableName` requires a replacement) | ||||||
|
||||||
Do you wish to continue (y/n)? | ||||||
``` | ||||||
|
||||||
> Note that we want to warn the users about possible replacements as well, not just removals. | ||||||
|
||||||
If this is desired, the user will confirm. And if this behavior is expected to repeat, | ||||||
the user can use a new `allow-removal` flag, to skip the interactive confirmation: | ||||||
|
||||||
`cdk deploy --allow-removal MyBucket --allow-removal MyDynamoTable` | ||||||
|
||||||
An additional feature we can provide is to detect when a deployment will change the policy of stateful resources. i.e: | ||||||
|
||||||
```console | ||||||
The removal policy of the following stateful resources will change: | ||||||
|
||||||
- MyBucket (RETAIN -> DESTROY) | ||||||
- MyDynamoTable (RETAIN -> DESTROY) | ||||||
|
||||||
Do you wish to continue (y/n)? | ||||||
``` | ||||||
|
||||||
Blocking these types of changes will keep CloudFormation templates in a safe state. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. CloudFormation Stack Policies will be able to prevent updates. We would mark certain resources as "protected" in the CDK application. Of course, we will need a way to indicate that you are okay with making certain changes.
|
||||||
Otherwise, if such changes are allowed to be deployed, out of band operations | ||||||
via CloudFormation will still pose a risk. | ||||||
|
||||||
## High Level Design | ||||||
|
||||||
In order to provide any of the features described above, the core framework must be able | ||||||
to differentiate between stateful and stateless resources. | ||||||
|
||||||
> **Side note**: Explicitly differentiating stateful resources from stateless ones is a | ||||||
> common practice in infrastructure modeling. For example, Kubernetes | ||||||
> uses the [`StatefulSet`](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) | ||||||
> resource for this. | ||||||
> And terraform has the [`prevent_destroy`](https://www.terraform.io/language/meta-arguments/lifecycle#prevent_destroy) property to mark stateful resources. | ||||||
Comment on lines
+103
to
+107
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Useful context 👍 |
||||||
|
||||||
### API | ||||||
|
||||||
We'd like to extend our API so it enforces every resource marks itself as either | ||||||
stateful or stateless. To that end, we propose adding the following properties to | ||||||
the `CfnResourceProps` interface: | ||||||
|
||||||
```ts | ||||||
export interface CfnResourceProps { | ||||||
/** | ||||||
* Whether or not this resource is stateful. | ||||||
*/ | ||||||
readonly stateful: boolean; | ||||||
|
||||||
/** | ||||||
* Which properties cause replacement of this resource. | ||||||
* Required if the resource is stateful, encouraged otherwise. | ||||||
* | ||||||
* @default undefined | ||||||
*/ | ||||||
readonly replacedIf?: string[]; | ||||||
} | ||||||
``` | ||||||
|
||||||
As well as corresponding getters: | ||||||
|
||||||
```ts | ||||||
/** | ||||||
* Represents a CloudFormation resource. | ||||||
*/ | ||||||
export class CfnResource extends CfnRefElement { | ||||||
|
||||||
public get stateful: boolean; | ||||||
|
||||||
public get replacedIf?: string[]; | ||||||
|
||||||
constructor(scope: Construct, id: string, props: CfnResourceProps) { | ||||||
super(scope, id); | ||||||
this.stateful = props.stateful; | ||||||
this.replacedIf = props.replacedIf; | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think |
||||||
} | ||||||
} | ||||||
``` | ||||||
|
||||||
### CodeGen | ||||||
|
||||||
Adding these properties will break the generated L1 resources. We need to add some | ||||||
logic into `cfn2ts` so that it generates constructors that pass the appropriate values. | ||||||
For that, we manually curate a list of stateful resources and check that into our source code, i.e: | ||||||
|
||||||
+ stateful-resources.json | ||||||
|
||||||
> The key is the resource type, and the value is the list of properties that require replacement. | ||||||
|
||||||
```json | ||||||
{ | ||||||
"AWS::S3::Bucket": ["BucketName"], | ||||||
"AWS::DynamoDB::Table": ["TableName"], | ||||||
} | ||||||
``` | ||||||
|
||||||
Using this file, we can generate the correct constructor, for example: | ||||||
|
||||||
```ts | ||||||
export class CfnBucket extends cdk.CfnResource implements cdk.IInspectable { | ||||||
|
||||||
constructor(scope: cdk.Construct, id: string, props: CfnBucketProps = {}) { | ||||||
super(scope, id, { | ||||||
stateful: true, | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we go the "stack policies" route, we need a way to turn off the protection, in a way that is user controllable (when they finally do want to get rid of a stateful resource). Maybe it's a sort of
When you want to get rid of it, you switch to |
||||||
replacedIf: ["BucketName"], | ||||||
type: CfnBucket.CFN_RESOURCE_TYPE_NAME, | ||||||
properties: props | ||||||
}); | ||||||
} | ||||||
} | ||||||
``` | ||||||
|
||||||
### Synthesis | ||||||
|
||||||
Now that all L1 resources are correctly marked and can be inspected, we add | ||||||
logic to the synthesis process to write this metadata to the cloud assembly. | ||||||
For example, a stack called `my-stack`, defining an `S3` bucket with | ||||||
id `MyBucket`, will contain the following in `manifest.json`: | ||||||
|
||||||
```json | ||||||
{ | ||||||
"metadata": { | ||||||
"/my-stack/MyBucket/Resource": [ | ||||||
{ | ||||||
"stateful": true, | ||||||
"replacedIf": ["BucketName"], | ||||||
} | ||||||
], | ||||||
} | ||||||
} | ||||||
``` | ||||||
|
||||||
### Deployment | ||||||
|
||||||
During deployment, we inspect the diff and identify which changes are about | ||||||
to occur on stateful resources in the stack. This closes the loop and should | ||||||
allow for the [customer experience](#desired-customer-experience) we described. | ||||||
|
||||||
## Q & A | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There might not be enough room to include this since it's not a full RFC, but I think it could be helpful to add a short section on what alternatives were considered to addressing the customer pain and why they were not chosen over the suggested solution. These could be alternatives to the proposed CLI experience, or alternatives to the proposed API (like adding a |
||||||
|
||||||
### Why do we need the API change given we have the curated file? | ||||||
|
||||||
Its true that the existence of the file allows us to codegen marked L1 resources | ||||||
Comment on lines
+213
to
+215
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarify what "the curated file" refers to, since this is a bit confusing if you read the document out of order |
||||||
without needing to make changes to the input API. However, this won't | ||||||
cover the use-case of **stateful custom resources**. As we know, custom resources may | ||||||
also be stateful. This scenario is in fact what sparked this discussion. | ||||||
|
||||||
> See [s3: toggling off auto_delete_objects for Bucket empties the bucket](https://github.com/aws/aws-cdk/issues/16603) | ||||||
|
||||||
Since every [CustomResource] eventually instantiates a `CfnResource`, | ||||||
changing this API will also make sure custom resources are marked appropriately. | ||||||
|
||||||
In addition, enforcing this on the API level will also be beneficial for third-party | ||||||
constructs that might be extending from `CfnResource`. | ||||||
|
||||||
[CustomResource]: https://github.com/aws/aws-cdk/blob/master/packages/@aws-cdk/core/lib/custom-resource.ts#L123 | ||||||
|
||||||
### What about CDK Pipelines? | ||||||
|
||||||
This document only refers to a deployment workflow using `cdk deploy`. I have yet | ||||||
to research how this would be done in CDK pipelines. My initial thoughts are to somehow incorporate | ||||||
`cdk diff` in the pipeline, along with a manual approval stage. | ||||||
|
||||||
### Would this have prevented [#16603](https://github.com/aws/aws-cdk/issues/16603)? | ||||||
|
||||||
Yes. If the API proposed here existed, we would have passed `stateful: true` in the | ||||||
implementation of the `autoDeleteObjects` property. Then, when the resource was | ||||||
removed due to the toggle, we would identify this as a removal of a stateful resource, and block it. | ||||||
|
||||||
### Curating a list of stateful resources is a maintenance burden - can we do it differently? | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Modifying the file should be straightforward and infrequent. So I think it's only a burden to the extent that people can forget to update it when creating a new stateful resource. Is there any mechanism to remind developers to check this (like a lint rule)? |
||||||
|
||||||
We might be able to leverage the work already done by cfn-lint, i.e [StatefulResources.json]. | ||||||
However, we also need the list of properties that require replacement for these resources. | ||||||
We can probably collaborate with cfn-lint on this. | ||||||
|
||||||
[StatefulResources.json]: https://github.com/aws-cloudformation/cfn-lint/blob/main/src/cfnlint/data/AdditionalSpecs/StatefulResources.json | ||||||
|
||||||
### Isn't this a breaking change? How are we supposed to roll it out? | ||||||
|
||||||
Yes. Making the `stateful` property required will break some customers. | ||||||
There are two options I can think of: | ||||||
|
||||||
1. Release it as v3. We should in general consider not being so hesitant about releasing new major versions. | ||||||
We've seen with v2 that as long as migrations are not too painful, our customers are willing to | ||||||
perform such upgrades. | ||||||
2. Make the `stateful` property optional and perform runtime validations controlled by a feature flag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this has been implemented in aws/aws-cdk#12920
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a look.