Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1Pager: stateful resources #398

Closed
wants to merge 8 commits into from
Closed
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
258 changes: 258 additions & 0 deletions text/stateful-resources-one-pager.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
# 1Pager | Modeling Stateful Resources

This document is a one pager meant to surface the idea of explicitly modeling
stateful resources in the core framework. Its goal is to conduct a preliminary
discussion and gather feedback before it potentially progresses into a full blown RFC.

## Customer pain

Customers are currently exposed to unintentional data loss when stateful resources
are designated for either removal or replacement by CloudFormation. Data, similarly
to security postures, is an area in which even a single rare mistake can
cause catastrophic outages.

The CDK can and should help reduce the risk of such failures. With respect to
security, the CDK currently defaults to blocking deployments that contain
changes in security postures, requiring a user confirmation:

```console
This deployment will make potentially sensitive changes according to your current security approval level (--require-approval broadening).
Please confirm you intend to make the following modifications:
...
...
Do you wish to deploy these changes (y/n)?
```

However, no such mechanism exists for changes that might result in data loss,
i.e removal or replacement of stateful resources, such as `S3` buckets, `DynamoDB` tables, etc...

## Failure scenarios

To understand how susceptible customers are to this, we outline a few
scenarios where such data loss can occur.

### Stateful resource without `DeletionPolicy/UpdateReplacePolicy`

By default, CloudFormation will **delete** resources that are removed from the stack,
or when a property that requires replacement is changed.

> - [UpdateReplacePolicy](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatereplacepolicy.html)
> - [DeletionPolicy](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-deletionpolicy.html)

To retain stateful resources, authors must remember to configure those policies
with the `Retain` value. Having to remember this makes it also easy to forget.
This means that stateful resources might be shipped with the incorrect policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this has been implemented in aws/aws-cdk#12920

Copy link
Contributor Author

@iliapolo iliapolo Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look.


### Policy Change

CDK applications are often comprised out of many third-party resources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what this paragraph is trying to say.

Even if a third-party resource is initially shipped with the correct policy, this may change.
Whether or not the policy change was intentional is somewhat irrelevant, it can
still be undesired and have dire implications on the consuming application.

For that matter, even policy changes made by the application author itself
can be unexpected or undesired.

## Desired Customer Experience

The experience described here lays under the following assumptions:

1. The correct policy for stateful resources is always `Retain`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be more appropriate? Presumably there are cases users want a stateful resource without "Retain", such as buckets holding cached data that can be easily re-generated, etc. -- otherwise, we shouldn't expose it in the L2 at all.

Suggested change
1. The correct policy for stateful resources is always `Retain`.
1. The correct default policy for stateful resources is always `Retain`.

2. With data loss, its better to ask for permission than ask for forgiveness.

As mentioned before, we should provide an experience similar to the one we have
with respect to security posture changes. We propose that when deploying or
destroying a stack, in case it causes the removal of stateful resources, the user will see:

```console
The following stateful resources will be removed:

- MyBucket (Removed from stack)
- MyDynamoTable (Changing property `TableName` requires a replacement)

Do you wish to continue (y/n)?
```

> Note that we want to warn the users about possible replacements as well, not just removals.

If this is desired, the user will confirm. And if this behavior is expected to repeat,
the user can use a new `allow-removal` flag, to skip the interactive confirmation:

`cdk deploy --allow-removal MyBucket --allow-removal MyDynamoTable`

An additional feature we can provide is to detect when a deployment will change the policy of stateful resources. i.e:

```console
The removal policy of the following stateful resources will change:

- MyBucket (RETAIN -> DESTROY)
- MyDynamoTable (RETAIN -> DESTROY)

Do you wish to continue (y/n)?
```

Blocking these types of changes will keep CloudFormation templates in a safe state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CloudFormation Stack Policies will be able to prevent updates. We would mark certain resources as "protected" in the CDK application.

Of course, we will need a way to indicate that you are okay with making certain changes.

  • The simplest solution is to stop marking resources as "protected" in the template, and do a regular deployment.
    • In fact, in practice it would take 2 deployments: (1) switch off the protection (2) remove the resource.
  • Slightly more advanced would be a CLI confirmation prompt, but then the CLI needs to be able to look into and parse and modify stack policies. Also it needs extra effort in a CI/CD pipeline.

Otherwise, if such changes are allowed to be deployed, out of band operations
via CloudFormation will still pose a risk.

## High Level Design

In order to provide any of the features described above, the core framework must be able
to differentiate between stateful and stateless resources.

> **Side note**: Explicitly differentiating stateful resources from stateless ones is a
> common practice in infrastructure modeling. For example, Kubernetes
> uses the [`StatefulSet`](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/)
> resource for this.
> And terraform has the [`prevent_destroy`](https://www.terraform.io/language/meta-arguments/lifecycle#prevent_destroy) property to mark stateful resources.
Comment on lines +103 to +107
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful context 👍


### API

We'd like to extend our API so it enforces every resource marks itself as either
stateful or stateless. To that end, we propose adding the following properties to
the `CfnResourceProps` interface:

```ts
export interface CfnResourceProps {
/**
* Whether or not this resource is stateful.
*/
readonly stateful: boolean;

/**
* Which properties cause replacement of this resource.
* Required if the resource is stateful, encouraged otherwise.
*
* @default undefined
*/
readonly replacedIf?: string[];
}
```

As well as corresponding getters:

```ts
/**
* Represents a CloudFormation resource.
*/
export class CfnResource extends CfnRefElement {

public get stateful: boolean;

public get replacedIf?: string[];

constructor(scope: Construct, id: string, props: CfnResourceProps) {
super(scope, id);
this.stateful = props.stateful;
this.replacedIf = props.replacedIf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think replacedIf is probably unnecessary when using stack policies.

}
}
```

### CodeGen

Adding these properties will break the generated L1 resources. We need to add some
logic into `cfn2ts` so that it generates constructors that pass the appropriate values.
For that, we manually curate a list of stateful resources and check that into our source code, i.e:

+ stateful-resources.json

> The key is the resource type, and the value is the list of properties that require replacement.

```json
{
"AWS::S3::Bucket": ["BucketName"],
"AWS::DynamoDB::Table": ["TableName"],
}
```

Using this file, we can generate the correct constructor, for example:

```ts
export class CfnBucket extends cdk.CfnResource implements cdk.IInspectable {

constructor(scope: cdk.Construct, id: string, props: CfnBucketProps = {}) {
super(scope, id, {
stateful: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go the "stack policies" route, we need a way to turn off the protection, in a way that is user controllable (when they finally do want to get rid of a stateful resource).

Maybe it's a sort of removalPolicy though:

  • RemovalPolicy.DESTROY
  • RemovalPolicy.RETAIN
  • RemovalPolicy.PROTECTED

When you want to get rid of it, you switch to RemovalPolicy.DESTROY first, then remove it.

replacedIf: ["BucketName"],
type: CfnBucket.CFN_RESOURCE_TYPE_NAME,
properties: props
});
}
}
```

### Synthesis

Now that all L1 resources are correctly marked and can be inspected, we add
logic to the synthesis process to write this metadata to the cloud assembly.
For example, a stack called `my-stack`, defining an `S3` bucket with
id `MyBucket`, will contain the following in `manifest.json`:

```json
{
"metadata": {
"/my-stack/MyBucket/Resource": [
{
"stateful": true,
"replacedIf": ["BucketName"],
}
],
}
}
```

### Deployment

During deployment, we inspect the diff and identify which changes are about
to occur on stateful resources in the stack. This closes the loop and should
allow for the [customer experience](#desired-customer-experience) we described.

## Q & A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might not be enough room to include this since it's not a full RFC, but I think it could be helpful to add a short section on what alternatives were considered to addressing the customer pain and why they were not chosen over the suggested solution. These could be alternatives to the proposed CLI experience, or alternatives to the proposed API (like adding a IStateful interface instead), etc.


### Why do we need the API change given we have the curated file?

Its true that the existence of the file allows us to codegen marked L1 resources
Comment on lines +213 to +215
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify what "the curated file" refers to, since this is a bit confusing if you read the document out of order

without needing to make changes to the input API. However, this won't
cover the use-case of **stateful custom resources**. As we know, custom resources may
also be stateful. This scenario is in fact what sparked this discussion.

> See [s3: toggling off auto_delete_objects for Bucket empties the bucket](https://github.com/aws/aws-cdk/issues/16603)

Since every [CustomResource] eventually instantiates a `CfnResource`,
changing this API will also make sure custom resources are marked appropriately.

In addition, enforcing this on the API level will also be beneficial for third-party
constructs that might be extending from `CfnResource`.

[CustomResource]: https://github.com/aws/aws-cdk/blob/master/packages/@aws-cdk/core/lib/custom-resource.ts#L123

### What about CDK Pipelines?

This document only refers to a deployment workflow using `cdk deploy`. I have yet
to research how this would be done in CDK pipelines. My initial thoughts are to somehow incorporate
`cdk diff` in the pipeline, along with a manual approval stage.

### Would this have prevented [#16603](https://github.com/aws/aws-cdk/issues/16603)?

Yes. If the API proposed here existed, we would have passed `stateful: true` in the
implementation of the `autoDeleteObjects` property. Then, when the resource was
removed due to the toggle, we would identify this as a removal of a stateful resource, and block it.

### Curating a list of stateful resources is a maintenance burden - can we do it differently?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the file should be straightforward and infrequent. So I think it's only a burden to the extent that people can forget to update it when creating a new stateful resource. Is there any mechanism to remind developers to check this (like a lint rule)?


We might be able to leverage the work already done by cfn-lint, i.e [StatefulResources.json].
However, we also need the list of properties that require replacement for these resources.
We can probably collaborate with cfn-lint on this.

[StatefulResources.json]: https://github.com/aws-cloudformation/cfn-lint/blob/main/src/cfnlint/data/AdditionalSpecs/StatefulResources.json

### Isn't this a breaking change? How are we supposed to roll it out?

Yes. Making the `stateful` property required will break some customers.
There are two options I can think of:

1. Release it as v3. We should in general consider not being so hesitant about releasing new major versions.
We've seen with v2 that as long as migrations are not too painful, our customers are willing to
perform such upgrades.
2. Make the `stateful` property optional and perform runtime validations controlled by a feature flag.