Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(ec2): default Vpc structure results in broken networking during create/delete #21348

Closed
laurelmay opened this issue Jul 28, 2022 · 6 comments · Fixed by #21495
Closed

(ec2): default Vpc structure results in broken networking during create/delete #21348

laurelmay opened this issue Jul 28, 2022 · 6 comments · Fixed by #21495
Labels
@aws-cdk/aws-ec2 Related to Amazon Elastic Compute Cloud bug This issue is a bug. effort/medium Medium work item – several days of effort p2

Comments

@laurelmay
Copy link
Contributor

laurelmay commented Jul 28, 2022

Describe the bug

When creating a VPC with the default configuration, in some situations, the NAT Gateway(s) may be torn down before the private subnets are. This can cause issues for resources that rely on network egress in order to successfully (such as a Lambda-backed CloudFormation Custom Resource). Additionally, the Private subnets (and resources that depend on them) may be created before the NAT Gateways. This can result in broken initialization logic.

Expected Behavior

It should not be possible for the NAT Gateway to be deleted before the private subnets. The private subnets should depend on the Gateway.

Current Behavior

The NAT Gateway resource, the public subnets, and the internet gateway are deleted.

Reproduction Steps

import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as lambda from "aws-cdk-lib/aws-lambda";

const app = new cdk.App();
const stack = new cdk.Stack(app, "TestStack");
const vpc = new ec2.Vpc(stack, "TestVpc");

const fn = new lambda.Function(stack, "TestFn", {
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_NAT },
  // Other stuff to require internet access
});

Building out the necessary infrastructure from here to actually create a custom resource to demonstrate is somewhat non-trivial and would e able to be copied/pasted.

Possible Solution

Whether directly or indirectly, the private subnet should depend on the NAT Gateway. This could also be done by depending on the route. This should be handled already in configureSubnet.

public configureSubnet(subnet: PrivateSubnet) {
const az = subnet.availabilityZone;
const gatewayId = this.gateways.pick(az);
subnet.addRoute('DefaultRoute', {
routerType: RouterType.NAT_GATEWAY,
routerId: gatewayId,
enablesInternetConnectivity: true,
});

But the Route Table Association has the references to the Subnet and the Route Table, and the Routes reference the Route Table and the target (NAT Gateway). But there's no reference from the Subnet to any of those resources, so there is no implicit dependency, which breaks network egress for resources in private subnets as the stack is being deleted.

Maybe this issue is limited to Lambda Functions instead of being a larger problem with EC2? Do more resources need to add a dependency on Vpc.internetConnectivityEstablished?

Additional Information/Context

No response

CDK CLI Version

2.33.0

Framework Version

No response

Node.js Version

16

OS

Linux

Language

Typescript

Language Version

No response

Other information

No response

@laurelmay laurelmay added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 28, 2022
@github-actions github-actions bot added the @aws-cdk/aws-ec2 Related to Amazon Elastic Compute Cloud label Jul 28, 2022
@laurelmay laurelmay changed the title (ec2): Private subnets don't depend on NAT Gateway (ec2): default Vpc structure results in broken networking during create/delete Jul 28, 2022
@corymhall
Copy link
Contributor

@kylelaker it sounds like this may be an issue specific to CloudFormation custom resources since
those are the only types of resources that you would care about successfully executing while
destroying the stack.

I think the solution here would be to specify the dependency on internetConnectivityEstablished
within the lambda.

This issue has been classified as p2. That means a workaround is available or it is deemed a nice-to-have feature. Given the amount of work there is to do and the relative priority of this issue, the CDK team is unlikely to address it. That does not mean the issue will never be fixed! If someone from the community submits a PR to fix this issue, and the PR is small and straightforward enough, and meets the quality bars to be reviewed and merged with little effort we will accept that PR. PRs that do not build or need complex or multiple rounds of reviews are unlikely to be merged and will be closed to keep our backlog manageable.

In the mean time, remember that you can always use the escape hatch (https://docs.aws.amazon.com/cdk/v2/guide/cfn_layer.html) mechanism to have fine control over the CloudFormation output you want. We will keep the issue open for discoverability, to collect upvotes, and to facilitate discussion around this topic.

We use +1s on this issue to help prioritize our work, and are happy to re-evaluate the prioritization of this issue based on community feedback. You can reach out to the cdk.dev community on Slack (https://cdk.dev/) to solicit support for reprioritization.

@corymhall corymhall added effort/medium Medium work item – several days of effort p2 and removed needs-triage This issue or PR still needs to be triaged. labels Aug 1, 2022
@corymhall corymhall removed their assignment Aug 1, 2022
@laurelmay
Copy link
Contributor Author

laurelmay commented Aug 8, 2022

So I think that's part of it. But there's something even weirder happening when specifying a dependency on internetConnectivityEstablished for a private subnet. For example:

#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as lambda from "aws-cdk-lib/aws-lambda";

const app = new cdk.App();
const stack = new cdk.Stack(app, "Ec2PrivateStack");
const vpc = new ec2.Vpc(stack, "SampleVpc");
const vpcSubnets = { subnetType: ec2.SubnetType.PRIVATE_WITH_NAT };

const fn = new lambda.SingletonFunction(stack, "Function", {
  code: lambda.Code.fromInline(`module.exports = { handler: function() { console.log("Hello, world"); } }`),
  runtime: lambda.Runtime.NODEJS_16_X,
  handler: "index.handler",
  uuid: "fa5088ab-850a-4106-a657-482d78dccf68",
  lambdaPurpose: "Hello",
  vpc,
  vpcSubnets,
});

fn.addDependency(vpc.selectSubnets(vpcSubnets).internetConnectivityEstablished);

So the dependency on the private routes gets set, but it seems like other dependencies aren't really happening in a way that results in success. It works for creation but then the stack fails to delete. You can see that the public subnet resources start to get deleted immediately. So the private subnet routes and the NAT Gateway may still exist but they have no routes to the IGW. And from there because the Lambda function takes some time to delete (well, it actually deletes pretty quickly but CloudFormation takes ~20min to notice but that feels like an unrelated CloudFormation bug) other resources keep deleting in weird orders and the whole process ends up failing with a lingering Internet Gateway and VPC with "dependent resources".

This means the given workaround does not work. And it makes #21357 pretty impossible to implement correctly.

I think the fix here would be to make the CfnNatGateway defined at

public addNatGateway(eipAllocationId?: string) {
// Create a NAT Gateway in this public subnet
const ngw = new CfnNatGateway(this, 'NATGateway', {
subnetId: this.subnetId,
allocationId: eipAllocationId ?? new CfnEIP(this, 'EIP', {
domain: 'vpc',
}).attrAllocationId,
});
return ngw;
}
depend not only on the subnet (via the implicit dependency caused by the reference to its ID) but also to add an explicit dependency on this.internetConnectivityEstablished; otherwise every resource that needs internet connectivity in a PRIVATE_WITH_NAT subnet also has to add a dependency on the public subnets that have NAT Gateways which is a hard filter to add.

The stack events from building and deleting this stack can be seen at https://gist.github.com/b4ecd79a8be6e57e229a34ccdc34e220. I think I ran that command a little to early so the final VPC delete failure isn't there but I think the point is clear.

I will open a PR to add the dependency to the NAT Gateway but let me know if that's not the right approach.

@rix0rrr
Copy link
Contributor

rix0rrr commented Aug 8, 2022

I would need to see the dependency structure here to see where we can add a dependency line most conveniently without introducing a cycle.

I'm guessing right now internetConnectivityEstablished results in a dependency on the route tables... but I'm not exactly sure how the CFN resources are laid out.

@rix0rrr
Copy link
Contributor

rix0rrr commented Aug 8, 2022

I believe it's this.

  • For public subnets (and public subnets only), internetConnectivityEstablished contains the IGW, which is actually unnecessary (the dependency is transitively implied by the dependency on the route).
  • internetConnectivityEstablished is missing a dependency on the SubnetRouteTableAssociation. Without this dependency, the route table can be in pristine condition, but be detached from the subnet (and hence not have effect anymore). It's interesting this doesn't happen more often.
  • And finally, custom resources should add a dependency from the Function to the internetConnectivityEstablished of their subnets, when added to a VPC.

I guess the safe solution is to have Function add the dependency to itself by default. I'm a little concerned about pessimizing deployment speed for Functions that don't need to execute during the deployment (as basically only Custom Resources need to do this). But the lost safety might not be worth the micro-optimization.

image

@laurelmay
Copy link
Contributor Author

I guess the safe solution is to have Function add the dependency to itself by default. I'm a little concerned about pessimizing deployment speed for Functions that don't need to execute during the deployment (as basically only Custom Resources need to do this).

I tried to think through this to see if I could identify a better solution than doing this for all Functions; but while this can be done for Provider and AwsCustomResource (well, that one only matters if #21357 is merged), there's still CustomResourceProvider/CustomResource in core (which are primarily used in cases where adding VpcConfig would be tricky anyway) that are less important.

But then I realized that theoretically a Lambda function that is used as a Custom Resource (and therefore would get the dependency via Provider) could end up invoking other Lambda functions (either directly or indirectly) that are also then going to (potentially) require internet connectivity.

And it feels feasible to document "if you use a Function directly or indirectly as part of a Custom Resource, make sure to add a dependency on internet connectivity". But that seems like a lot of unnecessary copying and pasting (and potential uncertainty).

So I agree that just doing the safer thing by default feels most correct and is closer to what I'd expect as an end user.

@mergify mergify bot closed this as completed in #21495 Aug 22, 2022
mergify bot pushed a commit that referenced this issue Aug 22, 2022
…21495)

Because private subnets rely on a NAT Gateway for internet connectivity,
it is important that the NAT Gateway have the necessary dependencies for
its own internet connectivity. Otherwise,
`internetConnectivityEstablished` on a private subnet may not be true
during stack creation and deletion. This is most notable for
CloudFormaton Custom Resources; however, it can result in other
dependency failures during stack deletion, especially if resources
within a private subnet take a long time to delete.

Ensuring that the NAT Gateway depends on its public subnet having
internet connectivity completes the chain of dependencies and ensures
that all resources will correctly have internet connectivity.

Because of the layers of abstraction around subnets and NAT gateways,
unit tests for this feature are challenging (because there isn't a clear
means to get the CloudFormaton Logical ID of the AWS::EC2::Route that
establishes the connectivity); however, NAT Gateways are included in
several integration tests so this dependency can be tested there.

Closes: #21348

----

### All Submissions:

* [X] Have you followed the guidelines in our [Contributing guide?](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md)

### Adding new Unconventional Dependencies:

* [ ] This PR adds new unconventional dependencies following the process described [here](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md/#adding-new-unconventional-dependencies)

### New Features

* [ ] Have you added the new feature to an [integration test](https://github.com/aws/aws-cdk/blob/main/INTEGRATION_TESTS.md)?
	* [ ] Did you use `yarn integ` to deploy the infrastructure and generate the snapshot (i.e. `yarn integ` without `--dry-run`)?

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

josephedward pushed a commit to josephedward/aws-cdk that referenced this issue Aug 30, 2022
…ws#21495)

Because private subnets rely on a NAT Gateway for internet connectivity,
it is important that the NAT Gateway have the necessary dependencies for
its own internet connectivity. Otherwise,
`internetConnectivityEstablished` on a private subnet may not be true
during stack creation and deletion. This is most notable for
CloudFormaton Custom Resources; however, it can result in other
dependency failures during stack deletion, especially if resources
within a private subnet take a long time to delete.

Ensuring that the NAT Gateway depends on its public subnet having
internet connectivity completes the chain of dependencies and ensures
that all resources will correctly have internet connectivity.

Because of the layers of abstraction around subnets and NAT gateways,
unit tests for this feature are challenging (because there isn't a clear
means to get the CloudFormaton Logical ID of the AWS::EC2::Route that
establishes the connectivity); however, NAT Gateways are included in
several integration tests so this dependency can be tested there.

Closes: aws#21348

----

### All Submissions:

* [X] Have you followed the guidelines in our [Contributing guide?](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md)

### Adding new Unconventional Dependencies:

* [ ] This PR adds new unconventional dependencies following the process described [here](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md/#adding-new-unconventional-dependencies)

### New Features

* [ ] Have you added the new feature to an [integration test](https://github.com/aws/aws-cdk/blob/main/INTEGRATION_TESTS.md)?
	* [ ] Did you use `yarn integ` to deploy the infrastructure and generate the snapshot (i.e. `yarn integ` without `--dry-run`)?

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@aws-cdk/aws-ec2 Related to Amazon Elastic Compute Cloud bug This issue is a bug. effort/medium Medium work item – several days of effort p2
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants