aws_ec2: seems impossible to change subnets in a vpc once deployed #28369

Open
anentropic opened this issue Dec 14, 2023 · 8 comments
Labels: @aws-cdk/aws-ec2 (Related to Amazon Elastic Compute Cloud), bug (This issue is a bug.), effort/medium (Medium work item – several days of effort), p2

Comments

@anentropic

anentropic commented Dec 14, 2023

Describe the bug

I started out with a vpc containing only a PRIVATE_ISOLATED subnet

Later I needed to add PRIVATE_WITH_EGRESS and PUBLIC subnets in order to add NAT gateway

But this fails with: The CIDR '10.0.2.0/24' conflicts with another subnet

I tried a few things to get around this, eventually I tried in two stages 1) setting up a whole second vpc with the three subnets and 2) move resources to the second vpc and drop the original one

But this has problems too: AWS::RDS::DBSubnetGroup | "The new Subnets are not in the same Vpc as the existing subnet group (Service: Rds, Status Code: 400 ...)

For various reasons I cannot afford to destroy the stack and recreate it

Expected Behavior

It should be possible to modify the stack and then deploy the changes

Current Behavior

Stuck in a dead end

Reproduction Steps

  1. Create a VPC:
        vpc = ec2.Vpc(
            self,
            "VPC",
            max_azs=2,
            enable_dns_hostnames=True,
            enable_dns_support=True,
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="Isolated subnet",
                    subnet_type=ec2.SubnetType.PRIVATE_ISOLATED,
                ),
            ],
            nat_gateways=0,
        )

and add resources like RDS MySQL db and Lambda functions to it

  2. Modify the VPC:
        vpc = ec2.Vpc(
            self,
            "VPC",
            max_azs=2,
            enable_dns_hostnames=True,
            enable_dns_support=True,
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="Isolated subnet",
                    subnet_type=ec2.SubnetType.PRIVATE_ISOLATED,
                ),
                ec2.SubnetConfiguration(
                    name="NAT subnet",
                    subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
                ),
                ec2.SubnetConfiguration(
                    name="Public subnet",
                    subnet_type=ec2.SubnetType.PUBLIC,
                ),
            ],
            nat_gateways=1,
        )

I tried with and without explicitly adding cidr_mask=24 specifier to the subnets

  3. 😞
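
One possible explanation for the conflict: the CDK docs say that when no cidr_mask is given, the available IP space is evenly divided across subnets. The sketch below is a simplified model of that behaviour using only Python's standard ipaddress module. It is not CDK's actual allocator; `even_split` is a hypothetical helper, and 10.0.0.0/16 is assumed as the VPC CIDR. It shows how going from 2 subnets to 6 changes every block, and how a /24 like the one in the error message can land inside a block that is already deployed.

```python
import ipaddress


def even_split(vpc_cidr: str, count: int) -> list:
    """Split vpc_cidr into `count` equal blocks (simplified model, hypothetical helper)."""
    network = ipaddress.ip_network(vpc_cidr)
    # Smallest number of extra prefix bits that yields at least `count` blocks.
    extra_bits = (count - 1).bit_length()
    blocks = network.subnets(prefixlen_diff=extra_bits)
    return [next(blocks) for _ in range(count)]


# Assumed layout: 1 subnet group x 2 AZs, then 3 subnet groups x 2 AZs.
before = even_split("10.0.0.0/16", 2)
after = even_split("10.0.0.0/16", 6)

print([str(n) for n in before])
print([str(n) for n in after])

# The /24 from the error message, checked against the blocks assumed to be deployed.
candidate = ipaddress.ip_network("10.0.2.0/24")
conflicts = [str(n) for n in before if candidate.overlaps(n)]
print(conflicts)
```

If that assumption holds, the replacement subnets overlap the ones that are still in place while CloudFormation creates the new ones, which would be consistent with the "conflicts with another subnet" error.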

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.114.1 (build 02bbb1d)

Framework Version

2.114.1

Node.js Version

v18.18.0

OS

macOS 14.1

Language

Python

Language Version

3.11

Other information

Right now I'd just love suggestions for a workaround even if underlying issue can't be fixed any time soon

@anentropic anentropic added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 14, 2023
@github-actions github-actions bot added the @aws-cdk/aws-ec2 Related to Amazon Elastic Compute Cloud label Dec 14, 2023
@pahud
Contributor

pahud commented Dec 14, 2023

Yes the ec2.Vpc class does not have very granular control like that and you will need some customization. We had similar discussion here for your reference.

#24708 (comment)

@pahud pahud added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Dec 14, 2023
@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 15, 2023
@anentropic
Author

> Yes the ec2.Vpc class does not have very granular control like that and you will need some customization. We had similar discussion here for your reference.
>
> #24708 (comment)

@pahud I don't see how the comment in the linked issue helps me

I am not looking for more granular control, I'm just asking for CDK to be able to successfully deploy my changes

@anentropic
Author

anentropic commented Jan 10, 2024

I ended up not needing to add new subnets to my VPC

but later I found I needed to change the CIDR of the VPC to be compatible with other VPCs in the organization (had previously used the default CIDR)

this ends up at the same problem, adding ip_addresses=ec2.IpAddresses.cidr("10.74.64.0/18"), to the VPC definition results in:

09:27:32 | UPDATE_FAILED        | AWS::RDS::DBSubnetGroup                     | DatabaseMySQLSubnetGroup9C077452
Resource handler returned message: "The new Subnets are not in the same Vpc as the existing subnet group (Service: Rds, S
tatus Code: 400, Request ID: 4beb7250-61e2-43e6-b6de-7cda9c489fce)" (RequestToken: a71eb84e-589c-2d05-2fde-c8e1b05ab64a,
HandlerErrorCode: InvalidRequest)

The problem seems to be centred around the RDS db

For that I have:

        rds.DatabaseInstance(
            self,
            id="MySQL",
            engine=rds.DatabaseInstanceEngine.mysql(
                version=rds.MysqlEngineVersion.VER_8_0
            ),
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(
                subnet_type=ec2.SubnetType.PRIVATE_ISOLATED
            ),
            ...
        )

The error makes it sound like the deployment is creating a new VPC, and new subnets, and then maybe there is an implicit SubnetGroup already created for the RDS db, and what fails is trying to put the new subnets in that existing SubnetGroup?
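
One way to check that guess before deploying is to look at which resources CloudFormation plans to replace. A minimal sketch, assuming the change set has already been fetched as a dict in the shape returned by CloudFormation's DescribeChangeSet. `resources_to_replace` is a hypothetical helper, and the sample data is made up for illustration rather than taken from this stack.

```python
def resources_to_replace(change_set: dict) -> list:
    """Return (logical id, resource type) pairs that would or might be replaced."""
    flagged = []
    for change in change_set.get("Changes", []):
        resource = change.get("ResourceChange", {})
        # For Modify actions, Replacement is "True", "False" or "Conditional".
        if resource.get("Replacement") in ("True", "Conditional"):
            flagged.append((resource["LogicalResourceId"], resource["ResourceType"]))
    return flagged


# Made-up sample in the DescribeChangeSet response shape (hypothetical logical ids).
sample_change_set = {
    "Changes": [
        {"Type": "Resource", "ResourceChange": {
            "Action": "Modify", "LogicalResourceId": "ExampleVpc",
            "ResourceType": "AWS::EC2::VPC", "Replacement": "True"}},
        {"Type": "Resource", "ResourceChange": {
            "Action": "Modify", "LogicalResourceId": "ExampleSubnetGroup",
            "ResourceType": "AWS::RDS::DBSubnetGroup", "Replacement": "False"}},
        {"Type": "Resource", "ResourceChange": {
            "Action": "Add", "LogicalResourceId": "ExampleNatGateway",
            "ResourceType": "AWS::EC2::NatGateway"}},
    ]
}

replaced = resources_to_replace(sample_change_set)
print(replaced)
```

In a sample like this one, a VPC flagged for replacement next to a subnet group that is only updated in place would point at the mismatch the RDS error describes.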

It sounds like the same cause as this issue: hashicorp/terraform-provider-aws#27459

and also hashicorp/terraform-provider-aws#16419

I will see if I can find a workaround along the lines described ... by creating a new SubnetGroup for the new VPC subnets and migrating the RDS db to use that.

@anentropic
Author

anentropic commented Jan 10, 2024

First attempt failed:

        subnet_group = rds.SubnetGroup(
            self,
            id="MySQL DB Subnet-v1",  # increment this if you need a new subnet group for VPC changes
            description="Manually-defined SubnetGroup to allow VPC modifications.",
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(
                subnet_type=ec2.SubnetType.PRIVATE_ISOLATED
            ),
        )
        rds_instance = rds.DatabaseInstance(
            self,
            id="MySQL",
            engine=rds.DatabaseInstanceEngine.mysql(
                version=rds.MysqlEngineVersion.VER_8_0
            ),
            vpc=vpc,
            subnet_group=subnet_group,
            ...
        )

This gives:

12:50:51 | UPDATE_FAILED        | AWS::RDS::DBProxy                           | DatabaseMySQLRDSProxyEE7D6A9D
CloudFormation cannot update a stack when a custom-named resource requires replacing. Rename websiteeudevDatabaseMySQLRDS
Proxy0574BC81 and update the stack again.

Ok right, I also have an RDS Proxy

updating this to change the id:

        rds_proxy = rds_instance.add_proxy(
            "RDS Proxy-v1",  # increment this if you make changes to the VPC (see SubnetGroup)
            secrets=[self.db_admin_credentials],
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(
                subnet_type=ec2.SubnetType.PRIVATE_ISOLATED
            ),
            debug_logging=proxy_debug_logging,
            security_groups=[self.db_connection_sg],
        )

...gets further and looks promising. But then fails with:

13:01:09 | UPDATE_FAILED        | Custom::VpcRestrictDefaultSG                | VPCRestrictDefault...omResource59474679
Received response status [FAILED] from custom resource. Message returned: UnauthorizedOperation: You are not authorized t
o perform this operation. User: arn:aws:sts::570110252051:assumed-role/website-eu-dev-CustomVpcRestrictDefaultSGCustomRes
o-KU7ITRpgff47/website-eu-dev-CustomVpcRestrictDefaultSGCustomRes-U9AGv76Ecy0w is not authorized to perform: ec2:Authoriz
eSecurityGroupIngress on resource: arn:aws:ec2:eu-west-1:570110252051:security-group/sg-0e96c711c3c931153 because no iden
tity-based policy allows the ec2:AuthorizeSecurityGroupIngress action.

This seems related to the restrict_default_security_group property of the Vpc construct, which defaults to True... which seems to involve a custom resource with a Lambda function.

I guessed it might help to give the VPC a new id, to try and force it to generate a new Custom::VpcRestrictDefaultSG instead of apparently re-using the old one that didn't have permissions against the new VPC (?)

But unfortunately the previous error had left my stack in UPDATE_ROLLBACK_FAILED state, so attempts to proceed further just give:

 ❌  website-eu-dev failed: Error [ValidationError]: Stack:arn:aws:cloudformation:eu-west-1:570110252051:stack/website-eu-dev/757ef9d0-9877-11ee-a570-0e4f49b86157
is in UPDATE_ROLLBACK_FAILED state and can not be updated.

@trobert2

I think this is true for any resources created using default values for the Vpc construct. If you have the need to replace a NAT Gateway or change an IP, it would be hard to do so as these resources are not exposed. It would be extremely useful to get access to them

@anentropic
Author

anentropic commented Jan 11, 2024

@trobert2 I think in the end I have conflated two related but slightly different issues

My original problem was adding subnets to an existing VPC, which seems impossible currently.

I tried to work around that by creating a second VPC in CDK code and moving my resources into it, but then I hit the second problem which seems to be around the RDS db construct and subnet groups.

Later I came back and tried to change the CIDR for my VPC... this seems to implicitly create a new VPC and migrate resources to it (which is maybe what CF/CDK should do in the first case too) but that meant it ran into the second problem again, which is what I have described in more detail in comments above.

Finally I tried to work around the RDS problem (by trying to help/trick CF into doing what it needs to do) but ended up bricking my stack into a UPDATE_ROLLBACK_FAILED state. I tried to recover that via the tips here https://stackoverflow.com/a/72755589/202168 and by manually updating resources to make them "rollback-able" but it didn't seem to be getting anywhere and in the end I gave up and deleted the stack and redeployed it.

TBH I am kind of anxious about using this IaC tooling in production now.

Are there any escape hatches for e.g. providing a manual deployment plan? My understanding is that even if I manually updated all my resources into desired state via AWS web console, CF/CDK would still think they were in bad state and refuse to deploy. Is there a way to tell CDK "just assume everything's ok now" and reset its state?
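
On the UPDATE_ROLLBACK_FAILED part specifically: CloudFormation has a ContinueUpdateRollback operation that accepts a list of resources to skip. Skipped resources are marked UPDATE_COMPLETE and the rollback carries on, so it is closer to "assume these particular resources are fine" than to a general state reset. A minimal sketch, assuming a boto3-style CloudFormation client is passed in. `continue_rollback` and `RecordingClient` are hypothetical names, the stack name and logical id are placeholders, and the stub stands in for the real client so nothing here talks to AWS.

```python
def continue_rollback(cfn_client, stack_name: str, skip_logical_ids=()):
    """Ask CloudFormation to resume a failed rollback, optionally skipping resources."""
    kwargs = {"StackName": stack_name}
    if skip_logical_ids:
        # Skipped resources are treated as rolled back without being touched.
        kwargs["ResourcesToSkip"] = list(skip_logical_ids)
    return cfn_client.continue_update_rollback(**kwargs)


class RecordingClient:
    """Stub exposing the same method name as the boto3 CloudFormation client."""

    def __init__(self):
        self.calls = []

    def continue_update_rollback(self, **kwargs):
        self.calls.append(kwargs)
        return {}


stub = RecordingClient()
continue_rollback(stub, "example-stack", ["ExampleLogicalId"])
print(stub.calls)
```

With real credentials the first argument would be something like `boto3.client("cloudformation")`. Skipping a resource can leave the stack's recorded state out of step with what actually exists, so it probably wants a follow-up deploy to reconcile.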

IaC tooling seems conceptually similar to database migration tooling. The latter usually provides both auto-generated (from ORM models, akin to CDK code) and manually-defined migration scripts. Is there anything like that for CDK?

@trobert2

> @trobert2 I think in the end I have conflated two related but slightly different issues
>
> My original problem was adding subnets to an existing VPC, which seems impossible currently.
>
> I tried to work around that by creating a second VPC in CDK code and moving my resources into it, but then I hit the second problem which seems to be around the RDS db construct and subnet groups.
>
> Later I came back and tried to change the CIDR for my VPC... this seems to implicitly create a new VPC and migrate resources to it (which is maybe what CF/CDK should do in the first case too) but that meant it ran into the second problem again, which is what I have described in more detail in comments above.
>
> Finally I tried to work around the RDS problem (by trying to help/trick CF into doing what it needs to do) but ended up bricking my stack into a UPDATE_ROLLBACK_FAILED state. I tried to recover that via the tips here https://stackoverflow.com/a/72755589/202168 and by manually updating resources to make them "rollback-able" but it didn't seem to be getting anywhere and in the end I gave up and deleted the stack and redeployed it.
>
> TBH I am kind of anxious about using this IaC tooling in production now.
>
> Are there any escape hatches for e.g. providing a manual deployment plan?
>
> IaC tooling seems conceptually similar to database migration tooling. The latter usually provides both auto-generated (from ORM models, akin to CDK code) and manually-defined migration scripts. Is there anything like that for CDK?

I see. Well in that case the issue isn't really the IaC. Even if you create a VPC manually with a CIDR, give it a subnet and create an instance in it, be that RDS or otherwise, you will have a hard time changing the network stack under the machine. Even if possible, there's a lot of operational steps to get something like this done that the tools can't really take care of for the user. I don't think CDK, Cloudformation or any other tool can help with that. Some AWS resources are just more static in nature.

I do feel like CDK could expose more of these resources so that SOME operations can be possible. A lot of those are hidden now. Like in the example with a NAT gateway, it cannot be accessed on the Vpc object.

As for manually changing those resources and then expecting CloudFormation (which sits underneath CDK) to keep track of them: you can look at drift detection and see how far the tooling around that helps, but the declarative nature of the system means it expects changes to happen in code first. You can set the retain policy on the resources you care about (removal_policy=RemovalPolicy.RETAIN) and just delete the stack. Then maybe you can redesign your stack and create a new one that imports the resources that are no longer tracked by any other stack. This is advanced stuff. Tread carefully.
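
For reference, RemovalPolicy.RETAIN shows up in the synthesized template as DeletionPolicy: Retain (plus UpdateReplacePolicy: Retain) on the resource. A minimal sketch of the same idea applied directly to a template dict, using only the standard library. `mark_retained` is a hypothetical helper and the template is a made-up example, not this stack.

```python
import copy


def mark_retained(template: dict, resource_types: set) -> dict:
    """Return a copy of the template with Retain policies on the given resource types."""
    result = copy.deepcopy(template)
    for resource in result.get("Resources", {}).values():
        if resource.get("Type") in resource_types:
            # Keep the physical resource on stack deletion and on replacement.
            resource["DeletionPolicy"] = "Retain"
            resource["UpdateReplacePolicy"] = "Retain"
    return result


# Made-up template with hypothetical logical ids.
template = {
    "Resources": {
        "ExampleDb": {"Type": "AWS::RDS::DBInstance", "Properties": {}},
        "ExampleFn": {"Type": "AWS::Lambda::Function", "Properties": {}},
    }
}

retained = mark_retained(template, {"AWS::RDS::DBInstance", "AWS::S3::Bucket"})
print(retained["Resources"]["ExampleDb"]["DeletionPolicy"])
```

Once a retained resource has been released by deleting its stack, it can in principle be adopted by a new stack through CloudFormation resource import, which is the step that needs the most care.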

@anentropic
Author

anentropic commented Jan 11, 2024

> I see. Well in that case the issue isn't really the IaC. Even if you create a VPC manually with a CIDR, give it a subnet and create an instance in it, be that RDS or otherwise, you will have a hard time changing the network stack under the machine. Even if possible, there's a lot of operational steps to get something like this done that the tools can't really take care of for the user. I don't think CDK, Cloudformation or any other tool can help with that.

I'm not sure that's entirely true. The issues I ran into mostly look like bugs.

e.g. in the first issue CF/CDK said it couldn't do anything, but it seems like it actually needed to set up a new VPC and migrate resources over, which in other cases it is apparently willing to do.

and then when it does do that there is the second issue where one of the constructs does not cope with a detail of that migration, when it seems like it could do (e.g. tweaking the CDK code to get it to think about the changes differently allows for progress)

I don't mind making these tweaks, or splitting my changes across 2-3 phases. Similar things are required in case of some database migrations.

the more worrying part is when the rollback didn't work and it got in an unrecoverable state - this is essentially a second symptom of the bug which prevented the deployment from succeeding, i.e. some of the resources don't know how to cope with the changes that other resources are making (e.g. they make assumptions they shouldn't about what doesn't change)

and I'm not sure how useful the drift detection is, since when my stack was in UPDATE_ROLLBACK_FAILED state the drift detection said everything was "in sync" - it didn't provide any signals about how to get back in a state that can be rolled back or forward

Instead of deleting the whole stack I could have marked say the S3 bucket and RDS db with RemovalPolicy.RETAIN, then written a version of the CDK stack that imported these as pre-existing resources. But then I'm stuck like that forever aren't I? i.e. I can't go back to a version of the stack which defines the S3 bucket and RDS db from scratch without blowing them away and recreating them

But it defeats the point of IaC a bit if I have to manage half the stack manually because it's not safe to let the IaC manage it
