[Bug]: IPAM allocation fails with "InvalidIpamPoolAllocationId" #28913

Closed
AlexBarth13 opened this issue Jan 16, 2023 · 18 comments · Fixed by #29022
Labels: bug (addresses a defect in current functionality), eventual-consistency (pertains to eventual consistency issues), service/ipam (issues and PRs that pertain to the ipam service)

Comments

@AlexBarth13

AlexBarth13 commented Jan 16, 2023

Terraform Core Version

1.3.2

AWS Provider Version

4.32.0 and 4.50.0

Affected Resource(s)

aws_vpc_ipam_pool_cidr_allocation

Expected Behavior

I expected the IPAM allocation to be created successfully.

Actual Behavior

In our environment we are using a multi-account setup. The IPAM pools are created in one account and shared via RAM with another account. We run into the issue described below when we try to allocate a CIDR in the shared IPAM pool.

To create our IPAM pool allocation we are using this snippet in our code:

resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra" {
  count          = var.cf_subnet_infra_count
  ipam_pool_id   = var.ipam_pool_id
  netmask_length = 27
}

But immediately afterwards we get the following error:

Error: InvalidIpamPoolAllocationId.NotFound: The IPAM pool allocation (ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01) does not exist.
status code: 400, request id: 9683f21c-8972-4c40-8227-72f5c219e5d3

with aws_vpc_ipam_pool_cidr_allocation.vpc-ipam-pool-alloc-cidr-cf-subnet-infra[0],
on ipam_pool_allocations.tf line 1, in resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra":
1: resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra" {

When we run "aws ec2 get-ipam-pool-allocations --ipam-pool-id ipam-pool-0b325bb4efc6dacae", all IpamPoolAllocations are returned correctly.

  • We have not changed any of the Terraform code related to the IPAM pool allocations.
  • Three different people tried running the setup while assuming the role "workload-terraform-role" and ran into the issue too.
  • We also tried running aws_vpc_ipam_pool_cidr_allocation in a different AWS account and ran into the same issue.
  • On previous runs a couple of weeks or months ago, the same code correctly created the aws_vpc_ipam_pool_cidr_allocation without throwing the error.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra" {
  count          = var.cf_subnet_infra_count
  ipam_pool_id   = var.ipam_pool_id
  netmask_length = 27
}

Steps to Reproduce

  1. Create an IPAM pool
  2. Try to allocate a CIDR in the pool

Debug Output

2023-01-16T10:54:34.077+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Action=GetIpamPoolAllocations&IpamPoolAllocationId=ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01&IpamPoolId=ipam-pool-0b325bb4efc6dacae&Version=2016-11-15
2023-01-16T10:54:34.077+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: -----------------------------------------------------
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Response ec2/GetIpamPoolAllocations Details:
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: ---[ RESPONSE ]--------------------------------------
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: HTTP/1.1 400 Bad Request
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Connection: close
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Transfer-Encoding: chunked
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Cache-Control: no-cache, no-store
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Content-Type: text/xml;charset=UTF-8
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Date: Mon, 16 Jan 2023 09:54:33 GMT
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Server: AmazonEC2
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Strict-Transport-Security: max-age=31536000; includeSubDomains
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Vary: accept-encoding
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: X-Amzn-Requestid: 9683f21c-8972-4c40-8227-72f5c219e5d3
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: 
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: 
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: -----------------------------------------------------
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: <Response><Errors><Error><Code>InvalidIpamPoolAllocationId.NotFound</Code><Message>The IPAM pool allocation (ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01) does not exist.</Message></Error></Errors><RequestID>9683f21c-8972-4c40-8227-72f5c219e5d3</RequestID></Response>
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/GetIpamPoolAllocations failed, attempt 0/25, error InvalidIpamPoolAllocationId.NotFound: The IPAM pool allocation (ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01) does not exist.
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5:     status code: 400, request id: 9683f21c-8972-4c40-8227-72f5c219e5d3

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None




Alexander Barth (alexander.barth@mercedes-benz.com) on behalf of Mercedes-Benz Tech Innovation GmbH, Provider Information

AlexBarth13 added the bug and needs-triage labels Jan 16, 2023
github-actions bot added the service/ipam label Jan 16, 2023
@Tailzip
Contributor

Tailzip commented Jan 17, 2023

We're seeing similar issues with that resource as well (using 4.50.0).
The IPAM pool isn't shared via RAM in our case; all operations happen in the same AWS account.

First attempt

plan

# aws_vpc_ipam_pool_cidr_allocation.workload[0] will be created
+ resource "aws_vpc_ipam_pool_cidr_allocation" "workload" {
  + cidr                    = (known after apply)
  + description             = "some desc"
  + id                      = (known after apply)
  + ipam_pool_allocation_id = (known after apply)
  + ipam_pool_id            = "ipam-pool-xxxxxxxxxx"
  + netmask_length          = 25
  + resource_id             = (known after apply)
  + resource_owner          = (known after apply)
  + resource_type           = (known after apply)
}

apply

The apply fails with the following error message; however, if we check the AWS Console, the allocation has been created in the IPAM service.

Error: reading IPAM Pool CIDR Allocation (ipam-pool-alloc-xxxxxxxxxxx_ipam-pool-xxxxxxxxxx): couldn't find resource

Second attempt

plan

Resource is shown as tainted.

# aws_vpc_ipam_pool_cidr_allocation.workload[0] is tainted, so must be replaced
-/+ resource "aws_vpc_ipam_pool_cidr_allocation" "workload" {
  + cidr                    = (known after apply)
  ~ id                      = "ipam-pool-alloc-xxxxxxxxxxx_ipam-pool-xxxxxxxxxx" -> (known after apply)
  + ipam_pool_allocation_id = (known after apply)
  + resource_id             = (known after apply)
  + resource_owner          = (known after apply)
  + resource_type           = (known after apply)
    # (3 unchanged attributes hidden)
}

apply

Because the resource is tainted, it is being deleted, but that fails as well.

Error: deleting IPAM Pool CIDR Allocation (ipam-pool-alloc-xxxxxx_ipam-pool-xxxxx): InvalidParameterValue: The CIDR specified :  is not in proper format.

(^ not a typo in the error message; a value is missing)

EDIT: We've opened a case with AWS support in the meantime, as we believe this is likely to be an issue with the AWS IPAM service API rather than the provider. We were able to replicate the issue with the AWS CLI as well.
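
For anyone left with an allocation that exists in IPAM but cannot be deleted through Terraform, a manual cleanup through the CLI might look like the sketch below. It assumes the empty CIDR in the delete error comes from the failed read never having stored the cidr attribute in state, which nothing in this thread confirms. The function name release_allocation is hypothetical; it looks the CIDR up with get-ipam-pool-allocations and passes it to release-ipam-pool-allocation.

```shell
#!/bin/bash

# release_allocation POOL_ID ALLOCATION_ID
# Hypothetical helper: looks up the CIDR of an existing allocation and releases it.
release_allocation() {
    local pool_id="$1"
    local allocation_id="$2"
    local cidr

    # Ask IPAM for the CIDR, because the release call requires it.
    cidr="$(aws ec2 get-ipam-pool-allocations \
        --ipam-pool-id "$pool_id" \
        --ipam-pool-allocation-id "$allocation_id" \
        --query 'IpamPoolAllocations[0].Cidr' \
        --output text)" || return 1

    # With --output text the CLI prints "None" when the query matches nothing.
    if [ -z "$cidr" ] || [ "$cidr" = "None" ]; then
        echo "no CIDR found for $allocation_id" >&2
        return 1
    fi

    aws ec2 release-ipam-pool-allocation \
        --ipam-pool-id "$pool_id" \
        --ipam-pool-allocation-id "$allocation_id" \
        --cidr "$cidr"
}
```

It would be called as release_allocation ipam-pool-xxxxxxxxxx ipam-pool-alloc-xxxxxxxxxxx. The tainted resource would presumably still have to be removed from the Terraform state afterwards.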

@bebold-jhr

bebold-jhr commented Jan 17, 2023

We have the exact same problem as @Tailzip. We are referencing the cidr in a local. It seems that the local is being evaluated way too early. The terraform resource might indicate that the allocation is done, but it seems like it's still ongoing asynchronously in AWS.

We tried to remove the tainted resource and then import the resource, but that doesn't seem to work. Afterwards it showed us a completely new resource being created. Applying that will result in the mentioned error again ("couldn't find resource").
The import statement looks a little bit weird as well. The text says that the allocation id is used for the import, but the example only shows the resource.

EDIT: We verified the problem with 4.19.0, 4.46.0 and 4.48.0. It seems to have started last week as sporadic behavior, but it now happens consistently.

@justinretzolk
Member

Potentially related: #25300

justinretzolk added the eventual-consistency label and removed the needs-triage label Jan 17, 2023
@vgadde-mck

vgadde-mck commented Jan 17, 2023

Hi @Tailzip - Can you tell me the CLI steps to reproduce? When I run the following:
aws ec2 get-ipam-pool-allocations --ipam-pool-id POOL-ID --ipam-pool-allocation-id ALLOCATION-ID
I consistently get the correct result.

AWS will not provide support unless we show them that their CLI also fails.

@Tailzip
Contributor

Tailzip commented Jan 18, 2023

> Hi @Tailzip - Can you tell me the CLI steps to reproduce? When I run aws ec2 get-ipam-pool-allocations --ipam-pool-id POOL-ID --ipam-pool-allocation-id ALLOCATION-ID, I consistently get the correct result.
>
> AWS will not provide support unless we show them that their CLI also fails.

I've been running the following script, and the issue happens randomly after a couple of runs:

script.sh
#!/bin/bash

set -e

export AWS_REGION=eu-central-1
export AWS_DEFAULT_REGION=eu-central-1
export AWS_DEFAULT_OUTPUT=json

IPAM_POOL_ID="ipam-pool-xxxxxxxxxxxxxxxxx"
ALLOCATION_ID="$(aws ec2 allocate-ipam-pool-cidr --ipam-pool-id "$IPAM_POOL_ID" --netmask-length 25 --description 'troubleshoot' | jq -c -r '.IpamPoolAllocation.IpamPoolAllocationId')"

aws ec2 get-ipam-pool-allocations \
    --ipam-pool-id "$IPAM_POOL_ID" \
    --ipam-pool-allocation-id "$ALLOCATION_ID" \
    --no-cli-pager

@MiliDurasovic

MiliDurasovic commented Jan 18, 2023

> I've been running the following script, and the issue happens randomly after a couple of runs

Interesting. For me the script always runs through without any issues, while creation through Terraform still throws the error InvalidIpamPoolAllocationId.NotFound. The exact same IAM role is used in both cases.

Mili Durasovic mili.durasovic@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH
Provider Information

@AdamTylerLynch
Collaborator

> We have the exact same problem as @Tailzip. We are referencing the cidr in a local. It seems that the local is being evaluated way too early. The terraform resource might indicate that the allocation is done, but it seems like it's still ongoing asynchronously in AWS.
>
> We tried to remove the tainted resource and then import the resource, but that doesn't seem to work. Afterwards it showed us a completely new resource being created. Applying that will result in the mentioned error again ("couldn't find resource"). The import statement looks a little bit weird as well. The text says that the allocation id is used for the import, but the example only shows the resource.
>
> EDIT: We verified the problem with 4.19.0, 4.46.0 and 4.48.0. It seems to have started last week as sporadic behavior, but it now happens consistently.

@bebold-jhr can you please log a separate GitHub issue as a bug regarding your import observations? Thank you.

@bebold-jhr

@AdamTylerLynch I will do that. I just thought it was worth mentioning that using import was not a workaround for us.

@AdamTylerLynch
Collaborator

Appears to be related to #25300

@Tailzip
Contributor

Tailzip commented Jan 18, 2023

Got confirmation from IPAM service team via AWS Enterprise support that:

GetIpamPoolAllocations is eventually consistent with respect to AllocateIpamPoolCidr. However, AllocateIpamPoolCidr is strongly consistent with respect to other calls to AllocateIpamPoolCidr -- IPAM will not hand out overlapping space within a pool, even if customers call AllocateIpamPoolCidr several times simultaneously on the pool. The API is eventually consistent. We recommended retrying/waiting a couple seconds.

I guess we need #25300 to be resolved then 😄
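
Until the provider handles this, the recommendation to retry can be applied to the CLI reproduction script with a small wrapper. This is only a minimal sketch: the function name retry and the MAX_ATTEMPTS and DELAY_SECONDS variables are made up for illustration, the defaults are arbitrary rather than values recommended by AWS, and it assumes the CLI exits non-zero while the allocation is not yet visible (which the 400 response in the debug output above suggests).

```shell
#!/bin/bash

# retry CMD [ARGS...]
# Illustrative helper: re-runs a command until it succeeds or attempts run out.
# MAX_ATTEMPTS and DELAY_SECONDS are hypothetical knobs with arbitrary defaults.
retry() {
    local max_attempts="${MAX_ATTEMPTS:-10}"
    local delay="${DELAY_SECONDS:-2}"
    local attempt=1

    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage with the variables from script.sh (requires AWS credentials):
# retry aws ec2 get-ipam-pool-allocations \
#     --ipam-pool-id "$IPAM_POOL_ID" \
#     --ipam-pool-allocation-id "$ALLOCATION_ID" \
#     --no-cli-pager
```

A fixed delay keeps the sketch short; a real script might prefer a growing delay between attempts.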

@AdamTylerLynch
Collaborator

> Hi @Tailzip - Can you tell me the CLI steps to reproduce? When I run aws ec2 get-ipam-pool-allocations --ipam-pool-id POOL-ID --ipam-pool-allocation-id ALLOCATION-ID, I consistently get the correct result.
>
> AWS will not provide support unless we show them that their CLI also fails.

A quick clarification: AWS Enterprise Support does offer Third-Party Product support, including open source software such as Terraform. I agree that having a reproducible case in a script using the AWS CLI is certainly helpful, though not required.

AWS works with HashiCorp and the open source community to evaluate and prioritize issues as per the Terraform AWS Provider FAQ.

@JonasWieneke

JonasWieneke commented Jan 19, 2023

Further testing yielded the following results:

  • Test Case 1:
    The following snippet was executed in another account with which the IPAM pool was shared; the region for execution was eu-central-1:
resource "aws_vpc_ipam_pool_cidr_allocation" "ipam-test-allocation" {
  count          = 3
  ipam_pool_id   = var.ipam_shared_pool_172_id
  netmask_length = var.lb_subnet_netmask
}

resource "aws_vpc_ipam_pool_cidr_allocation" "ipam-test-allocation_2" {
  count          = 3
  ipam_pool_id   = var.ipam_shared_pool_172_id
  netmask_length = var.lb_subnet_netmask
}

The first run of the code fails for some of the resources. Repeated runs create progressively more of the resources successfully. It usually takes between 3 and 6 consecutive attempts to fully create all six resources.

  • Test Case 2:
    The following snippet was executed with a new account set. An IPAM pool is shared with the account running this script; the region for execution was eu-west-1:
resource "aws_vpc_ipam_pool_cidr_allocation" "ipam-test-allocation" {
  count          = 3
  ipam_pool_id   = var.ipam_shared_pool_172_id
  netmask_length = var.lb_subnet_netmask
}

resource "aws_vpc_ipam_pool_cidr_allocation" "ipam-test-allocation_2" {
  count          = 3
  ipam_pool_id   = var.ipam_shared_pool_172_id
  netmask_length = var.lb_subnet_netmask
}

This script ran without any errors.

From my point of view this could be a timing problem in the resourceIPAMPoolCIDRAllocationCreate function. It seems that the resource is already being read before the creation is fully completed.

Jonas Wieneke <jonas.wieneke@mercedes-benz.com>, Mercedes-Benz Tech Innovation GmbH
Provider Information

@AdamTylerLynch
Collaborator

I can successfully reproduce this in an acceptance test (AccTest). Working on a fix.

@kevinkupski
Contributor

kevinkupski commented Jan 19, 2023

> I can successfully reproduce this in an acceptance test (AccTest). Working on a fix.

I also just started on this issue and added this code block to ipam_pool_cidr_allocation.go:

	// Handle eventual consistency of the API by retrying the read until the allocation is visible.
	return resource.Retry(time.Minute, func() *resource.RetryError {
		err := resourceIPAMPoolCIDRAllocationRead(d, meta)

		// A not-found result right after creation is expected, so try again.
		if tfresource.NotFound(err) {
			return resource.RetryableError(fmt.Errorf("IPAM Pool CIDR Allocation (%s) not yet ready", d.Id()))
		}

		// Any other error is treated as permanent.
		if err != nil {
			return resource.NonRetryableError(err)
		}

		return nil
	})

We need this change urgently. Will you be working on this within the next few days, or should I open a PR? If the latter, could you share your test code?

@AdamTylerLynch
Collaborator

AdamTylerLynch commented Jan 20, 2023

Hello Kevin, thanks for putting the effort in for the sample code! In the provider we have mechanisms for retries and waiting (retries and waiters), and our PR guidelines suggest that we follow any existing patterns in the resource being modified.

I have added the mechanisms for retry and waiting to account for eventual consistency of the read operation, and I've added additional acceptance tests to verify cross-region pool CIDR allocation.

I assure you this is being worked on. The provider team does releases each Thursday.

@github-actions

This functionality has been released in v4.52.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators Feb 27, 2023