This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

aws flake aws_nat_gateway.nat_gw[n]: index n out of range #1246

Closed
s-urbaniak opened this issue Jun 29, 2017 · 8 comments · Fixed by hashicorp/terraform-provider-aws#1053

@s-urbaniak
Contributor

Reported in #1054 (comment)

Error applying plan:
1 error(s) occurred:
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}
@s-urbaniak
Contributor Author

CI Jenkins log, filtered for lines that mention aws_nat_gateway or nat_eip:

$ egrep 'nat_eip|aws_nat_gateway' jenkins-tectonic-installer.prod.coreos.systems.txt 
+ module.vpc.aws_eip.nat_eip.0
+ module.vpc.aws_eip.nat_eip.1
+ module.vpc.aws_eip.nat_eip.2
+ module.vpc.aws_nat_gateway.nat_gw.0
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
+ module.vpc.aws_nat_gateway.nat_gw.1
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
+ module.vpc.aws_nat_gateway.nat_gw.2
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
module.vpc.aws_eip.nat_eip.1: Creating...
module.vpc.aws_eip.nat_eip.2: Creating...
module.vpc.aws_eip.nat_eip.1: Creation complete (ID: eipalloc-1314807a)
module.vpc.aws_eip.nat_eip.0: Creating...
module.vpc.aws_eip.nat_eip.2: Creation complete
module.vpc.aws_eip.nat_eip.0: Creation complete (ID: eipalloc-c268fcab)
module.vpc.aws_nat_gateway.nat_gw.1: Creating...
module.vpc.aws_nat_gateway.nat_gw.0: Creating...
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m0s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m0s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Creation complete (ID: nat-0a6cebc0930c8adb4)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Creation complete (ID: nat-0036bd8641f070425)
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

Notable observation: while the first two EIPs (nat_eip.0, nat_eip.1) have their allocation IDs from AWS printed, nat_eip.2 does not have an allocation ID printed:

module.vpc.aws_eip.nat_eip.1: Creation complete (ID: eipalloc-1314807a)
module.vpc.aws_eip.nat_eip.2: Creation complete
module.vpc.aws_eip.nat_eip.0: Creation complete (ID: eipalloc-c268fcab)

I'll have a look in Terraform; maybe there are different code paths for the creation events above.

@s-urbaniak
Contributor Author

s-urbaniak commented Jun 29, 2017

Another notable observation: TF doesn't even start (yet?) to create nat_gw.2.

@s-urbaniak
Contributor Author

Final investigation result: TF is perfectly happy if you don't set an ID for a resource you created. In that case the resource also won't show up in the state file; it simply doesn't "exist" from a graph perspective, which provokes the out of range error above.
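
To make the mechanism concrete, here is a toy model in Go. It is not Terraform code: the `resource` type and the `splatIDs` and `indexID` helpers are hypothetical names made up for illustration, and the assumption that an instance without an ID is simply missing from the splat list is taken from the observation above.

```go
package main

import "fmt"

// resource is a toy stand-in for a created resource instance (not a Terraform type).
type resource struct {
	Name string
	ID   string // an empty ID models a resource that was "created" without an ID
}

// splatIDs mimics, in simplified form, a list such as aws_eip.nat_eip.*.id:
// only instances that made it into the state (non-empty ID) contribute an entry.
func splatIDs(created []resource) []string {
	ids := []string{}
	for _, r := range created {
		if r.ID == "" {
			continue // no ID: the instance is absent from the state
		}
		ids = append(ids, r.ID)
	}
	return ids
}

// indexID models the ids[count.index] lookup done by each NAT gateway instance.
func indexID(ids []string, i int) (string, error) {
	if i < 0 || i >= len(ids) {
		return "", fmt.Errorf("index %d out of range for list (max %d)", i, len(ids))
	}
	return ids[i], nil
}

func main() {
	eips := []resource{
		{"nat_eip.0", "eipalloc-c268fcab"},
		{"nat_eip.1", "eipalloc-1314807a"},
		{"nat_eip.2", ""}, // the EIP whose ID was never set
	}
	ids := splatIDs(eips)
	// Three gateways are planned, but only two allocation IDs exist.
	for i := range eips {
		id, err := indexID(ids, i)
		if err != nil {
			fmt.Printf("nat_gw[%d]: %v\n", i, err)
			continue
		}
		fmt.Printf("nat_gw[%d]: allocation_id=%s\n", i, id)
	}
}
```

In this sketch the first two lookups succeed and the third fails, which matches the shape of the Jenkins log: nat_gw.0 and nat_gw.1 are created, while nat_gw[2] hits the out of range error.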

Searching through the code path, I found exactly one very suspicious spot, where resource_aws_eip.go explicitly sets the ID to the empty string, swallowing InvalidAllocationID.NotFound and InvalidAddress.NotFound errors from AWS and carrying on happily [1].

Iff AWS returns one of the above errors, all of the above Jenkins logs make perfect sense. After discussing with @alexsomesan: reading the EIP after creating it should be retryable, because a read call immediately after a create call does not necessarily return a valid EIP yet while the change is still propagating inside the AWS control plane.

The code in [1] should be implemented in a retryable fashion to be bulletproof.

[1] https://github.com/hashicorp/terraform/blob/v0.9.9/builtin/providers/aws/resource_aws_eip.go#L137-L140
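
A minimal sketch of what "retryable" could look like, in plain Go. This is not the patch that went upstream and it does not use the Terraform or AWS SDK APIs: `readFunc`, `readWithRetry` and `errNotFound` are hypothetical names, and the attempt count and delay are arbitrary assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for AWS errors such as InvalidAllocationID.NotFound.
var errNotFound = errors.New("InvalidAllocationID.NotFound")

// readFunc is a hypothetical read call that looks up an EIP by allocation ID.
type readFunc func(allocationID string) error

// readWithRetry calls read until it succeeds, returns an error other than
// errNotFound, or attempts (which must be >= 1) are exhausted. A not-found
// result right after creation is treated as "not propagated yet" instead of
// "gone", so the caller never clears the resource ID on the first miss.
func readWithRetry(read readFunc, allocationID string, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = read(allocationID)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errNotFound) {
			return err // unrelated errors are not retried
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("EIP %s still not found after %d attempts: %w", allocationID, attempts, err)
}

func main() {
	calls := 0
	// Simulated API: the EIP only becomes visible on the third read.
	read := func(id string) error {
		calls++
		if calls < 3 {
			return errNotFound
		}
		return nil
	}
	err := readWithRetry(read, "eipalloc-1314807a", 5, 10*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

The design choice is that only the not-found case is retried; any other error is returned immediately, and a persistent not-found is surfaced as an error to the caller.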

s-urbaniak pushed a commit to s-urbaniak/terraform that referenced this issue Jun 29, 2017
s-urbaniak pushed a commit to s-urbaniak/terraform that referenced this issue Jun 29, 2017
@Quentin-M
Contributor

Great stuff! Will you make that a PR upstream?

@s-urbaniak
Contributor Author

@Quentin-M yes, once we verify on #1247 that we got rid of this very flake, I'll definitely push it upstream.

@squat
Contributor

squat commented Jul 14, 2017

@s-urbaniak this is solved by your flake-resistant fork, right? Can we close or are we tracking upstream?

@s-urbaniak
Contributor Author

@squat yes, this is solved in the fork. Nevertheless, I would suggest keeping this open until the upstream PR [1] is merged.

[1] hashicorp/terraform-provider-aws#1053

radeksimko pushed a commit to s-urbaniak/terraform-provider-aws that referenced this issue Jul 18, 2017
@s-urbaniak
Contributor Author

The upstream PR got merged, hence closing!
