
aws_eks_addon creation race condition #20404

Closed
paulgear opened this issue Aug 2, 2021 · 9 comments · Fixed by #20562
Labels
bug Addresses a defect in current functionality. service/eks Issues and PRs that pertain to the eks service.

paulgear commented Aug 2, 2021

Description

When it is created too soon after the EKS cluster (presumably before or during node group creation), the aws_eks_addon resource doesn't always create correctly.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform v1.0.0
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.52.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.0.2
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

  • aws_eks_addon

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

resource "aws_eks_cluster" "main" {
  name    = var.cluster_name
  version = var.cluster_version

  role_arn = aws_iam_role.cluster.arn
  ...
  tags = var.tags

  depends_on = [
    aws_cloudwatch_log_group.main,
    aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,
    aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
  ]
}

resource "aws_eks_addon" "coredns" {
  cluster_name  = aws_eks_cluster.main.cluster_name
  addon_name    = "coredns"
}

Debug Output

This will be provided later if needed, once I've redacted it sufficiently.

Panic Output

n/a

Expected Behavior

DEGRADED seems to be a normal state during the initial creation of EKS add-ons when the cluster is still new. The provider should wait long enough for the add-on to transition from DEGRADED to ACTIVE.

Actual Behavior

Error when applying initial configuration:

Error: unexpected EKS Add-On (CLUSTERNAME:coredns) state returned during creation: unexpected state 'DEGRADED', wanted target 'ACTIVE'. last error: %!s(<nil>)

A second apply works fine.

Steps to Reproduce

  1. terraform apply

Important Factoids

  • Adding a manual dependency on the node group resource avoids this race (see the sketch below).
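
A minimal sketch of that workaround, assuming a managed node group resource named aws_eks_node_group.main (the resource name is illustrative):

resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "coredns"

  # Wait for worker nodes before installing coredns, so the add-on
  # doesn't sit in DEGRADED with no nodes to schedule its pods on.
  depends_on = [aws_eks_node_group.main]
}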

References

github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/eks Issues and PRs that pertain to the eks service. labels Aug 2, 2021

wcarlsen commented Aug 3, 2021

@paulgear we see this issue too, but adding the manual dependency on the node group resource doesn't work for us. Do you have any more insights?


paulgear commented Aug 3, 2021

@wcarlsen Maybe try a cluster readiness check like this? https://github.com/cmdlabs/cmd-tf-aws-eks/blob/master/cluster/auth-cm.tf#L22
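
A readiness check of that general shape, sketched with a null_resource and the AWS CLI (the linked module's exact implementation may differ, and a locally available AWS CLI is an assumption):

resource "null_resource" "cluster_ready" {
  # Block until the control plane reports ACTIVE before creating
  # anything that talks to the cluster (add-ons, aws-auth, etc.).
  provisioner "local-exec" {
    command = "aws eks wait cluster-active --name ${aws_eks_cluster.main.name}"
  }

  depends_on = [aws_eks_cluster.main]
}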


z0rc commented Aug 3, 2021

Actually, it's quite possible to create an EKS cluster with add-ons but without any workers; at least, the AWS Web Console does this when creating a cluster. Obviously the coredns deployment will be in a degraded state until some worker nodes are available.

I think there are two ways to handle this:

  • In Terraform code, by adding something like depends_on = [aws_eks_node_group.workers] to the coredns aws_eks_addon resource.
  • In the provider, by allowing the DEGRADED state in the resource configuration or by extending the error handling.


wcarlsen commented Aug 3, 2021

Thanks for the input @paulgear, but I still didn't manage to get it working. I also tried @z0rc's suggestion of adding a dependency between the node group workers and the CoreDNS add-on, without any luck. I guess we will have to wait for the latter fix and rely on the good old "double apply" trick.


tkjwa commented Aug 3, 2021

I've been having the same issue since yesterday; I had a working run before the weekend. From my TF Cloud run history on July 30th 2021, 3:30:40 pm:

...
aws_eks_addon.k8s_vpc_addon: Creating...
aws_eks_addon.k8s_vpc_addon: Creation complete after 3s [id=platform-staging:vpc-cni]
aws_eks_addon.k8s_proxy_addon: Creation complete after 6s [id=platform-staging:kube-proxy]
aws_eks_addon.k8s_coredns_addon: Still creating... [10s elapsed]
aws_eks_addon.k8s_coredns_addon: Creation complete after 17s [id=platform-staging:coredns]
aws_eks_node_group.node_group: Creating...
aws_eks_node_group.node_group: Still creating... [10s elapsed]
aws_eks_node_group.node_group: Still creating... [20s elapsed]
aws_eks_node_group.node_group: Still creating... [30s elapsed]
...
aws_eks_node_group.node_group: Creation complete after 3m7s [id=platform-staging:platform-node-group-staging]

Apply complete! Resources: 14 added, 0 changed, 0 destroyed.

We can see that the add-ons were created before the node group without any error. Since yesterday, I also get the following:

unexpected EKS Add-On (platform-staging:coredns) state returned during creation: unexpected state 'DEGRADED', wanted target 'ACTIVE'. last error: %!s()

If I add a dependency from the add-on definition to the node group, the creation goes fine, but I end up with some ENIs and SGs left over after the cluster deletion :(


ghost commented Aug 5, 2021

I'm also having the same issue, but with a slightly different use case.

Last week I was able to provision the add-on and then patch the deployment to run on Fargate (https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html#fargate-gs-coredns); unfortunately this is no longer possible due to the following error:

unexpected EKS Add-On (example:coredns) state returned during creation: unexpected state 'DEGRADED', wanted target 'ACTIVE'. last error: %!s()

I've also been able to replicate this using older versions of the provider (such as v3.47.0).

Edit: in this case the EKS cluster is Fargate-only, with no node groups.
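
For the Fargate-only case, the patch step from the linked AWS guide can be wrapped in Terraform along these lines (a sketch; the pod execution role, subnet variable, local kubectl setup, and resource names are all assumptions):

resource "aws_eks_fargate_profile" "kube_system" {
  cluster_name           = aws_eks_cluster.main.name
  fargate_profile_name   = "kube-system"
  pod_execution_role_arn = aws_iam_role.fargate_pod_execution.arn
  subnet_ids             = var.private_subnet_ids

  selector {
    namespace = "kube-system"
  }
}

resource "null_resource" "patch_coredns" {
  # Remove the ec2 compute-type annotation so coredns pods can be
  # scheduled onto Fargate, per the AWS getting-started guide.
  provisioner "local-exec" {
    command = "kubectl patch deployment coredns -n kube-system --type json -p '[{\"op\": \"remove\", \"path\": \"/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type\"}]'"
  }

  depends_on = [aws_eks_fargate_profile.kube_system, aws_eks_addon.coredns]
}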

ewbankkit added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Aug 5, 2021
abstrask added a commit to dfds/infrastructure-modules that referenced this issue Aug 9, 2021

abstrask commented Aug 9, 2021

We use unmanaged node groups (i.e. plain auto scaling groups) controlled by one Terraform module, and manage EKS add-ons through another module. This workaround seems to do the trick:

In our main EKS cluster module:

module "eks_addons" {
  source = "../../_sub/compute/eks-addons"
  depends_on = [
    module.eks_cluster,
    module.eks_nodegroup1_workers,
    module.eks_nodegroup2_workers
  ] # added explicit dependencies on node group modules, as a workaround to dfds/cloudplatform#380 and hashicorp/terraform-provider-aws#20404

  ...
}

In our un-managed node group sub-module:

resource "aws_autoscaling_group" "eks" {
  ...

  provisioner "local-exec" {
    command = "sleep 60" # added arbitrary delay to allow ASG to spin up instances, as a workaround to dfds/cloudplatform#380 and hashicorp/terraform-provider-aws#20404
  }
}
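
An alternative to the local-exec sleep, assuming the hashicorp/time provider is acceptable, is a time_sleep resource that the add-ons module can depend on (a sketch; names are illustrative):

resource "time_sleep" "wait_for_workers" {
  # Give the ASG time to launch instances and join them to the
  # cluster before the add-ons module runs.
  create_duration = "60s"

  depends_on = [aws_autoscaling_group.eks]
}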

See also dfds/infrastructure-modules#276.

abstrask pushed a commit to dfds/infrastructure-modules that referenced this issue Aug 10, 2021
avnes added a commit to dfds/infrastructure-modules that referenced this issue Aug 12, 2021
* Workaround/fix for dfds/cloudplatform#380 and hashicorp/terraform-provider-aws#20404

* Re-enable QA destroy steps

Co-authored-by: abstrask <rask.misc@gmail.com>
Co-authored-by: Rasmus Rask <raras@dfds.com>
github-actions bot added this to the v3.55.0 milestone Aug 18, 2021
@github-actions

This functionality has been released in v3.55.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
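
To pick up the fix, the provider constraint can be bumped to at least that release, for example:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.55.0"
    }
  }
}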

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Sep 19, 2021