Security groups attached to EC2 instance are flapping #3103

Closed
joshuaspence opened this issue Jan 23, 2018 · 15 comments
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. upstream-terraform Addresses functionality related to the Terraform core binary.

Comments

@joshuaspence
Contributor

I apologize in advance that I am not able to provide reproduction steps for this bug. I have spent most of the day attempting to work out what is going on here (and, more importantly, how to reproduce this issue), but I have thus far come up empty-handed.

I am using Terraform 0.11.2 and version 1.7.0 of the AWS provider. I have a rather large Terraform stack, but I am having issues with resources created by the following code:

module "aphlict_instance" {
  source = "../../../modules/instance"

  ami_codename            = "xenial"
  ami_virtualization_type = "hvm"
  ami_storage_type        = "ebs"

  instance_type          = "${var.instance_type["phabricator_notifications"]}"
  vpc_security_group_ids = [
    "${aws_security_group.aphlict.id}",
    "${module.primary_stack.common_security_group}",
  ]
  subnet_ids             = "${module.primary_stack.private_subnet_ids}"
  iam_instance_profile   = "${module.aphlict_iam.instance_profile}"

  name        = "Phabricator Notifications"
  product     = "${local.product}"
  environment = "${local.environment}"
  puppet_role = "phabricator_notifications"

  count = 3
}

In the state file, the aws_instance resource created by our internal instance module looks like this (I have run terraform refresh to ensure the state file is up-to-date):

"aws_instance.instance.0": {
    "type": "aws_instance",
    "depends_on": [
        "local.tags",
        "module.ami",
        "module.ec2_instance_info",
        "module.user_data"
    ],
    "primary": {
        "id": "i-02421ca637656af29",
        "attributes": {
            "ami": "ami-885545eb",
            "associate_public_ip_address": "false",
            "availability_zone": "ap-southeast-2a",
            "disable_api_termination": "false",
            "ebs_block_device.#": "0",
            "ebs_optimized": "false",
            "ephemeral_block_device.#": "0",
            "iam_instance_profile": "phabricator_notifications-00887d7806bd92cce7f9e87976",
            "id": "i-02421ca637656af29",
            "instance_state": "running",
            "instance_type": "t2.small",
            "ipv6_addresses.#": "0",
            "key_name": "",
            "monitoring": "true",
            "network_interface.#": "0",
            "network_interface_id": "eni-a5f9d2df",
            "placement_group": "",
            "primary_network_interface_id": "eni-a5f9d2df",
            "private_dns": "ip-10-50-1-105.ap-southeast-2.compute.internal",
            "private_ip": "10.50.1.105",
            "public_dns": "",
            "public_ip": "",
            "root_block_device.#": "1",
            "root_block_device.0.delete_on_termination": "true",
            "root_block_device.0.iops": "100",
            "root_block_device.0.volume_id": "vol-0bbae462bdf205c5e",
            "root_block_device.0.volume_size": "8",
            "root_block_device.0.volume_type": "gp2",
            "security_groups.#": "0",
            "source_dest_check": "true",
            "subnet_id": "subnet-d4aaefb0",
            "tags.%": "6",
            "tags.Name": "Phabricator Notifications",
            "tags.environment": "stg",
            "tags.product": "tools",
            "tags.profile": "",
            "tags.role": "phabricator_notifications",
            "tags.stack": "phabricator",
            "tenancy": "default",
            "user_data": "3f22672ad4485e401937352232c688f4ea09ed99",
            "volume_tags.%": "0",
            "vpc_security_group_ids.#": "3",
            "vpc_security_group_ids.1060145146": "sg-7235af14",
            "vpc_security_group_ids.1572026055": "sg-01c89e66",
            "vpc_security_group_ids.881524209": "sg-a8108ecf"
        },
        "meta": {
            "e2bfb730-ecaa-11e6-8f88-34363bc7c4c0": {
                "create": 600000000000,
                "delete": 600000000000,
                "update": 600000000000
            },
            "schema_version": "1"
        },
        "tainted": false
    },
    "deposed": [],
    "provider": "provider.aws"
},

Running terraform plan shows the following changes to be applied:

  ~ module.aphlict_instance.aws_instance.instance[0]
      vpc_security_group_ids.#:          "3" => "4"
      vpc_security_group_ids.3281223323: "" => "sg-232c3a45"

Running terraform apply does seem to succeed:

module.aphlict_instance.aws_instance.instance[0]: Modifying... (ID: i-02421ca637656af29)
  vpc_security_group_ids.#:          "3" => "4"
  vpc_security_group_ids.3281223323: "" => "sg-232c3a45"
module.aphlict_instance.aws_instance.instance[0]: Modifications complete after 3s (ID: i-02421ca637656af29)

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

However, the instance now only has a single security group (sg-232c3a45). The pre-existing security groups (sg-7235af14, sg-01c89e66 and sg-a8108ecf) seem to have been removed.

> aws ec2 describe-instances --instance-ids i-02421ca637656af29 --query 'Reservations[0].Instances[0].SecurityGroups'
[
    {
        "GroupName": "monitoring-20180119005837068100000013", 
        "GroupId": "sg-232c3a45"
    }
]

This also matches what is stored in the state file:

"aws_instance.instance.0": {
    "type": "aws_instance",
    "depends_on": [
        "local.tags",
        "module.ami",
        "module.ec2_instance_info",
        "module.user_data"
    ],
    "primary": {
        "id": "i-02421ca637656af29",
        "attributes": {
            "ami": "ami-885545eb",
            "associate_public_ip_address": "false",
            "availability_zone": "ap-southeast-2a",
            "disable_api_termination": "false",
            "ebs_block_device.#": "0",
            "ebs_optimized": "false",
            "ephemeral_block_device.#": "0",
            "iam_instance_profile": "phabricator_notifications-00887d7806bd92cce7f9e87976",
            "id": "i-02421ca637656af29",
            "instance_state": "running",
            "instance_type": "t2.small",
            "ipv6_addresses.#": "0",
            "key_name": "",
            "monitoring": "true",
            "network_interface.#": "0",
            "network_interface_id": "eni-a5f9d2df",
            "placement_group": "",
            "primary_network_interface_id": "eni-a5f9d2df",
            "private_dns": "ip-10-50-1-105.ap-southeast-2.compute.internal",
            "private_ip": "10.50.1.105",
            "public_dns": "",
            "public_ip": "",
            "root_block_device.#": "1",
            "root_block_device.0.delete_on_termination": "true",
            "root_block_device.0.iops": "100",
            "root_block_device.0.volume_id": "vol-0bbae462bdf205c5e",
            "root_block_device.0.volume_size": "8",
            "root_block_device.0.volume_type": "gp2",
            "security_groups.#": "0",
            "source_dest_check": "true",
            "subnet_id": "subnet-d4aaefb0",
            "tags.%": "6",
            "tags.Name": "Phabricator Notifications",
            "tags.environment": "stg",
            "tags.product": "tools",
            "tags.profile": "",
            "tags.role": "phabricator_notifications",
            "tags.stack": "phabricator",
            "tenancy": "default",
            "user_data": "3f22672ad4485e401937352232c688f4ea09ed99",
            "volume_tags.%": "0",
            "vpc_security_group_ids.#": "1",
            "vpc_security_group_ids.3281223323": "sg-232c3a45"
        },
        "meta": {
            "e2bfb730-ecaa-11e6-8f88-34363bc7c4c0": {
                "create": 600000000000,
                "delete": 600000000000,
                "update": 600000000000
            },
            "schema_version": "1"
        },
        "tainted": false
    },
    "deposed": [],
    "provider": "provider.aws"
},
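
The comparison above can also be done mechanically. Below is a minimal Python sketch, assuming the flat `vpc_security_group_ids.<hash>` attribute layout shown in the state dumps and the JSON list printed by the `describe-instances` query. The function names are hypothetical and not part of Terraform or the AWS CLI; the sample values are copied from this report.

```python
import json


def state_security_groups(attributes):
    """Collect group IDs from flat state attributes, skipping the '.#' count key."""
    prefix = "vpc_security_group_ids."
    return {
        value
        for key, value in attributes.items()
        if key.startswith(prefix) and key != prefix + "#"
    }


def live_security_groups(describe_output):
    """Parse the JSON list printed by the describe-instances query above."""
    return {group["GroupId"] for group in json.loads(describe_output)}


# Attributes copied from the state dump taken before the apply.
before_apply = {
    "vpc_security_group_ids.#": "3",
    "vpc_security_group_ids.1060145146": "sg-7235af14",
    "vpc_security_group_ids.1572026055": "sg-01c89e66",
    "vpc_security_group_ids.881524209": "sg-a8108ecf",
}
# CLI output copied from the describe-instances call after the apply.
cli_output = (
    '[{"GroupName": "monitoring-20180119005837068100000013", '
    '"GroupId": "sg-232c3a45"}]'
)

# The plan promised the three existing groups plus the new one.
expected = state_security_groups(before_apply) | {"sg-232c3a45"}
actual = live_security_groups(cli_output)
missing = sorted(expected - actual)
print(missing)
```

Any non-empty `missing` list means the apply left the instance with fewer groups than the plan promised.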

Subsequently running terraform plan shows the opposite changes to be applied:

  ~ module.aphlict_instance.aws_instance.instance[0]
      vpc_security_group_ids.#:          "1" => "4"
      vpc_security_group_ids.1060145146: "" => "sg-7235af14"
      vpc_security_group_ids.1572026055: "" => "sg-01c89e66"
      vpc_security_group_ids.881524209:  "" => "sg-a8108ecf"
@jbardin jbardin added the bug Addresses a defect in current functionality. label Jan 23, 2018
@bflad bflad added the service/ec2 Issues and PRs that pertain to the ec2 service. label Jan 23, 2018
@peimanja

+1

@nikosmeds

+1

@scottybrisbane

+1

@dpacaud

dpacaud commented Jan 30, 2018

+1

@thlacroix
Contributor

+1

@rhardouin

+1

@abrechon

+1

@STRML

STRML commented Feb 6, 2018

This is an extremely dangerous bug as what is executed is not what's in the plan, and you can end up cutting off network traffic to a live instance. We can't isolate reproduction steps either. It appears that terraform is unable to pull a complete set of rules from some secgroups, and a modify operation fails. That modify operation failure appears to also infect the tainting of groups on an instance. I don't have better info than that just yet.

On further investigation we have the same issue. We have six SGs, but terraform keeps flapping us from "4" => "6", and "2" => "6" and each time we do, it goes from 2 -> 4 -> 2 SGs. Each time it does not indicate it will remove existing groups! It appears the API call the provider is using sets all SGs in a single call and it fails to include the existing ones in the payload.
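
To make that hypothesis concrete, here is a toy Python model. It assumes the update is a replace-all call, which is a guess from the observed behavior and not something confirmed from the provider code. `replace_groups` is a hypothetical stand-in, and the group IDs are borrowed from the original report.

```python
def replace_groups(current, payload):
    """Toy model of a replace-all update: the payload becomes the whole set."""
    # `current` is deliberately ignored; that is the point of replace-all semantics.
    return set(payload)


current = {"sg-7235af14", "sg-01c89e66", "sg-a8108ecf"}
added = {"sg-232c3a45"}

# Payload built only from the additions shown in the plan diff.
after_partial = replace_groups(current, added)

# Payload built from the full desired set.
after_full = replace_groups(current, current | added)

print(sorted(after_partial))
print(len(after_full))
```

Under this assumption, a payload that carries only the newly added group leaves the instance with that single group, which is the behavior we are seeing.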

@STRML

STRML commented Feb 6, 2018

Okay, we can replicate this across any of our instances by doing the following:

  1. Create a few instances with a few SGs
  2. Change the name of one of the SGs so it forces a recreate
  3. Witness another terraform bug on apply, where the SG is never deleted because Terraform doesn't know it needs to be removed from every instance it's associated with.
    3a. I usually remedy this by manually removing the SG, which causes the delete to go through.
  4. On a subsequent plan, TF notices the SG is missing and issues a plan and apply like this:
Terraform will perform the following actions:

  ~ aws_instance.xxx
      vpc_security_group_ids.#:         "5" => "6"
      vpc_security_group_ids.1866xxxx: "" => "sg-4d51xxxx"

....

aws_instance.xxx: Modifying... (ID: i-xxxx)
  vpc_security_group_ids.#:         "5" => "6"
  vpc_security_group_ids.1866xxxx: "" => "sg-4d51xxxx"

Applying this deletes all other security groups so only sg-4d51xxxx remains.

This is incredibly dangerous and should be triaged as a critical bug.

@dpacaud

dpacaud commented Feb 6, 2018

We have experienced the same behavior as described by @STRML, and we also think this bug should be triaged as critical.
We are thinking of downgrading the AWS provider version to mitigate this issue.

@c-nichols

Agreed, @dpacaud. We are deferring provider upgrades until this is addressed.

@cesc1989

cesc1989 commented Mar 2, 2018

Almost the same thing happened to me, though not as dangerous as in @STRML's case.

Terraform v0.11.0
provider.aws v1.3.0

terraform plan

(...)

Terraform will perform the following actions:

  ~ aws_instance.base_api_server_4
      vpc_security_group_ids.#:         "0" => "1"
      vpc_security_group_ids.610043384: "" => "sg-8a0632f6"

terraform apply

(...)

aws_instance.base_api_server_4: Modifying... (ID: i-0429b28b8bbcefdd9)
  vpc_security_group_ids.#:         "0" => "1"
  vpc_security_group_ids.610043384: "" => "sg-8a0632f6"
aws_instance.base_api_server_4: Modifications complete after 4s (ID: i-0429b28b8bbcefdd9)

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Then run plan again and...

terraform plan

(...)

Terraform will perform the following actions:

  ~ aws_instance.base_api_server_4
      vpc_security_group_ids.#:         "0" => "1"
      vpc_security_group_ids.610043384: "" => "sg-8a0632f6"

@bflad bflad self-assigned this Mar 7, 2018
@bflad bflad modified the milestones: v1.11.0, v1.12.0 Mar 7, 2018
@bflad bflad modified the milestones: v1.12.0, v1.13.0, v1.14.0 Mar 23, 2018
@bflad bflad modified the milestones: v1.14.0, v1.14.1 Apr 6, 2018
@bflad
Contributor

bflad commented Apr 11, 2018

Hi everyone. 👋 I spent a while looking into this and could not come up with a reproduction configuration on Terraform 0.11.2 + AWS 1.7.0 or Terraform 0.11.6 + AWS 1.14.0. I also notice the reports have dropped off since two months ago.

If someone could provide a minimal reproduction configuration and/or debug logs of this in action on the latest Terraform and AWS provider versions, that would be very helpful. The main code dealing with vpc_security_group_ids updates has not changed much at all over the last year, but maybe this is something upstream in Terraform core.

What strikes me as odd here is this output from your plan diff:

  ~ module.aphlict_instance.aws_instance.instance[0]
      vpc_security_group_ids.#:          "3" => "4"
      vpc_security_group_ids.3281223323: "" => "sg-232c3a45"

Meanwhile in the various flat and modularized Terraform configurations I tried on both older and newer versions, it shows the old values in the plan diff and applies the additions correctly:

  ~ aws_instance.test[0]
      vpc_security_group_ids.#:          "3" => "4"
      vpc_security_group_ids.1014795448: "sg-7e830700" => "sg-7e830700"
      vpc_security_group_ids.1939834146: "sg-408d093e" => "sg-408d093e"
      vpc_security_group_ids.390029114:  "sg-a18c08df" => "sg-a18c08df"
      vpc_security_group_ids.445449786:  "" => "sg-1a800464"

I wonder if that is related to the d.Get("vpc_security_group_ids") call not being able to fetch information or maybe targeted plan/apply.
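
The difference between the two diffs can be turned into a rough pre-apply check. The Python sketch below is a heuristic based only on the plan output quoted in this thread, not on documented Terraform behavior. It assumes the 0.11-style diff lines shown above and a single resource block as input; `omits_existing_groups` is a hypothetical helper name.

```python
import re

# Matches lines such as:  vpc_security_group_ids.#:  "3" => "4"
LINE = re.compile(
    r'^\s*vpc_security_group_ids\.(#|\d+):\s*"([^"]*)"\s*=>\s*"([^"]*)"\s*$'
)


def omits_existing_groups(plan_text):
    """Return True when fewer members are listed than the new '#' count."""
    target = None
    listed = 0
    for line in plan_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        key, _old, new = match.groups()
        if key == "#":
            target = int(new)
        elif new:
            # A member that is present after the change (kept or added).
            listed += 1
    return target is not None and listed < target


suspect_plan = '''
  ~ module.aphlict_instance.aws_instance.instance[0]
      vpc_security_group_ids.#:          "3" => "4"
      vpc_security_group_ids.3281223323: "" => "sg-232c3a45"
'''

healthy_plan = '''
  ~ aws_instance.test[0]
      vpc_security_group_ids.#:          "3" => "4"
      vpc_security_group_ids.1014795448: "sg-7e830700" => "sg-7e830700"
      vpc_security_group_ids.1939834146: "sg-408d093e" => "sg-408d093e"
      vpc_security_group_ids.390029114:  "sg-a18c08df" => "sg-a18c08df"
      vpc_security_group_ids.445449786:  "" => "sg-1a800464"
'''

print(omits_existing_groups(suspect_plan))
print(omits_existing_groups(healthy_plan))
```

A plan that claims four groups but lists only one member is the pattern reported in this issue; the diff from my test configurations lists all four.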

@cesc1989 if that instance is/was in a default VPC, that issue should be fixed as of version 1.9.0 of the AWS provider.

@bflad bflad removed this from the v1.14.1 milestone Apr 11, 2018
@bflad bflad removed their assignment Apr 11, 2018
@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Apr 11, 2018
@bflad
Contributor

bflad commented Apr 11, 2018

Looking back at Terraform core commits upstream around the timeframe of these reports, this commit upstream (released in Terraform 0.11.3) seemed suspect: hashicorp/terraform@8d1e479

Tracing this back I found this issue upstream which affected only Terraform 0.11.2 and seemingly configurations with ignore_changes defined in some manner: hashicorp/terraform#17117

So, hopefully, mystery solved. Please ping me if this needs to be reopened, but Terraform core versions other than 0.11.2 should work fine.
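
For anyone who wants to guard a pipeline against the affected core release, a minimal Python sketch follows. It assumes the first line of `terraform version` output looks like `Terraform v0.11.2`, as in the version strings quoted earlier in this thread. The helper names and the `AFFECTED` set are hypothetical.

```python
import re

# Core releases identified in this thread as affected.
AFFECTED = {"0.11.2"}


def core_version(version_output):
    """Extract 'X.Y.Z' from text such as 'Terraform v0.11.2', or None."""
    match = re.search(r"Terraform v(\d+\.\d+\.\d+)", version_output)
    return match.group(1) if match else None


def is_affected(version_output):
    """True when the reported core version is in the affected set."""
    return core_version(version_output) in AFFECTED


print(is_affected("Terraform v0.11.2"))
print(is_affected("Terraform v0.11.6"))
```

The version text would come from running `terraform version` before the plan step; how it is captured is left out of the sketch.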

@bflad bflad closed this as completed Apr 11, 2018
@bflad bflad added upstream-terraform Addresses functionality related to the Terraform core binary. and removed waiting-response Maintainers are waiting on response from community or contributor. labels Apr 11, 2018
@ghost

ghost commented Apr 6, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 6, 2020