Tag update to ExpressRoute Gateway causes all ER circuits to become disconnected #13368

Closed
robertdias opened this issue Sep 15, 2021 · 8 comments · Fixed by #16680

Comments

@robertdias

robertdias commented Sep 15, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

2.73.0

Affected Resource(s)

  • azurerm_express_route_connection
  • azurerm_express_route_gateway

Terraform Configuration Files

Imported from workflow:

           TF_VAR_azure_tags : "{
             Environment  = \"Production\",
             Repo         = \"infra\"
             }"

tfvars map:

er_gateway_list = {
  "P-TX-DALLAS-1A" = {
    resource_group_name = "P-TX-NET-WAN"
    location            = "usgovtexas"
    virtual_hub_name    = "P-TX-HUB-1A"
    scale_units         = 5
  },

main.tf

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.73.0"
    }
  }
}

terraform {
  backend "azurerm" {}
}

provider "azurerm" {
  skip_provider_registration = "true"
  features {}
}

...

resource "azurerm_express_route_gateway" "er_gateways" {
  for_each = var.er_gateway_list

  name                = each.key
  resource_group_name = each.value.resource_group_name
  location            = each.value.location
  virtual_hub_id      = azurerm_virtual_hub.virtual_hub[each.value.virtual_hub_name].id
  scale_units         = each.value.scale_units
  tags                = var.azure_tags
}

plan and apply:

Initializing the backend...

Successfully configured the backend "azurerm"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Finding hashicorp/azurerm versions matching "2.73.0"...
- Finding hashicorp/azuread versions matching "1.6.0"...
- Installing hashicorp/azurerm v2.73.0...
- Installed hashicorp/azurerm v2.73.0 (signed by HashiCorp)
- Installing hashicorp/azuread v1.6.0...
- Installed hashicorp/azuread v1.6.0 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider selections 
it made above. Include this file in your version control repository so that Terraform can
 guarantee to make the same selections by default when you run 
"terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" 
to see any changes that are required for your infrastructure. All Terraform commands 
should now work.

If you ever set or change modules or backend configuration for Terraform, rerun this
 command to reinitialize your working directory. If you forget, other commands will 
detect it and remind you to do so if necessary.

...
azurerm_express_route_gateway.er_gateways["P-TX-DALLAS-1A"]: Refreshing state... [id=/subscriptions/***/resourceGroups/P-TX-NET-WAN/providers/Microsoft.Network/expressRouteGateways/P-TX-DALLAS-1A]
...


Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement
 <= read (data resources)

Terraform will perform the following actions:

  # azurerm_express_route_gateway.er_gateways["P-TX-DALLAS-1A"] will be updated in-place
  ~ resource "azurerm_express_route_gateway" "er_gateways" ***
        id                  = "/subscriptions/***/resourceGroups/P-TX-NET-WAN/providers/Microsoft.Network/expressRouteGateways/P-TX-DALLAS-1A"
        name                = "P-TX-DALLAS-1A"
      ~ tags                = ***
          + "Repo"        = "infra"
            # (1 unchanged element hidden)
        ***
        # (4 unchanged attributes hidden)
    ***
...
...
azurerm_express_route_gateway.er_gateways["P-TX-DALLAS-1A"]: Modifications complete after 15m22s [id=/subscriptions/***/resourceGroups/P-TX-NET-WAN/providers/Microsoft.Network/expressRouteGateways/P-TX-DALLAS-1A]

Expected Behavior

We expected the `Repo = "infra"` tag to be applied to the ER Gateway as an in-place update.

Actual Behavior

The tag was applied, then all existing ER Connections were disconnected. This caused a complete outage for customer connectivity into Azure.

Steps to Reproduce

  1. Apply a tag to an existing ER gateway which has ER Connections into it.

Important Factoids

  • This is running in Azure Government.
  • The ExpressRoute connections are maintained in a separate repo and tfstate file.

@seanlok
Contributor

seanlok commented Oct 21, 2021

This also happens when the scale units are updated from code.

@paul-hugill

paul-hugill commented Oct 22, 2021

It is not specific to Azure Government; I saw this with Azure Global as well.

So far MS Support has not been able to provide any useful details, and they want us to edit the tags manually to see whether the same thing happens.
Obviously that needs some planning, given the potential for it to disconnect the circuits.

This was the response from them:

We've discussed internally and TF is running different PUTs (towards Gateway+Circuit) in the process. 
We can see that, then we see the peer IP's "changing" (well, back to their original IPs) and then that downtime.

In order for us to proceed further on our end, you will need to show that this happens via a non-script
deployment (portal/CLI/PS).

The other alternative is for you to approach Hashicorp for assistance to find out why/remove the
commands that are causing the PUT and outage.

I should also add that we updated tags on Gateways and Circuits at the same time, but the connection itself was handled manually, so it was not in code.

@xuzhang3
Contributor

xuzhang3 commented Nov 30, 2021

@paul-hugill The connections are lost because the gateway update operation omits the existing connections when it sends the request to the service. As a result, the service removes all of the connections during the update. We cannot simply add the connections back to the request body, because the SDK implements MarshalJSON() for ExpressRouteGatewayProperties in a way that ignores the connections, so the SDK issue needs to be fixed first. You can find ExpressRouteGatewayProperties.MarshalJSON() at line 15296 in https://github.com/hashicorp/terraform-provider-azurerm/blob/main/vendor/github.com/Azure/azure-sdk-for-go/services/network/mgmt/2021-02-01/network/models.go.
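
To illustrate the mechanism described in the comment above, here is a minimal sketch of how a hand-written MarshalJSON() can silently drop a field from a request body. It is not the SDK's actual code: `GatewayProperties`, `Connection`, `buildPutBody`, `hasKey`, and the JSON key names are hypothetical stand-ins chosen for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Connection and GatewayProperties are hypothetical stand-ins,
// not the real azure-sdk-for-go types.
type Connection struct {
	Name string `json:"name"`
}

type GatewayProperties struct {
	ScaleUnits  int          `json:"scaleUnits"`
	Connections []Connection `json:"expressRouteConnections,omitempty"`
}

// MarshalJSON mimics a custom marshaller that treats Connections as
// read-only: only ScaleUnits is written to the output, so the struct
// tag on Connections is never consulted.
func (p GatewayProperties) MarshalJSON() ([]byte, error) {
	body := map[string]interface{}{
		"scaleUnits": p.ScaleUnits,
	}
	return json.Marshal(body)
}

// buildPutBody serializes the properties the way a client would before
// sending an update request.
func buildPutBody(p GatewayProperties) ([]byte, error) {
	return json.Marshal(p)
}

// hasKey reports whether the JSON object in raw has the given top-level key.
func hasKey(raw []byte, key string) bool {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return false
	}
	_, ok := obj[key]
	return ok
}

func main() {
	props := GatewayProperties{
		ScaleUnits:  5,
		Connections: []Connection{{Name: "conn-1"}},
	}
	body, err := buildPutBody(props)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))                             // prints {"scaleUnits":5}
	fmt.Println(hasKey(body, "expressRouteConnections")) // prints false
}
```

Even though the caller populated `Connections`, the serialized body does not contain them. If a service treats such a body as the full desired state, it would plausibly remove the connections, which matches the behaviour reported in this issue.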

@hriaz

hriaz commented Jan 31, 2022

I can confirm this happens under Azure Virtual WAN as well for ExpressRoute circuits. A tag update took down 3 of the customer's production ExpressRoute circuits. Because the update is a long-running operation, it took almost 24 minutes to reconnect everything. Not fun in global infrastructure.

@bubbletroubles
Contributor

Is this issue resolved now? AzureRM Provider 3.5.0 has included the updated SDK which was required to fix this.

@xuzhang3
Contributor

xuzhang3 commented May 6, 2022

@bubbletroubles I submitted PR #16680 to fix this issue.

@github-actions github-actions bot added this to the v3.6.0 milestone May 12, 2022
@github-actions

This functionality has been released in v3.6.0 of the Terraform Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 17, 2022