
azurerm_private_endpoint fails intermittently with retriable error #2 #21293

Open
tjuoz opened this issue Apr 5, 2023 · 9 comments · May be fixed by #28112
Labels
bug · service/network · upstream/microsoft (indicates that there's an upstream issue blocking this issue/PR) · v/3.x

Comments

@tjuoz

tjuoz commented Apr 5, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

1.2.3

AzureRM Provider Version

3.50.0

Affected Resource(s)/Data Source(s)

azurerm_private_endpoint

Terraform Configuration Files

resource "azurerm_private_endpoint" "pe_1_db" {
  name                = "mwe-test-1db-pe"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  subnet_id           = azurerm_subnet.subnet_1_db.id

  private_service_connection {
    name = "1db-pe"
    is_manual_connection = false
    private_connection_resource_id = module.database_1_db.sql_server.id
    subresource_names = ["sqlServer"]
  }
}

Debug Output/Panic Output

`Error: waiting for creation of Private Endpoint "mwe-test-1db-pe" (Resource Group "mwe-test-01-rg"): Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/virtualNetworks/mwe-test-01-vnet/subnets/mwe-test-1db-subnet used by resource /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/networkInterfaces/mwe-test-1db-pe.nic.771d4852-46ce-48b3-80ed-f98344f7f778 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."}]`

Expected Behaviour

Either Terraform should automatically retry retryable errors instead of failing, or the Private Endpoint's interactions with the Subnet should occur only once the Subnet is in the Succeeded state (a dependency issue?).

Actual Behaviour

Sometimes during provisioning of a private endpoint, we see the following error. Looking in the Azure portal, the Private Endpoint does exist and is working. However, we cannot simply run terraform apply again, since the resource does not exist in state. We need to manually delete the PE first (or manually import it).

Terraform logs show that the subnet resource creation was completed before the creation of the Private Endpoint.

The issue was first encountered with azurerm version 3.39.1 and is still present in the latest (at the time of writing) version, 3.50.0.

Steps to Reproduce

The issue appears randomly and is present both when creating multiple Private Endpoints and when creating a single one.

Important Factoids

As mentioned in issue #16182, there is a higher chance of encountering the error when multiple Private Endpoints are created in parallel, but it also happens when creating a single Private Endpoint. We're trying to work around the issue by deploying a time_sleep resource that depends on the Subnet resource, and adding a depends_on property to the Private Endpoint resource.

References

The bug is essentially the same as described in the already closed issue #16182.

@tjuoz
Author

tjuoz commented Apr 7, 2023

We added time_sleep between the Subnet resource and the Private Endpoint, and as expected this solved the initial issue.

resource "time_sleep" "pe_1_db_ts" {
  depends_on      = [azurerm_subnet.subnet_1_db]
  create_duration = "60s"
}
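The comment doesn't show the consuming side of the workaround; below is a minimal sketch of how the private endpoint from the original report would reference the sleep. The `depends_on` line is the assumed addition (everything else matches the config in the issue body):

```hcl
# Sketch: the PE waits on the time_sleep, which itself waits on the subnet,
# giving the subnet ~60s to settle into the Succeeded state.
resource "azurerm_private_endpoint" "pe_1_db" {
  name                = "mwe-test-1db-pe"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  subnet_id           = azurerm_subnet.subnet_1_db.id

  # Assumed addition: explicit dependency on the sleep above.
  depends_on = [time_sleep.pe_1_db_ts]

  private_service_connection {
    name                           = "1db-pe"
    is_manual_connection           = false
    private_connection_resource_id = module.database_1_db.sql_server.id
    subresource_names              = ["sqlServer"]
  }
}
```

Note this only delays creation; it does not make the apply retry if the subnet happens to re-enter the Updating state later.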

Yet another issue with the Subnet resource appeared when creating an azurerm_mssql_virtual_network_rule resource for azurerm_mssql_server, configuring the MSSQL Virtual Network Rule with the Subnet ID of a VM created alongside the MSSQL Server. The error we're receiving is:

Error: creating MSSQL Virtual Network Rule: (Name "mwe-test-2mo-subnet" / Server Name "mwe-test-1db-sql" / Resource Group "mwe-test-01-rg"): sql.VirtualNetworkRulesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="VirtualNetworkRuleBadRequest" Message="Azure SQL Server Virtual Network Rule encountered an user error: Cannot proceed with operation because subnets mwe-test-2mo-subnet of the virtual network /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/virtualNetworks/mwe-test-01-vnet are not provisioned. They are in Updating state."

It seems that even though both the Private Endpoint and the MSSQL Virtual Network Rule resources wait for their respective Subnet resources to finish creating, once Terraform confirms that the Subnet resource creation is complete, the resource itself still remains in the Updating state for some time. I suppose this time frame varies, hence the intermittent nature of the errors mentioned.

@dannyger97

Any update on this? I still see this error with Private Endpoints on 3.90.

@magodo
Collaborator

magodo commented Nov 19, 2024

I'm using v4.10.0 and trying to reproduce the issue with the following config:

Config
provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
}

resource "azurerm_resource_group" "example" {
  name     = "mgd-petest"
  location = "West Europe"
}

resource "azurerm_storage_account" "example" {
  name                     = "examplemgdaccount"
  resource_group_name      = azurerm_resource_group.example.name
  location                 = azurerm_resource_group.example.location
  account_tier             = "Standard"
  account_replication_type = "LRS"

  allow_nested_items_to_be_public = false
}

resource "azurerm_virtual_network" "example" {
  name                = "virtnetname"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_subnet" "example" {
  name                 = "subnetname2"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.example.name
  address_prefixes     = ["10.0.2.0/24"]
}

resource "azurerm_private_dns_zone" "example" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_private_dns_zone_virtual_network_link" "example" {
  name                  = "example-link"
  resource_group_name   = azurerm_resource_group.example.name
  private_dns_zone_name = azurerm_private_dns_zone.example.name
  virtual_network_id    = azurerm_virtual_network.example.id
}

resource "azurerm_private_endpoint" "example" {
  count = 5

  name                = "example-endpoint-a-${count.index}"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example.id

  private_service_connection {
    name                           = "example-privateserviceconnection-${count.index}"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "example-dns-zone-group"
    private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
  }
}

I've tried multiple times and tuned the count a bit. I've also tried creating another set of PEs while these PEs were being created. Everything works fine for me.

The error message indicates that the subnet is updated during the PE creation. My suspicion is that other resources deployed in the same TF run, with no dependency on the PE, are updating the subnet at the same time as the PE.

To test this, I first set up everything (vnet, subnet, storage account, and the PE-related dns zones) in a first terraform run, then deployed the following resources in a second run:

resource "azurerm_private_dns_zone_virtual_network_link" "example" {
  name                  = "example-link"
  resource_group_name   = azurerm_resource_group.example.name
  private_dns_zone_name = azurerm_private_dns_zone.example.name
  virtual_network_id    = azurerm_virtual_network.example.id
}

resource "azurerm_subnet_network_security_group_association" "example" {
  subnet_id                 = azurerm_subnet.example.id
  network_security_group_id = azurerm_network_security_group.example.id
}

resource "azurerm_private_endpoint" "example" {
  count = 2

  name                = "example-endpoint-a-${count.index}"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example.id

  private_service_connection {
    name                           = "example-privateserviceconnection-${count.index}"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "example-dns-zone-group"
    private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
  }
}

My suspicion was that each of the above resources would more or less update the subnet/vnet. Inspecting the traffic shows that the PUT requests for each one happen simultaneously:

[image: HTTP trace showing the PUT requests for each resource occurring simultaneously]

And the result is still successful.

Could someone please share a minimal reproducible config with me? It is also worth mentioning the region you are targeting, as Azure services can behave differently across regions.

PS: If you are able to reproduce this issue, please record the correlation request id and submit an Azure support ticket, so that the service team can point out which operation (ideally mapped to an API) caused the subnet to enter the "Updating" state.

@magodo
Collaborator

magodo commented Nov 25, 2024

Good news: I've successfully reproduced this issue. I will work on a solution soon!

@magodo
Collaborator

magodo commented Nov 26, 2024

I've successfully reproduced this issue (on almost every run) via the following steps:

  1. Provision a resource group, an SA (for the PE), a private dns zone (for the PE), a vnet with a bunch of subnets, and an NSG with a bunch of NSRs:

    Config1
     provider "azurerm" {
       features {
         resource_group {
           prevent_deletion_if_contains_resources = false
         }
       }
     }
    

    locals {
      rules = {
        r0 = { priority = 100 }
        r1 = { priority = 101 }
        r2 = { priority = 102 }
        r3 = { priority = 103 }
        r4 = { priority = 104 }
        r5 = { priority = 105 }
        r6 = { priority = 106 }
        r7 = { priority = 107 }
        r8 = { priority = 108 }
        r9 = { priority = 109 }
      }

      subnets = {
        ssub1 = { address_prefixes = ["10.0.1.0/24"] }
        ssub2 = { address_prefixes = ["10.0.2.0/24"] }
        ssub3 = { address_prefixes = ["10.0.3.0/24"] }
        ssub4 = { address_prefixes = ["10.0.4.0/24"] }
        ssub5 = { address_prefixes = ["10.0.5.0/24"] }
        ssub6 = { address_prefixes = ["10.0.6.0/24"] }
        ssub7 = { address_prefixes = ["10.0.7.0/24"] }
        ssub8 = { address_prefixes = ["10.0.8.0/24"] }
        ssub9 = { address_prefixes = ["10.0.9.0/24"] }
      }
    }

    resource "azurerm_resource_group" "example" {
      name     = "acctest-mgd-petestt4"
      location = "australiaeast"
    }

    resource "azurerm_storage_account" "example" {
      name                     = "examplemgdaccount4"
      resource_group_name      = azurerm_resource_group.example.name
      location                 = azurerm_resource_group.example.location
      account_tier             = "Standard"
      account_replication_type = "LRS"

      allow_nested_items_to_be_public = false
    }

    resource "azurerm_private_dns_zone" "example" {
      name                = "privatelink.blob.core.windows.net"
      resource_group_name = azurerm_resource_group.example.name
    }

    resource "azurerm_virtual_network" "example" {
      name                = "virtnetname11"
      address_space       = ["10.0.0.0/16"]
      location            = azurerm_resource_group.example.location
      resource_group_name = azurerm_resource_group.example.name
    }

    resource "azurerm_subnet" "example" {
      for_each             = local.subnets
      name                 = each.key
      resource_group_name  = azurerm_resource_group.example.name
      virtual_network_name = azurerm_virtual_network.example.name
      address_prefixes     = each.value.address_prefixes
    }

    resource "azurerm_network_security_group" "example" {
      name                = "acceptanceTestSecurityGroup"
      location            = azurerm_resource_group.example.location
      resource_group_name = azurerm_resource_group.example.name
    }

    resource "azurerm_network_security_rule" "example" {
      for_each                    = local.rules
      name                        = each.key
      network_security_group_name = azurerm_network_security_group.example.name
      resource_group_name         = azurerm_resource_group.example.name
      priority                    = each.value.priority
      direction                   = "Outbound"
      access                      = "Allow"
      protocol                    = "Tcp"
      source_port_range           = "*"
      destination_port_range      = "*"
      source_address_prefix       = "*"
      destination_address_prefix  = "*"
    }

    Apply the above config first.

  2. Add the following configuration, which creates a PE inside each subnet, all targeting the same SA. Meanwhile, associate each subnet with the same NSG:

    Config2
     resource "azurerm_subnet_network_security_group_association" "example" {
       for_each = local.subnets
       subnet_id                 = azurerm_subnet.example[each.key].id
       network_security_group_id = azurerm_network_security_group.example.id
     }
    

    resource "azurerm_private_endpoint" "example" {
      for_each = local.subnets

      name                = "example-endpoint-in-${each.key}"
      location            = azurerm_resource_group.example.location
      resource_group_name = azurerm_resource_group.example.name
      subnet_id           = azurerm_subnet.example[each.key].id

      private_service_connection {
        name                           = "example-privateserviceconnection-${each.key}"
        private_connection_resource_id = azurerm_storage_account.example.id
        subresource_names              = ["blob"]
        is_manual_connection           = false
      }

      private_dns_zone_group {
        name                 = "example-dns-zone-group"
        private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
      }
    }

    Apply this with a large parallelism, e.g. terraform apply -parallelism=100.

If you are not too lucky, you'll probably encounter the following errors:

  1. The PUT request for creating the PE returns 429, with the following response:

    {
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/virtualNetworks/virtnetname11/subnets/ssub8 used by resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/privateEndpoints/example-endpoint-in-ssub8 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."
          }
        ]
      }
    }
  2. The PUT request for updating the subnet (to associate the NSG) returns 429, with the following response:

    {
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/networkInterfaces/example-endpoint-in-ssub1.nic.aafbedf6-7b9f-4e20-82b3-37429021c8c9/ipConfigurations/privateEndpointIpConfig.cfc01f41-a634-4997-bd0c-feb67e6c40c6 used by resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/virtualNetworks/virtnetname11/subnets/ssub1 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutNicOperation."
          }
        ]
      }
    }
  3. The LRO polling of the PE returns either of the errors below:

    {
      "status": "Failed",
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/virtualNetworks/virtnetnamex/subnets/ssub3 used by resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/networkInterfaces/example-endpoint-in-ssub3.nic.b1ecd3e7-fbcb-4e40-960b-5e7937dd8b66 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."
          }
        ]
      }
    }

    {
      "status": "Failed",
      "error": {
        "code": "StorageAccountOperationInProgress",
        "message": "Call to Microsoft.Storage/storageAccounts failed. Error message: An operation is currently performing on this storage account that requires exclusive access.",
        "details": []
      }
    }

My analysis of the errors above is:

  • Both the NSG association operation and the PE creation operation (which creates a NIC within the subnet) update the subnet, putting the subnet into the Updating state. For the same subnet, whichever request arrives second gets a 429.

    The good news is that the 429 is handled by the SDK and retried.

    The bad news is that in some rare cases both requests are handled (nearly) simultaneously. Each of them sees the subnet in a ready state, so both return 201 and proceed with the LRO. At a later point in time, the PE operation re-checks the subnet status and finds it in the "Updating" state, in which case the LRO polling request returns the 3.1 error response (the first error shown under item 3 above). Unfortunately, this is not retried in today's SDK.

  • Since a bunch of PEs are being created targeting the same SA, they likely require changes to the SA as well. Concurrent changes to the SA then cause the 3.2 error response (the second error under item 3). As with the 3.1 error response, it is not retried in today's SDK.
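Given that analysis, one configuration-level mitigation for the repro above is to make sure the two subnet writers never run concurrently. A sketch based on Config2, where the added depends_on is the only change and forces all PEs to wait for all NSG associations:

```hcl
resource "azurerm_private_endpoint" "example" {
  for_each = local.subnets

  name                = "example-endpoint-in-${each.key}"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example[each.key].id

  # Assumed addition: serialize the two subnet-updating operations, so the
  # NSG association and the PE's NIC creation never touch a subnet at once.
  depends_on = [azurerm_subnet_network_security_group_association.example]

  private_service_connection {
    name                           = "example-privateserviceconnection-${each.key}"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "example-dns-zone-group"
    private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
  }
}
```

This only addresses the subnet race; the concurrent SA updates from parallel PE creations (the 3.2 error) would still need SDK retries or reduced parallelism.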

@Tolbin400

Any progress on a fix for this issue?

@kiweezi

kiweezi commented Jan 14, 2025

We're still seeing this on v3.114.0.
Thanks for the rundown @magodo. It seems my team has been seeing this for some time but had no idea what was causing it.

We deploy lots of resources with PEs to the same subnets, probably 10+ in some subnets, likely applying in parallel.
Combined with the issues around deploying multiple subnets in a vnet, this makes networking on Azure a nightmare to set up in Terraform.

I appreciate this is not the provider's fault and that a lot of the issues come from Azure's API. That said, it would be great to have this handled by the provider, as I can't imagine a way of fixing it 100% within Terraform configuration!

How is progress on fixing this? Do we have any time frames / estimates?
Are there any suggested workarounds for cases like ours in the meantime?

@tjuoz
Author

tjuoz commented Jan 14, 2025

@kiweezi Our implemented workaround:

  • Add a time_sleep resource definition between the subnet and private_endpoint resources.
  • Add an explicit dependency between the mssql vnet rule and the subnet.
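A sketch of the second part, with hypothetical resource names (the rule already references the subnet implicitly through subnet_id; the sleep and its depends_on are the assumed additions):

```hcl
# Sketch (hypothetical names): delay the vnet rule until the subnet has
# had time to leave the Updating state.
resource "time_sleep" "subnet_2_mo_ts" {
  depends_on      = [azurerm_subnet.subnet_2_mo]
  create_duration = "60s"
}

resource "azurerm_mssql_virtual_network_rule" "rule_2_mo" {
  name      = "mwe-test-2mo-subnet"
  server_id = azurerm_mssql_server.main.id
  subnet_id = azurerm_subnet.subnet_2_mo.id

  # Assumed addition: wait out the subnet's post-creation Updating window.
  depends_on = [time_sleep.subnet_2_mo_ts]
}
```

A fixed sleep is a heuristic, not a guarantee; the duration may need tuning per region.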

@magodo
Collaborator

magodo commented Feb 3, 2025

PR #28112 depends on hashicorp/go-azure-sdk#1124. Cc @jackofallops.

9 participants