
azurerm_private_endpoint fails intermittently with retriable error #2 #21293

Open
tjuoz opened this issue Apr 5, 2023 · 9 comments · May be fixed by #28112
Labels
bug · service/network · upstream/microsoft (indicates that there's an upstream issue blocking this issue/PR) · v/3.x

Comments

@tjuoz

tjuoz commented Apr 5, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

1.2.3

AzureRM Provider Version

3.50.0

Affected Resource(s)/Data Source(s)

azurerm_private_endpoint

Terraform Configuration Files

resource "azurerm_private_endpoint" "pe_1_db" {
  name                = "mwe-test-1db-pe"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  subnet_id           = azurerm_subnet.subnet_1_db.id

  private_service_connection {
    name = "1db-pe"
    is_manual_connection = false
    private_connection_resource_id = module.database_1_db.sql_server.id
    subresource_names = ["sqlServer"]
  }
}

Debug Output/Panic Output

`Error: waiting for creation of Private Endpoint "mwe-test-1db-pe" (Resource Group "mwe-test-01-rg"): Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/virtualNetworks/mwe-test-01-vnet/subnets/mwe-test-1db-subnet used by resource /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/networkInterfaces/mwe-test-1db-pe.nic.771d4852-46ce-48b3-80ed-f98344f7f778 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."}]`

Expected Behaviour

Either Terraform should automatically retry retryable errors instead of failing, or the Private Endpoint's interactions with the Subnet should occur only once the Subnet is in the Succeeded state (a dependency issue?).

Actual Behaviour

Sometimes during provisioning of a private endpoint, we see the following error. Looking in the Azure portal, the Private Endpoint does exist and is working. However, we cannot simply run terraform apply again, since the resource does not exist in state. We need to manually delete the PE first (or manually import it).

Terraform logs show that the subnet resource creation was completed before the creation of the Private Endpoint.

The issue was first encountered with azurerm version 3.39.1 and is still present in the latest (at the time of writing) version, 3.50.0.

Steps to Reproduce

The issue appears randomly and is present both when creating multiple Private Endpoints and when creating a single one.

Important Factoids

As mentioned in issue #16182, there is a higher chance of encountering the error when multiple Private Endpoints are created in parallel, but it also happens when creating a single Private Endpoint. We're trying to work around the issue by deploying a time_sleep resource that depends on the Subnet resource, and adding a depends_on property to the Private Endpoint resource.

References

The bug is essentially the same as described in the already closed issue #16182.

@tjuoz
Author

tjuoz commented Apr 7, 2023

We added time_sleep between the Subnet resource and the Private Endpoint, and as expected this solved the initial issue.

resource "time_sleep" "pe_1_db_ts" {
  depends_on      = [azurerm_subnet.subnet_1_db]
  create_duration = "60s"
}
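The comment doesn't show the consuming side of the workaround; below is a minimal sketch of how the private endpoint from the original report would reference the sleep. The `depends_on` line is the assumed addition (everything else matches the config in the issue body):

```hcl
# Sketch: the PE waits on the time_sleep, which itself waits on the subnet,
# giving the subnet ~60s to settle into the Succeeded state.
resource "azurerm_private_endpoint" "pe_1_db" {
  name                = "mwe-test-1db-pe"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  subnet_id           = azurerm_subnet.subnet_1_db.id

  # Assumed addition: explicit dependency on the sleep above.
  depends_on = [time_sleep.pe_1_db_ts]

  private_service_connection {
    name                           = "1db-pe"
    is_manual_connection           = false
    private_connection_resource_id = module.database_1_db.sql_server.id
    subresource_names              = ["sqlServer"]
  }
}
```

Note this only delays creation; it does not make the apply retry if the subnet happens to re-enter the Updating state later.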

Yet another issue with the Subnet resource appeared when creating an azurerm_mssql_virtual_network_rule resource for azurerm_mssql_server, configuring the MSSQL Virtual Network Rule with the Subnet ID of a VM created alongside the MSSQL Server. The error we're receiving is:

Error: creating MSSQL Virtual Network Rule: (Name "mwe-test-2mo-subnet" / Server Name "mwe-test-1db-sql" / Resource Group "mwe-test-01-rg"): sql.VirtualNetworkRulesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="VirtualNetworkRuleBadRequest" Message="Azure SQL Server Virtual Network Rule encountered an user error: Cannot proceed with operation because subnets mwe-test-2mo-subnet of the virtual network /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/virtualNetworks/mwe-test-01-vnet are not provisioned. They are in Updating state."

It seems that even though both the Private Endpoint and the MSSQL Virtual Network Rule resources wait for their respective Subnet resources to finish creating, once Terraform confirms that the Subnet resource creation is complete, the resource itself still remains in the Updating state for some time. I suppose this time frame varies, hence the intermittent nature of the errors mentioned.

@dannyger97

Any update on this? I still see this error with Private Endpoints on 3.90.

@magodo
Collaborator

magodo commented Nov 19, 2024

I'm using v4.10.0 and trying to reproduce the issue with the following config:

Config
provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
}

resource "azurerm_resource_group" "example" {
  name     = "mgd-petest"
  location = "West Europe"
}

resource "azurerm_storage_account" "example" {
  name                     = "examplemgdaccount"
  resource_group_name      = azurerm_resource_group.example.name
  location                 = azurerm_resource_group.example.location
  account_tier             = "Standard"
  account_replication_type = "LRS"

  allow_nested_items_to_be_public = false
}

resource "azurerm_virtual_network" "example" {
  name                = "virtnetname"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_subnet" "example" {
  name                 = "subnetname2"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.example.name
  address_prefixes     = ["10.0.2.0/24"]
}

resource "azurerm_private_dns_zone" "example" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_private_dns_zone_virtual_network_link" "example" {
  name                  = "example-link"
  resource_group_name   = azurerm_resource_group.example.name
  private_dns_zone_name = azurerm_private_dns_zone.example.name
  virtual_network_id    = azurerm_virtual_network.example.id
}

resource "azurerm_private_endpoint" "example" {
  count = 5

  name                = "example-endpoint-a-${count.index}"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example.id

  private_service_connection {
    name                           = "example-privateserviceconnection-${count.index}"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "example-dns-zone-group"
    private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
  }
}

I've tried multiple times and tuned the count a bit. I've also tried creating another set of PEs while these PEs were being created. Everything works fine for me.

The error message indicates that the subnet is updated during the PE creation. My suspicion is that other resources deployed in the same TF run, with no dependency on the PE, are updating the subnet at the same time as the PE.

To test this, I first set up everything (vnet, subnet, storage account, and the PE-related dns zones) in a first terraform run, then deployed the following resources in a second run:

resource "azurerm_private_dns_zone_virtual_network_link" "example" {
  name                  = "example-link"
  resource_group_name   = azurerm_resource_group.example.name
  private_dns_zone_name = azurerm_private_dns_zone.example.name
  virtual_network_id    = azurerm_virtual_network.example.id
}

resource "azurerm_subnet_network_security_group_association" "example" {
  subnet_id                 = azurerm_subnet.example.id
  network_security_group_id = azurerm_network_security_group.example.id
}

resource "azurerm_private_endpoint" "example" {
  count = 2

  name                = "example-endpoint-a-${count.index}"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example.id

  private_service_connection {
    name                           = "example-privateserviceconnection-${count.index}"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "example-dns-zone-group"
    private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
  }
}

My suspicion was that each of the above resources would more or less update the subnet/vnet. Inspecting the traffic shows that the PUT requests for each one happen simultaneously:

[image: HTTP trace showing the PUT requests for each resource occurring simultaneously]

And the result is still successful.

Could someone please share a minimal reproducible config with me? It is also worth mentioning the region you are targeting, as Azure services can behave differently across regions.

PS: If you are able to reproduce this issue, please record the correlation request id and submit an Azure support ticket, so that the service team can point out which operation (ideally mapped to an API) caused the subnet to enter the "Updating" state.

@magodo
Collaborator

magodo commented Nov 25, 2024

Good news: I've successfully reproduced this issue. I will work on a solution soon!

@magodo
Collaborator

magodo commented Nov 26, 2024

I've successfully reproduced this issue (on almost every run) via the following steps:

  1. Provision a resource group, an SA (for the PE), a private dns zone (for the PE), a vnet with a bunch of subnets, and an NSG with a bunch of NSRs:

    Config1
     provider "azurerm" {
       features {
         resource_group {
           prevent_deletion_if_contains_resources = false
         }
       }
     }
    

    locals {
      rules = {
        r0 = { priority = 100 }
        r1 = { priority = 101 }
        r2 = { priority = 102 }
        r3 = { priority = 103 }
        r4 = { priority = 104 }
        r5 = { priority = 105 }
        r6 = { priority = 106 }
        r7 = { priority = 107 }
        r8 = { priority = 108 }
        r9 = { priority = 109 }
      }

      subnets = {
        ssub1 = { address_prefixes = ["10.0.1.0/24"] }
        ssub2 = { address_prefixes = ["10.0.2.0/24"] }
        ssub3 = { address_prefixes = ["10.0.3.0/24"] }
        ssub4 = { address_prefixes = ["10.0.4.0/24"] }
        ssub5 = { address_prefixes = ["10.0.5.0/24"] }
        ssub6 = { address_prefixes = ["10.0.6.0/24"] }
        ssub7 = { address_prefixes = ["10.0.7.0/24"] }
        ssub8 = { address_prefixes = ["10.0.8.0/24"] }
        ssub9 = { address_prefixes = ["10.0.9.0/24"] }
      }
    }

    resource "azurerm_resource_group" "example" {
      name     = "acctest-mgd-petestt4"
      location = "australiaeast"
    }

    resource "azurerm_storage_account" "example" {
      name                     = "examplemgdaccount4"
      resource_group_name      = azurerm_resource_group.example.name
      location                 = azurerm_resource_group.example.location
      account_tier             = "Standard"
      account_replication_type = "LRS"

      allow_nested_items_to_be_public = false
    }

    resource "azurerm_private_dns_zone" "example" {
      name                = "privatelink.blob.core.windows.net"
      resource_group_name = azurerm_resource_group.example.name
    }

    resource "azurerm_virtual_network" "example" {
      name                = "virtnetname11"
      address_space       = ["10.0.0.0/16"]
      location            = azurerm_resource_group.example.location
      resource_group_name = azurerm_resource_group.example.name
    }

    resource "azurerm_subnet" "example" {
      for_each             = local.subnets
      name                 = each.key
      resource_group_name  = azurerm_resource_group.example.name
      virtual_network_name = azurerm_virtual_network.example.name
      address_prefixes     = each.value.address_prefixes
    }

    resource "azurerm_network_security_group" "example" {
      name                = "acceptanceTestSecurityGroup"
      location            = azurerm_resource_group.example.location
      resource_group_name = azurerm_resource_group.example.name
    }

    resource "azurerm_network_security_rule" "example" {
      for_each                    = local.rules
      name                        = each.key
      network_security_group_name = azurerm_network_security_group.example.name
      resource_group_name         = azurerm_resource_group.example.name
      priority                    = each.value.priority
      direction                   = "Outbound"
      access                      = "Allow"
      protocol                    = "Tcp"
      source_port_range           = "*"
      destination_port_range      = "*"
      source_address_prefix       = "*"
      destination_address_prefix  = "*"
    }

    Apply the above config first.

  2. Add the following configuration, which creates a PE inside each subnet, all targeting the same SA. Meanwhile, associate each subnet with the same NSG:

    Config2
     resource "azurerm_subnet_network_security_group_association" "example" {
       for_each = local.subnets
       subnet_id                 = azurerm_subnet.example[each.key].id
       network_security_group_id = azurerm_network_security_group.example.id
     }
    

    resource "azurerm_private_endpoint" "example" {
      for_each = local.subnets

      name                = "example-endpoint-in-${each.key}"
      location            = azurerm_resource_group.example.location
      resource_group_name = azurerm_resource_group.example.name
      subnet_id           = azurerm_subnet.example[each.key].id

      private_service_connection {
        name                           = "example-privateserviceconnection-${each.key}"
        private_connection_resource_id = azurerm_storage_account.example.id
        subresource_names              = ["blob"]
        is_manual_connection           = false
      }

      private_dns_zone_group {
        name                 = "example-dns-zone-group"
        private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
      }
    }

    Apply this with a large parallelism, e.g. terraform apply -parallelism=100.

If you are not too lucky, you'll probably encounter the following errors:

  1. The PUT request for creating the PE returns 429, with the following response:

    {
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/virtualNetworks/virtnetname11/subnets/ssub8 used by resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/privateEndpoints/example-endpoint-in-ssub8 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."
          }
        ]
      }
    }
  2. The PUT request for updating the subnet (to associate the NSG) returns 429, with the following response:

    {
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/networkInterfaces/example-endpoint-in-ssub1.nic.aafbedf6-7b9f-4e20-82b3-37429021c8c9/ipConfigurations/privateEndpointIpConfig.cfc01f41-a634-4997-bd0c-feb67e6c40c6 used by resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/virtualNetworks/virtnetname11/subnets/ssub1 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutNicOperation."
          }
        ]
      }
    }
  3. The LRO polling of the PE returns either of the errors below:

    {
      "status": "Failed",
      "error": {
        "code": "RetryableError",
        "message": "A retryable error occurred.",
        "details": [
          {
            "code": "ReferencedResourceNotProvisioned",
            "message": "Cannot proceed with operation because resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/virtualNetworks/virtnetnamex/subnets/ssub3 used by resource /subscriptions/xxxx/resourceGroups/acctest-mgd-petestt4/providers/Microsoft.Network/networkInterfaces/example-endpoint-in-ssub3.nic.b1ecd3e7-fbcb-4e40-960b-5e7937dd8b66 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."
          }
        ]
      }
    }

    {
      "status": "Failed",
      "error": {
        "code": "StorageAccountOperationInProgress",
        "message": "Call to Microsoft.Storage/storageAccounts failed. Error message: An operation is currently performing on this storage account that requires exclusive access.",
        "details": []
      }
    }

My analysis of the errors above is:

  • Both the NSG association operation and the PE creation operation (which creates a NIC within the subnet) update the subnet, putting the subnet into the Updating state. For the same subnet, whichever request arrives second gets a 429.

    The good news is that the 429 is handled by the SDK and retried.

    The bad news is that in some rare cases both requests are handled (nearly) simultaneously. Each of them sees the subnet in a ready state, so both return 201 and proceed with the LRO. At a later point in time, the PE operation re-checks the subnet status and finds it in the "Updating" state, in which case the LRO polling request returns the 3.1 error response (the first error shown under item 3 above). Unfortunately, this is not retried in today's SDK.

  • Since a bunch of PEs are being created targeting the same SA, they likely require changes to the SA as well. Concurrent changes to the SA then cause the 3.2 error response (the second error under item 3). As with the 3.1 error response, it is not retried in today's SDK.
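Given that analysis, one configuration-level mitigation for the repro above is to make sure the two subnet writers never run concurrently. A sketch based on Config2, where the added depends_on is the only change and forces all PEs to wait for all NSG associations:

```hcl
resource "azurerm_private_endpoint" "example" {
  for_each = local.subnets

  name                = "example-endpoint-in-${each.key}"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example[each.key].id

  # Assumed addition: serialize the two subnet-updating operations, so the
  # NSG association and the PE's NIC creation never touch a subnet at once.
  depends_on = [azurerm_subnet_network_security_group_association.example]

  private_service_connection {
    name                           = "example-privateserviceconnection-${each.key}"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "example-dns-zone-group"
    private_dns_zone_ids = [azurerm_private_dns_zone.example.id]
  }
}
```

This only addresses the subnet race; the concurrent SA updates from parallel PE creations (the 3.2 error) would still need SDK retries or reduced parallelism.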

@Tolbin400

Any progress on a fix for this issue?

@kiweezi

kiweezi commented Jan 14, 2025

We're still seeing this on v3.114.0.
Thanks for the rundown @magodo. It seems my team has been seeing this for some time but had no idea what was causing it.

We deploy lots of resources with PEs to the same subnets, probably 10+ in some subnets, likely applying in parallel.
Combined with the issues around deploying multiple subnets in a vnet, this makes networking on Azure a nightmare to set up in Terraform.

I appreciate this is not the provider's fault and that a lot of the issues come from Azure's API. That said, it would be great to have this handled by the provider, as I can't imagine a way of fixing it 100% within Terraform configuration!

How is progress on fixing this? Do we have any time frames / estimates?
Are there any suggested workarounds for cases like ours in the meantime?

@tjuoz
Author

tjuoz commented Jan 14, 2025

@kiweezi Our implemented workaround:

  • Add a time_sleep resource definition between the subnet and private_endpoint resources.
  • Add an explicit dependency between the mssql vnet rule and the subnet.
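A sketch of the second part, with hypothetical resource names (the rule already references the subnet implicitly through subnet_id; the sleep and its depends_on are the assumed additions):

```hcl
# Sketch (hypothetical names): delay the vnet rule until the subnet has
# had time to leave the Updating state.
resource "time_sleep" "subnet_2_mo_ts" {
  depends_on      = [azurerm_subnet.subnet_2_mo]
  create_duration = "60s"
}

resource "azurerm_mssql_virtual_network_rule" "rule_2_mo" {
  name      = "mwe-test-2mo-subnet"
  server_id = azurerm_mssql_server.main.id
  subnet_id = azurerm_subnet.subnet_2_mo.id

  # Assumed addition: wait out the subnet's post-creation Updating window.
  depends_on = [time_sleep.subnet_2_mo_ts]
}
```

A fixed sleep is a heuristic, not a guarantee; the duration may need tuning per region.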

@magodo
Collaborator

magodo commented Feb 3, 2025

PR #28112 depends on hashicorp/go-azure-sdk#1124. Cc @jackofallops.

9 participants