azurerm_private_endpoint fails intermittently with retriable error #2 #21293
Comments
We added time_sleep between the Subnet resource and the Private Endpoint, and as expected this solved the initial issue.
However, another Subnet-related issue appeared when creating an azurerm_mssql_virtual_network_rule resource for an azurerm_mssql_server, where the MSSQL Virtual Network Rule is configured with the Subnet ID of a VM created alongside the MSSQL Server. The error we're receiving for this is:
It seems that although both the Private Endpoint and MSSQL Virtual Network Rule resources wait for their respective Subnet resources to finish creating, the Subnet itself remains in the Updating state for some time after Terraform reports its creation as complete. I suppose this time frame varies, hence the intermittent nature of the errors mentioned. |
Any insight on this? I still see this error with Private Endpoints on 3.90 |
I'm using v4.10.0 and trying to reproduce the issue with the following config:
Config: `provider "azurerm" { features { resource_group { prevent_deletion_if_contains_resources = false } } }`
I've tried multiple times and tuned the
From the error message, it appears the subnet is being updated during the PE creation. My suspicion is that other resources are being deployed in the same TF run which have no dependency on the PE, so both the PE and such a resource update the subnet at the same time. For this reason, I've also tried setting up everything first, including the vnet, subnet, storage account and the PE-related DNS zones, in the first terraform run, and then deploying the following resources in the 2nd run:
My suspicion was that each of the above resources will, to some degree, update the subnet/vnet. Inspecting the traffic shows that the PUT for each one happens simultaneously, and the result is still successful. Could someone please share a minimal reproducible config? It is also worth mentioning the region you are targeting, as Azure services can behave differently across regions. PS: If you are able to reproduce this issue, please record the correlation request ID and submit an Azure support ticket, so that the service team can point out which operation (ideally mapped to an API) causes the subnet to be in the "Updating" state. |
Good news: I've successfully reproduced this issue. Will work on a solution soon! |
I've successfully reproduced this issue (on almost every run) via the following steps:
If you are not too lucky, you'll probably encounter the following errors:
My analysis of the errors above is:
|
Any progress on a fix for this issue? |
We're still seeing this on v3.114.0. We deploy lots of resources with PEs to the same subnets. We probably have around 10+ in some subnets, likely applying in parallel. I appreciate this is not this provider's fault and a lot of issues come from Azure's API. That being said, it would be great to have this handled by the provider, as I can't imagine a way of 100% fixing it within Terraform's configuration! How is the progress on fixing this? Do we know any time frames / estimates? |
@kiweezi Our implemented workaround: |
PR #28112 depends on hashicorp/go-azure-sdk#1124. Cc @jackofallops. |
Is there an existing issue for this?
Community Note
Terraform Version
1.2.3
AzureRM Provider Version
3.50.0
Affected Resource(s)/Data Source(s)
azurerm_private_endpoint
Terraform Configuration Files
Debug Output/Panic Output
`Error: waiting for creation of Private Endpoint "mwe-test-1db-pe" (Resource Group "mwe-test-01-rg"): Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/virtualNetworks/mwe-test-01-vnet/subnets/mwe-test-1db-subnet used by resource /subscriptions/***/resourceGroups/mwe-test-01-rg/providers/Microsoft.Network/networkInterfaces/mwe-test-1db-pe.nic.771d4852-46ce-48b3-80ed-f98344f7f778 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."}]`
Expected Behaviour
Either Terraform should automatically retry retriable errors instead of failing, or PE interactions with the Subnet should occur only when the Subnet is in a Succeeded state (dependency issue?).
Actual Behaviour
Sometimes during provisioning of a private endpoint, we have seen the following error. Looking into the Azure portal, the Private Endpoint indeed exists and is working. However, we cannot just run terraform apply again, since it does not exist in state. We need to manually delete the PE first (or could manually import it).
Terraform logs show that the subnet resource creation was completed before the creation of the Private Endpoint.
The issue was first encountered with azurerm version 3.39.1 and is still present with the latest version at the time of writing, 3.50.0.
Steps to Reproduce
The issue appears randomly and occurs both when creating multiple Private Endpoints and when creating a single one.
Important Factoids
As mentioned in issue #16182, there is a higher chance of encountering the error when multiple Private Endpoints are being created in parallel, but it also happens when creating a single Private Endpoint. We're trying to work around the issue by deploying a time_sleep resource that depends on the Subnet resource, and adding a depends_on property on the Private Endpoint resource that points at it.
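A minimal sketch of the time_sleep workaround described above, not our exact configuration. All resource names, the region, the address ranges, the storage account used as the Private Endpoint target and the 60s delay are assumptions chosen for illustration; the delay in particular is a guess, since the time the Subnet spends in the Updating state seems to vary.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.50"
    }
    time = {
      source = "hashicorp/time"
    }
  }
}

provider "azurerm" {
  features {}
}

# Hypothetical names and region, for illustration only.
resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "westeurope"
}

resource "azurerm_virtual_network" "example" {
  name                = "example-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_subnet" "example" {
  name                 = "example-subnet"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.example.name
  address_prefixes     = ["10.0.1.0/24"]
}

# Assumed Private Endpoint target; the name must be globally unique.
resource "azurerm_storage_account" "example" {
  name                     = "examplestorageacct"
  resource_group_name      = azurerm_resource_group.example.name
  location                 = azurerm_resource_group.example.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

# Delay that starts only after Terraform reports the Subnet as created.
# 60s is an assumed value, not a verified bound.
resource "time_sleep" "wait_for_subnet" {
  depends_on      = [azurerm_subnet.example]
  create_duration = "60s"
}

resource "azurerm_private_endpoint" "example" {
  name                = "example-pe"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  subnet_id           = azurerm_subnet.example.id

  private_service_connection {
    name                           = "example-psc"
    private_connection_resource_id = azurerm_storage_account.example.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  # Forces the Private Endpoint to wait for the sleep, not just the Subnet.
  depends_on = [time_sleep.wait_for_subnet]
}
```

The explicit depends_on is needed because subnet_id alone only makes the Private Endpoint wait for the Subnet resource, not for the delay. This only reduces the chance of hitting the race; it does not guard against other resources updating the same Subnet in parallel.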
References
The bug is pretty much the same as the one described in the already closed issue #16182.