Scalability limits in the Resource Processor #4293

TonyWildish-BH · 2025-01-31T11:23:15Z

Description

In my scalability tests last year, I ran a script that attempted to create dozens of resources - workspaces in this case. With the Resource Processor at the default setting of a max pool size of one, my workspaces were all created, but of course I had to wait a long time, as only 5 processes were running at a time.

I tried enlarging the Resource Processor pool to see if I could create more resources in parallel. I just went into the Azure portal and manually increased the pool size from max 1 to max 4, then re-ran my tests. It did indeed try to create 20 workspaces in one go, but it failed with terraform errors: the APIs were being throttled by Azure, and resources were left in a bad state. Unfortunately, I no longer have the logs, so I can't give the precise message. However, I do recall that terraform was not handling the throttling well.

Anyway, my question is: How can I increase the parallelism of the Resource Processor without terraform falling over?

This likely requires two things:

Better retry-handling in terraform, so it doesn't just crash and burn when it doesn't need to. That will prevent the failures, but not speed things up.
Relaxing the throttling requirements for the Azure APIs, which would speed things up.
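
For illustration, the kind of retry handling meant in the first point can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not how terraform or the Resource Processor actually behaves: `ThrottledError`, `call_with_backoff`, and `flaky_create` are hypothetical names standing in for an HTTP 429 response and a resource-creation call. The idea is to wait and retry on throttling, honouring a server-suggested delay when one is given and otherwise backing off exponentially with jitter.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 (Too Many Requests) response."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Run operation(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # Prefer the server's hint; otherwise double the delay each attempt.
            delay = err.retry_after
            if delay is None:
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                # Jitter, so parallel workers do not all retry at the same moment.
                delay *= random.uniform(0.5, 1.0)
            sleep(delay)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_create():
        # Simulated operation: throttled twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ThrottledError()
        return "created"

    # sleep is stubbed out so the demo finishes immediately.
    print(call_with_backoff(flaky_create, sleep=lambda seconds: None))
```

As noted above, retrying like this only avoids the failures. The total time is still bounded by the throttling limit itself.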

Steps

The steps I have tried are:

Increase the size of the Resource Processor pool in the Azure portal.
Write a Bash script to use the tre CLI to create 20-30 workspaces in a tight loop, using --no-wait so they queue.
Watch as the workspaces fail in various stages, due to throttling of terraform API calls by Azure.


marrobi · 2025-01-31T12:49:50Z

Hi @TonyWildish-BH, there are some scenarios where parallel operations hit transient errors - #3177 (comment)

When it's an Azure platform issue, all we can do is try to work around it by blocking parallel operations or throttling.

If you can reproduce the scenario and provide the specific errors, then we can look at next steps. Thanks.

TonyWildish-BH · 2025-02-07T14:25:35Z

Here's an example. This was with the resource-processor pool upped to 5 instances, and attempting to create 25 workspaces in parallel. The first succeeded, all the others failed with this same error:


```
Error message: Failed to retrieve resource with azapi_resource.shared_storage, on storage.tf line 37, in resource "azapi_resource" "shared_storage": 37: resource "azapi_resource" "shared_storage" {

checking for presence of existing Resource: (ResourceId  "/subscriptions/*******/resourceGroups/rg-sdetw34-ws-ab87/providers/Microsoft.Storage/storageAccounts/stgwsab87/fileServices/default/shares/vm-shared-storage" / Api Version "2023-05-01"): GET https://management.azure.com/subscriptions/*******/resourceGroups/rg-sdetw34-ws-ab87/providers/Microsoft.Storage/storageAccounts/stgwsab87/fileServices/default/shares/vm-shared-storage
--------------------------------------------------------------------------------
RESPONSE 429: 429 Too Many Requests
ERROR CODE: TooManyRequests
--------------------------------------------------------------------------------
{
  "error": {
    "code": "TooManyRequests", "message": "The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits"
  }
}
--------------------------------------------------------------------------------
Error: creating/updating Scoped Event Subscription (Scope: "/subscriptions/*******/resourceGroups/rg-sdetw34-ws-ab87/providers/Microsoft.Storage/storageAccounts/stalimappwsab87"
Event Subscription Name: "import-approved-blob-created-ab87"): polling after CreateOrUpdate: polling failed: the Azure API returned the following error:
Status: "Failed"
Code: "Storage Notification"
Message: "The attempt to configure storage notifications for the provided storage account stalimappwsab87 failed. Please ensure that your storage account meets the requirements described at https://aka.ms/storageevents
```

marrobi · 2025-02-07T16:21:30Z

Ok, looks like you are hitting "Storage account management operations (list) | 100 per 5 minutes" per subscription/region.

It might be happening since we switched to using AD auth for terraform, as access keys are disabled in many environments because they are less secure (they can be shared).

Looks like it's being worked on here - Azure/terraform-provider-azapi#691.

If deploying this many workspaces in succession is a requirement for you, I'd suggest contributing to the issue above, as once it gets resolved we can pull in the fix.
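
To put the quoted limit in perspective: 100 operations per 5 minutes shared across 25 parallel workspace deployments leaves a budget of only 4 such calls per workspace in each window. Until the provider fix lands, the "throttling" workaround mentioned earlier in the thread could look roughly like the client-side sliding-window limiter below. This is a sketch only: `SlidingWindowLimiter` is a hypothetical helper, not part of the Resource Processor, and it only paces calls within a single process, so several Resource Processor instances would each need a share of the budget.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window` seconds (single process only)."""

    def __init__(self, limit=100, window=300.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit      # e.g. 100 operations ...
        self.window = window    # ... per 300 seconds (5 minutes)
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()   # timestamps of calls still inside the window

    def acquire(self):
        """Block until one more call fits in the window, then record it."""
        while True:
            now = self.clock()
            # Drop timestamps that have aged out of the window.
            while self.stamps and now - self.stamps[0] >= self.window:
                self.stamps.popleft()
            if len(self.stamps) < self.limit:
                self.stamps.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.stamps[0]))
```

Calling `limiter.acquire()` before each throttled API call keeps the caller under the limit at the cost of extra waiting, which matches the trade-off described in the original question: fewer failures, but no speed-up.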

TonyWildish-BH added the question Further information is requested label Jan 31, 2025
