Scalability limits in the Resource Processor #4293

Open
TonyWildish-BH opened this issue Jan 31, 2025 · 3 comments
Labels
question Further information is requested

Comments

@TonyWildish-BH
Contributor

Description

In my scalability tests last year, I ran a script that attempted to create dozens of resources - workspaces in this case. With the Resource Processor at the default setting of a max pool size of one, my workspaces were all created, but of course I had to wait a long time, as only 5 processes were running at a time.

I tried enlarging the Resource Processor pool to see if I could create more resources in parallel. I went into the Azure portal and manually increased the pool size from max 1 to max 4, then re-ran my tests. It did indeed try to create 20 workspaces in one go, but it failed with Terraform errors because the APIs were being throttled by Azure, and resources were left in a bad state. Unfortunately, I no longer have the logs, so I can't give the precise message. However, I do recall that Terraform was not handling the throttling well.

Anyway, my question is: how can I increase the parallelism of the Resource Processor without Terraform falling over?

This likely requires two things:

  • Better retry handling in Terraform, so it doesn't crash and burn when it doesn't need to. That would prevent the failures, but not speed things up.
  • Relaxed throttling limits on the Azure APIs, which would speed things up.
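As an illustration of the first point, here is a minimal sketch of retry handling with exponential backoff and jitter. It is not how Terraform or any provider implements retries; `ThrottledError` and `retry_with_backoff` are hypothetical names introduced only for this example, and the delay values are assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical marker for an HTTP 429 / TooManyRequests response."""


def retry_with_backoff(operation, max_attempts=5, base_delay=2.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter keeps parallel workers from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

As noted above, this kind of retry only avoids the failure; it does not make the work finish any faster.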

Steps

The steps I have tried are:

  1. Increase the size of the Resource Processor pool in the Azure portal.
  2. Write a Bash script to use the tre CLI to create 20-30 workspaces in a tight loop, using --no-wait so they queue.
  3. Watch as the workspaces fail in various stages, due to throttling of terraform API calls by Azure.
TonyWildish-BH added the question label Jan 31, 2025
@marrobi
Member

marrobi commented Jan 31, 2025

Hi @TonyWildish-BH, there are some scenarios where parallel operations hit transient errors - #3177 (comment)

When it's an Azure platform issue, all we can do is try to work around it by blocking parallel operations or throttling.

If you can reproduce the scenario and provide the specific errors, then we can look at next steps. Thanks.
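To make "blocking parallel operations" concrete, here is a rough sketch of capping how many operations run at once. It is not the Resource Processor's actual mechanism; `run_bounded` and `max_parallel` are hypothetical names for this example, and the tasks are assumed to be plain zero-argument callables.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_parallel=2):
    """Run zero-argument callables with at most max_parallel in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(task) for task in tasks]
        results = []
        # Collect in submission order. Exceptions are returned rather than
        # raised, so one failed task does not hide the outcome of the rest.
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:
                results.append(exc)
        return results
```

Setting `max_parallel=1` gives fully serial execution, which is the "blocking" end of the trade-off.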

@TonyWildish-BH
Contributor Author

Here's an example. This was with the resource-processor pool upped to 5 instances, and attempting to create 25 workspaces in parallel. The first succeeded, all the others failed with this same error:


```
Error message: Failed to retrieve resource with azapi_resource.shared_storage, on storage.tf line 37, in resource "azapi_resource" "shared_storage": 37: resource "azapi_resource" "shared_storage" {

checking for presence of existing Resource: (ResourceId  "/subscriptions/*******/resourceGroups/rg-sdetw34-ws-ab87/providers/Microsoft.Storage/storageAccounts/stgwsab87/fileServices/default/shares/vm-shared-storage" / Api Version "2023-05-01"): GET https://management.azure.com/subscriptions/*******/resourceGroups/rg-sdetw34-ws-ab87/providers/Microsoft.Storage/storageAccounts/stgwsab87/fileServices/default/shares/vm-shared-storage
--------------------------------------------------------------------------------
RESPONSE 429: 429 Too Many Requests
ERROR CODE: TooManyRequests
--------------------------------------------------------------------------------
{
  "error": {
  "code": "TooManyRequests", "message": "The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits"
 }
}
--------------------------------------------------------------------------------
Error: creating/updating Scoped Event Subscription (Scope: "/subscriptions/*******/resourceGroups/rg-sdetw34-ws-ab87/providers/Microsoft.Storage/storageAccounts/stalimappwsab87"
Event Subscription Name: "import-approved-blob-created-ab87"): polling after CreateOrUpdate: polling failed: the Azure API returned the following error:
Status: "Failed"
Code: "Storage Notification"
Message: "The attempt to configure storage notifications for the provided storage account stalimappwsab87 failed. Please ensure that your storage account meets the requirements described at https://aka.ms/storageevents
```
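The output above contains two different failures: a throttling response (429) and an Event Subscription error. A small sketch of how a script might tell them apart when scanning deployment output follows. `is_throttling_error` is a hypothetical helper, and the two markers it looks for are taken from the error text in this comment; other Azure services may word throttling responses differently.

```python
import re

# Markers taken from the error output quoted in this comment.
_THROTTLE_PATTERNS = (
    re.compile(r"RESPONSE 429"),
    re.compile(r"\bTooManyRequests\b"),
)


def is_throttling_error(output: str) -> bool:
    """Return True if the text looks like an Azure throttling response."""
    return any(pattern.search(output) for pattern in _THROTTLE_PATTERNS)
```

A check like this could decide whether a failed deployment is worth retrying later or needs manual investigation.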

@marrobi
Member

marrobi commented Feb 7, 2025

OK, it looks like you are hitting the "Storage account management operations (list) | 100 per 5 minutes" limit, which applies per subscription/region.
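For illustration, a client-side pacer that stays under a budget such as 100 calls per 5 minutes could look like the sketch below. `SlidingWindowLimiter` is a hypothetical class, not part of Azure TRE or Terraform. It only counts calls made through it, whereas the Azure limit covers every caller in the subscription/region, and the calls Terraform makes happen inside the provider, so this shows the pacing idea rather than a drop-in fix.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls acquisitions in any window_seconds span."""

    def __init__(self, max_calls=100, window_seconds=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # timestamps of recent acquisitions

    def acquire(self):
        """Block until one more call fits inside the window."""
        while True:
            now = self._clock()
            # Forget calls that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self._window:
                self._stamps.popleft()
            if len(self._stamps) < self._max_calls:
                self._stamps.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self._window - (now - self._stamps[0]))
```

Calling `acquire()` before each management request would spread a burst of requests over time instead of letting them fail.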

This might have started happening when we switched to AD auth for Terraform; access keys are disabled in many environments because they are less secure (they can be shared).


It looks like it's being worked on here - Azure/terraform-provider-azapi#691.

If deploying this many workspaces in succession is a requirement for you, I suggest contributing to the issue above; once it is resolved, we can pull in the fix.


2 participants