[FormRecognizer] Service is throttling #31914

Closed
kinelski opened this issue Oct 20, 2022 · 4 comments
Labels: Cognitive - Form Recognizer, Service Attention, Service

kinelski commented Oct 20, 2022

Description

This issue collects multiple failures we've seen in the Form Recognizer test pipeline over the last couple of months. All of them are believed to share the same underlying cause: throttling on the service side. They have all been reported to the Form Recognizer service team and are under investigation.

Failures

None of the failures described below happens deterministically, and each one affects multiple tests.

Content is not accessible

  • API version: 2022-08-31
  • Thrown when: sending the LRO POST request for BuildModel (in DocumentModelAdministrationClient).
  • Frequency: daily
  • Error message example:
Azure.RequestFailedException : Invalid request.
Status: 400 (Bad Request)
ErrorCode: InvalidRequest

Content:
{
  "error": {
    "code": "InvalidRequest",
    "message": "Invalid request.",
    "innererror": {
      "code": "ContentSourceNotAccessible",
      "message": "Content is not accessible: Could not retrieve build data within 60 seconds."
    }
  }
}
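
When triaging failures like this one, it can help to match on the inner error code rather than the outer `InvalidRequest`, which is generic. The following is a minimal Python sketch, assuming the response body has the shape shown above; the helper name `inner_error_code` is hypothetical and not part of any SDK.

```python
import json


def inner_error_code(body: str):
    """Return the innermost error code in a service error body, or None."""
    try:
        node = json.loads(body).get("error")
    except (json.JSONDecodeError, AttributeError):
        # Empty body, invalid JSON, or a JSON value that is not an object.
        return None
    code = None
    # Walk down nested "innererror" objects, keeping the deepest code seen.
    while isinstance(node, dict):
        code = node.get("code", code)
        node = node.get("innererror")
    return code


body = """{"error": {"code": "InvalidRequest", "message": "Invalid request.",
  "innererror": {"code": "ContentSourceNotAccessible",
    "message": "Content is not accessible: Could not retrieve build data within 60 seconds."}}}"""
print(inner_error_code(body))
```

The same helper also handles the flat v2.1 bodies, where there is no `innererror` and the outer code is returned.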

Generic error during training

  • API version: 2.1
  • Thrown when: sending the LRO GET request while polling CreateCustomFormModelOperation.
  • Frequency: daily
  • Error message example:
Azure.RequestFailedException : Invalid model created with ID 94ae7f2b-d7c1-4509-9df1-00f7e802d956
Status: 200 (OK)
ErrorCode: 3014

Additional Information:
error-0: 3014: Generic error during training.

Content:

Could not access Azure blob storage account

  • API version: 2.1
  • Thrown when: sending the LRO POST request for StartTraining (in FormTrainingClient).
  • Frequency: appears roughly one day every 1.5 weeks, but affects multiple v2.1 tests on that day. It is always accompanied by the "Managed Identity credential was rejected by the storage service" error described below.
  • Error message example:
Azure.RequestFailedException : Could not access Azure blob storage account.
Status: 400 (Bad Request)
ErrorCode: 2011

Content:
{"error":{"code":"2011","message":"Could not access Azure blob storage account."}}

Managed Identity credential was rejected by the storage service

  • API version: 2.1
  • Thrown when: sending the LRO GET request while polling CreateCustomFormModelOperation.
  • Frequency: appears roughly one day every 1.5 weeks, but affects multiple v2.1 tests on that day. It is always accompanied by the "Could not access Azure blob storage account" error described above.
  • Error message example:
Azure.RequestFailedException : Invalid model created with ID 5a1a952e-58a6-4d5c-80db-d6e79696f49b
Status: 200 (OK)
ErrorCode: 2012

Additional Information:
error-0: 2012: Managed Identity credential was rejected by the storage service.

Content:

Operation exceeded maximum processing time

  • API version: 2.1
  • Thrown when: sending the LRO GET request while polling CreateCustomFormModelOperation.
  • Frequency: usually accompanies the "Could not access Azure blob storage account" and "Managed Identity credential was rejected by the storage service" errors described above, but only affects one or two tests.
  • Error message example:
Azure.RequestFailedException : Invalid model created with ID c97a11b9-b61f-4077-8d07-4cca37e4a254
Status: 200 (OK)
ErrorCode: 3013

Additional Information:
error-0: 3013: Operation exceeded maximum processing time.

Content:

Action items

To prevent the InvalidRequest and 3014 errors from breaking the pipeline daily, we are suppressing them with the IgnoreServiceError attribute in our test project. The attribute is set at the class level (instead of on single methods) because these errors can happen in any test that builds a model, which includes most of our tests.

Once the service team has fixed this issue on their side, we must remove those attributes from the following classes:

  • DocumentModelAdministrationClientLiveTests
  • DocumentAnalysisClientLiveTests
  • DocumentAnalysisSamples
  • FormRecognizerSamples
  • FormTrainingClientLiveTests
  • OperationsLiveTests
  • RecognizeCustomFormsLiveTests

v-xuto commented Mar 8, 2023

@kinelski What is the current progress on this issue?

joseharriaga commented

What is the likelihood that a test that encounters one of these issues would pass if retried?

I've been seeing flaky responses from the text analytics service too, and:

  • Just like you have here, most of them follow specific, known patterns.
  • The tests succeed on a retry virtually always.

Here's what I did:

  1. Reported them to the service team.
  2. Created the RetryOnErrorAttribute (based on some code that Jesse shared with me 😊). It's basically a duplicate of the RetryAttribute from NUnit, and the only differences are:
    2.1. Instead of retrying on failed asserts, it retries on an error (such as an exception) combined with a configurable condition.
    2.2. If a test continues to fail with the same pattern after a configurable number of tries, the test is marked as inconclusive.
  3. I put this attribute in the core test framework so other libraries can re-use it.
  4. Created the RetryOnInternalServerErrorAttribute for the specific use case of text analytics. Notice how I check for three different known patterns as part of the retry condition.
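
The behavior described in items 2.1 and 2.2 can be sketched roughly as below. This is a Python analogue, not the actual RetryOnErrorAttribute from the core test framework; `retry_on_error`, `Inconclusive`, and `FlakyServiceError` are hypothetical names, and the retry condition is passed in as a plain predicate.

```python
import functools


class Inconclusive(Exception):
    """Hypothetical marker: the test kept hitting a known flaky pattern."""


def retry_on_error(tries, should_retry):
    """Retry when an exception satisfies should_retry; give up as inconclusive."""
    if tries < 1:
        raise ValueError("tries must be at least 1")

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last = None
            for _ in range(tries):
                try:
                    return func(*args, **kwargs)
                except AssertionError:
                    raise  # failed asserts are real failures in this sketch
                except Exception as err:
                    if not should_retry(err):
                        raise  # unknown pattern: surface it unchanged
                    last = err
            raise Inconclusive(f"still failing after {tries} tries: {last}") from last

        return wrapper

    return decorate


class FlakyServiceError(Exception):
    """Hypothetical exception type for a known transient service failure."""


attempts = []


@retry_on_error(tries=3, should_retry=lambda e: isinstance(e, FlakyServiceError))
def build_model():
    attempts.append(1)
    if len(attempts) < 3:
        raise FlakyServiceError("Generic error during training.")
    return "model-ready"


print(build_model(), len(attempts))
```

Raising a separate `Inconclusive` exception keeps persistent known-pattern failures distinguishable from genuine test failures, which is the point of item 2.2.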

I wonder if something like this would help here?

github-actions commented

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @ctstone @vkurpad.


Hi @kinelski, we deeply appreciate your input into this project. Regrettably, this issue has remained unresolved for over 2 years and inactive for 30 days, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.

github-actions bot closed this as not planned on Oct 30, 2024
github-actions bot locked and limited conversation to collaborators on Oct 30, 2024