
Activity results lost under high load #46

@davkrall-futureordering

Description

Background
I tested my Azure Durable Functions app with a dedicated Durable Task Scheduler (DTS) instance as the backend. During a load test, I saw unexpected errors indicating that activity results were being lost; see the description and the attached reproduction project below. Under normal load, the application worked as expected.

Description

  1. An activity writes an entity to a Table storage account and then logs a success message.
  2. The following error is logged (IDs redacted):
TaskActivityDispatcher-ID: Unhandled exception with work item 'WORK_ITEM_ID': Grpc.Core.RpcException: Status(StatusCode="NotFound", Detail="Work item 'WORK_ITEM_ID' not found")
   at Microsoft.DurableTask.AzureManagedBackend.AzureManagedOrchestrationService.<>c__DisplayClass63_0.<<CompleteTaskActivityWorkItemAsync>b__0>d.MoveNext() in /_/src/SDK/Microsoft.DurableTask.AzureManagedBackend/AzureManagedOrchestrationService.cs:line 1006
--- End of stack trace from previous location ---
   at Microsoft.DurableTask.AzureManagedBackend.AzureManagedOrchestrationService.ExecuteWithRetryAsync(Func`1 action, String operationName, Object request, Func`2 summarizeGrpcRequestFunction) in /_/src/SDK/Microsoft.DurableTask.AzureManagedBackend/AzureManagedOrchestrationService.cs:line 1546
   at Microsoft.DurableTask.AzureManagedBackend.AzureManagedOrchestrationService.CompleteTaskActivityWorkItemAsync(TaskActivityWorkItem workItem, TaskMessage responseMessage) in /_/src/SDK/Microsoft.DurableTask.AzureManagedBackend/AzureManagedOrchestrationService.cs:line 1005
   at DurableTask.Core.TaskActivityDispatcher.OnProcessWorkItemAsync(TaskActivityWorkItem workItem) in /_/src/DurableTask.Core/TaskActivityDispatcher.cs:line 279
   at DurableTask.Core.TaskActivityDispatcher.OnProcessWorkItemAsync(TaskActivityWorkItem workItem) in /_/src/DurableTask.Core/TaskActivityDispatcher.cs:line 302
   at DurableTask.Core.WorkItemDispatcher`1.ProcessWorkItemAsync(WorkItemDispatcherContext context, Object workItemObj) in /_/src/DurableTask.Core/WorkItemDispatcher.cs:line 373

Backing off for 1 seconds until 5 successful operations
  3. The following warning is logged (IDs redacted):
Abandoning activity work item for [ACTIVITY_NAME#4] of orchestration 'ORCHESTRATION_ID' with completion token COMPLETION_TOKEN.
  4. The orchestration retries and calls the same activity again.
  5. The activity fails to write to Table storage because the entity already exists (status code 409, error code EntityAlreadyExists).
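
To make the sequence above concrete, here is a minimal Python simulation of it. This is only a sketch, not code from the attached .NET project: `FakeTable`, `EntityAlreadyExists`, `naive_activity`, `idempotent_activity` and `run_with_lost_completion` are all hypothetical names, and the table is an in-memory dictionary rather than a real Table storage client. It shows why an insert-only activity hits a conflict when it is re-run after its first result is lost, and how a write that tolerates its own earlier success might avoid that symptom. It does not address the lost result itself.

```python
class EntityAlreadyExists(Exception):
    """Stand-in for the 409 EntityAlreadyExists response from Table storage."""
    status_code = 409


class FakeTable:
    """Hypothetical in-memory stand-in for a Table storage table."""

    def __init__(self):
        self.rows = {}

    def add_entity(self, key, value):
        # Insert-only semantics: fails if the key is already present.
        if key in self.rows:
            raise EntityAlreadyExists(key)
        self.rows[key] = value


def naive_activity(table, key, value):
    # Mirrors the reported behaviour: a plain insert, so a re-run conflicts.
    table.add_entity(key, value)
    return "written"


def idempotent_activity(table, key, value):
    # Treats a conflict as success when the stored entity matches what this
    # attempt would have written, so a re-run after a lost completion is safe.
    try:
        table.add_entity(key, value)
    except EntityAlreadyExists:
        if table.rows[key] != value:
            raise
    return "written"


def run_with_lost_completion(activity, table, key, value):
    # First attempt: the write succeeds, but its result never reaches the
    # orchestration (simulated here by discarding the return value).
    activity(table, key, value)
    # Second attempt: the same activity is dispatched again.
    return activity(table, key, value)
```

With `naive_activity`, the second attempt raises `EntityAlreadyExists`, matching step 5. With `idempotent_activity`, the second attempt returns normally and the table still holds a single entity.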

Context
App Service Plan: P1v3, 3 instances
Function app: Linux, .NET 8 isolated
AzureFunctionsJobHost__extensions__durableTask__maxConcurrentActivityFunctions: 30
AzureFunctionsJobHost__extensions__durableTask__maxConcurrentOrchestratorFunctions: 20
AzureFunctionsJobHost__extensions__durableTask__storageProvider__partitionCount: 4 (note: we don't know if this applies to DTS)
DTS: Dedicated SKU, 1 capacity unit

As mentioned, this behaviour did not occur under normal load, and we have never observed it with the Azure Storage backend. Attached is a minimal project with which I could reproduce the conflicts during the retried storage account writes, although I have not yet managed to reproduce the same logs. It is also worth mentioning that I could only reproduce the behaviour after adding durable entities to this project.

TroubleshootingDts.zip
