Activity Function getting called again even after the Orchestrator has completed. #839
Done some more investigation on this. What appears to be happening is that the functions are finishing, but they are not completing the messages in the work-item queue. This was taken from Storage Explorer after the orchestrator had completed. We then had 5 activities run again after 5 minutes, when (I assume) the function timed out, even though it had already returned to the orchestrator. After this the queue was empty, as you would expect.

I'm going to try updating the function extension and will report back if it's still an issue after that. Unfortunately I cannot reproduce this in a smaller example outside of our actual project.

EDIT: Still happens on 1.8.3.
@ConnorMcMahon is there any information I can give you to help move this along? We're seeing a lot of these now and have had to write guards into most activities to make sure they don't run twice, returning the output of the previous run instead. To test this we just did a run that creates quite a lot of sub-orchestrators (thousands, each with at least one activity) and can see 100+ instances of duplicate activities being called.

As a note to the above, we also see instances where an activity runs but doesn't return to the orchestrator, then runs a second time and does return.
@dantheman999301 Sorry for the delay on this. In general, you need to make your activity functions idempotent, as we guarantee at-least-once, not exactly-once, execution. That being said, activities executing after orchestrations finish does seem a bit odd and is something for us to investigate. Just be aware that even if we fix this behavior, we will not be able to guarantee that your activity functions execute once and only once.
Thanks for getting back to me, @ConnorMcMahon. I can understand the odd one being run a second time, but we're seeing a lot of them: since the start of the month I can see 261 duplicate operation attempts. The pattern we tend to see from the logs is that:
I can understand a message getting delivered twice, but this behaviour does not seem to be that. Another hint at what's happening is that the activity tends to run again after roughly the time it would take for the function to time out.
More information: we've actually had another issue, as it appears the workaround we did does not work correctly. To give some context, what we're doing with durable functions is taking a file that has a number of entities that need to be uploaded to a finance system. We have orchestrators that are very similar to the following:

```csharp
public class UpsertItemOrchestrator
{
[FunctionName(nameof(UpsertItemOrchestrator))]
public static async Task<ActivityResponse<Item>> RunOrchestrator(
[OrchestrationTrigger] DurableOrchestrationContextBase context,
ILogger logger)
{
var (fileId, item) = context.GetInput<(Guid, Item)>();
using (logger.BeginScope(new Dictionary<string, object>
{
["FileId"] = fileId
}))
{
var itemOperationId = context.NewGuid();
var response = await context.CallActivityWithRetryAsync<ActivityResponse<Item>>(nameof(Activities.UpsertItem),
            new RetryOptions(TimeSpan.FromSeconds(1), 3), (fileId, itemOperationId, item));
if (!response.IsSuccess)
{
return new ActivityResponse<Item>
{
Value = item,
Error = new ActivityResponseError
{
Code = response.Error.Code,
Message = response.Error.Message,
Type = response.Error.Type
},
IsSuccess = false,
RowNumber = response.RowNumber
};
}
}
return new ActivityResponse<Item>
{
Value = item,
IsSuccess = true,
RowNumber = item.RowNumber
};
}
}
```

The key part here is that, to work around the orchestrator just not picking up that an activity had completed, we added retries (the `CallActivityWithRetryAsync` call above). In the activity we then have a bit of code wrapped around each call which checks whether the operation has been run before:

```csharp
private async Task<ActivityResponse<TResult>> RunOperation<TResult>(
Guid fileId,
Guid operationId,
DurableActivityContextBase context,
CloudTable cloudTable,
CloudBlobContainer blobContainer,
ILogger logger,
Func<Task<ActivityResponse<TResult>>> operation)
{
var operationEntity = new OperationEntity { PartitionKey = fileId.ToString(), RowKey = operationId.ToString() };
var (alreadyRun, retrievedEntity) = await HasOperationBeenRun(cloudTable,
operationEntity);
if (alreadyRun)
{
logger.LogWarning("Duplicate operation attempt {instanceId} {operationId}", context.InstanceId, operationId);
return await retrievedEntity.GetDataAsObject<ActivityResponse<TResult>>(blobContainer);
}
var result = await operation();
operationEntity.Data = JsonConvert.SerializeObject(result);
await SaveOperation(cloudTable, blobContainer, operationEntity);
return result;
}
```

Fairly simple. Once we implemented this, we could see it occurring frequently in our logs. In a file that amounted to roughly 1,800 activities, we detected 93 duplicates. That's ~5% duplicate operations, which seems high. It is also worth noting that these tend to come in blocks.

However, we have now started to see something even stranger, which is duplicates that have different operation ids. Below is our table where we record the operations being attempted, and the accompanying logs from Durable Functions. We saw 138 of these, giving us a grand total of 231 duplicate attempts (12% of total activities). That is very high. Most of these happened around the same times as the other duplicates.

For now I'm going to have to assign an ID for each operation much, much earlier; see the sketch below. I really do not think this is to do with a message accidentally getting delivered more than once.
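A minimal sketch of what "assigning the ID earlier" could look like, derived from the fields already shown above (`fileId`, `RowNumber`); the helper name and the MD5-based scheme are assumptions, not our actual code:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class OperationIds
{
    // Derive the operation id deterministically from stable inputs instead of
    // calling context.NewGuid() inside the sub-orchestrator, so a duplicated
    // execution computes the same id and the idempotency guard can catch it.
    public static Guid ForRow(Guid fileId, int rowNumber)
    {
        using (var md5 = MD5.Create())
        {
            var hash = md5.ComputeHash(Encoding.UTF8.GetBytes($"{fileId:N}:{rowNumber}"));
            return new Guid(hash); // MD5 produces exactly the 16 bytes a Guid needs
        }
    }
}
```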
@dantheman999301 To clarify, the flow of error you described in your second-to-last post is technically possible.
This can happen if, in step 2, the activity completes executing but the instance of the functions host that executed it dies before it is able to persist its result as a message in the orchestrator's control queue. This race condition is definitely possible, but it should be very rare, and 5–12% duplicate results is definitely abnormal.

Also, the experience where you are seeing different operation ids is definitely not something you should be seeing. One possibility is that you are starting multiple orchestrations that share the same orchestration instance id. This could cause both of the odd behaviors you are seeing. I would need to dive into the logs to see if this is happening. Can you provide a more recent orchestration instance id for an instance that experienced this odd behavior?
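For what it's worth, if shared instance ids are the suspect, a hedged sketch of a client-side guard (Durable Functions 1.x API; the orchestrator name is hypothetical):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class SafeStart
{
    // Only start the orchestration if no instance with this id is in flight;
    // starting again with the same id would overwrite the running history.
    public static async Task<string> StartOnceAsync(
        DurableOrchestrationClientBase client, string instanceId, object input)
    {
        var status = await client.GetStatusAsync(instanceId);
        if (status != null &&
            (status.RuntimeStatus == OrchestrationRuntimeStatus.Running ||
             status.RuntimeStatus == OrchestrationRuntimeStatus.Pending))
        {
            return instanceId; // already running; don't start a duplicate
        }

        return await client.StartNewAsync("UpsertFileOrchestrator", instanceId, input);
    }
}
```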
That's good to know. Essentially, the guard we've written should protect against that. Just to give you some more context on exactly what we're doing:
Essentially just a workflow with large fan-outs.
This is interesting. Doing further digging this morning on one of the above sub-orchestrators that demonstrated the issue shows that, for some reason, the orchestrator started twice, 6 minutes apart, scheduling the same activity. Now compare that to the history of the same instance in `DurableFunctionsHubHistory`:
Now, the cloud role instance does change between those two orchestrators. I don't know if that is a hint at something being wrong, but given that there are 6 minutes between them, it doesn't strike me as much of a race condition.
Whilst we do control the instance id of the top-level orchestrator, we don't set the instance ids of the sub-orchestrators. However, I'm happy to share any instance ids! The examples I've been pasting have been from:

EDIT: And here is another run we just did in our test environment to see if we can replicate this easily. We sort of can, but it's not very consistent. Attached below are the instance ids of the sub-orchestrators along with how many times I can see that they started. duplicate-multi-orchestrator.txt

It's worth noting as well that this particular problem, where the orchestrator seems to run twice minutes apart, is different from the original issue I raised with activities running twice, although they end up with the same result.

Another example where we correctly detected that a duplicate was attempted (from the same run): this is one of the original 93 we detected. In this one there is only one orchestrator, which you can see completes fine. Then, for whatever reason, nearly 15 minutes after it has completed, the activity randomly runs again. Main orchestrator instance id -

Just in case, here is a list of other sub-orchestrators where we have detected duplicates since 1st October.
Thanks for the details, @dantheman999301! That should be enough information to help us really dig in. I'm actually going to prioritize the duplicate orchestration calls, as that is a far trickier problem for you to mitigate from your end. One thing you can attempt on your side is to provide your own sub-orchestration instance ids (see the sketch below). If there is a bug in our logic for creating unique sub-orchestrator ids, then providing your own might sidestep the problem. The lingering activity executing after the orchestration finishes, while annoying, should at least be mitigated by the idempotency check you have implemented. I will tackle that second.
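A sketch of that suggestion applied to the orchestrator above (the id scheme combining the parent instance id and row number is an assumption; any deterministic, unique-per-child value works):

```csharp
// Inside the parent orchestrator body, per item: pass an explicit instance id
// to the sub-orchestration instead of letting the framework generate one.
var childInstanceId = $"{context.InstanceId}:upsert:{item.RowNumber}";
var response = await context.CallSubOrchestratorAsync<ActivityResponse<Item>>(
    nameof(UpsertItemOrchestrator),
    childInstanceId,
    (fileId, item));
```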
@dantheman999301 We think we have found what happened (at least with orchestration "48617f37-1603-428b-ae1a-1b758322d6a1").

The problem(s): when your parent orchestration scheduled its ~1,800 sub-orchestrations, what happened behind the scenes is that we first sent all of those start-sub-orchestration messages to the control queue, and then updated the instance history for the parent orchestration so that future replays of the parent know those messages have already been sent. These can't be done in the same transaction, as Azure Storage does not allow transactions across Queues and Tables. We send the queue messages first and then update the history table so that we don't drop messages if a failure happens between the two operations.

If the race condition hits where the process fails between the queue messages being sent and the instance history table being updated, future replays of the orchestration will resend those messages. Normally this is fine, because we handle duplicate events in most scenarios. However, for start-orchestration messages we unfortunately overwrite the existing history and start the new orchestration from scratch. We are looking to change this behavior in Durable Functions 2.0, but it is a breaking change. We hopefully will backport it to Durable Functions 1.8.4 as an optional configuration via host.json.

As for why you are hitting this race condition, it looks like it has to do with this deadlock in the Storage SDK. I can't confirm this without a memory dump, which is hard to get for a consumption function app; however, it seems very likely based on the behavior I am seeing. I see the first 11 batches of the history-table update (we have to spread ~1,800 row updates over 18+ batches, because we can only update 100 rows at a time), but after 11 the thread seems to get stuck. Then, when other workers grab the partition and resume execution, they resend the orchestration messages. We have a partial fix in the works for this, but it just reboots the worker when we get stuck in this deadlock. Unfortunately, that means the history table will remain only partially updated, which means you will still see duplicate sub-orchestrations. When a duplicate sub-orchestration starts up, we unfortunately can get different values for `context.NewGuid()`, which explains the duplicates with different operation ids.

TLDR: a series of design decisions/bugs in the various dependencies of Durable Functions all aligned to make this possible.

Workarounds: as suggested above, provide your own sub-orchestration instance ids and keep your idempotency guard in place.

Hope that helps, and sorry for the info dump. Let me know if you need any clarifications.
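To make the ordering concrete, here is an illustrative sketch (not the actual DurableTask source) of the checkpoint sequence described above, using the same classic Storage SDK types as the code earlier in this thread:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.Table;

public static class CheckpointSketch
{
    public static async Task CheckpointAsync(
        CloudQueue controlQueue,
        CloudTable historyTable,
        IReadOnlyList<CloudQueueMessage> outbox,
        IReadOnlyList<DynamicTableEntity> historyRows)
    {
        // 1. Send all start-sub-orchestration messages first, so a crash
        //    between the two steps never drops a message.
        foreach (var message in outbox)
        {
            await controlQueue.AddMessageAsync(message);
        }

        // <-- Failure window: if the worker dies (or the Storage SDK
        //     deadlocks) here, the messages above are live but the history
        //     below is never (or only partially) written. A replay on another
        //     worker cannot tell the messages were sent and resends them, and
        //     duplicate start messages restart the children from scratch.

        // 2. Persist the history in batches of at most 100 rows (the Azure
        //    Table batch limit) -- ~1,800 rows means 18+ batches, so a stuck
        //    thread can leave the table half-updated.
        for (var i = 0; i < historyRows.Count; i += 100)
        {
            var batch = new TableBatchOperation();
            foreach (var row in historyRows.Skip(i).Take(100))
            {
                batch.InsertOrReplace(row);
            }
            await historyTable.ExecuteBatchAsync(batch);
        }
    }
}
```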
@ConnorMcMahon massive thanks for the thorough response, I can't imagine that was easy to track down! No need to apologise for the info dump, that's all good information. We actually thought about doing the
I think some of the changes we are planning (Azure/durabletask#324) that reverse the persistence order (save history first, then send messages) would also fix this problem. For example, sub-orchestrations would be guaranteed to resume instead of starting over, because the algorithm detects the duplicate ExecutionStartedEvent and does not replace the original sub-orchestration.
This should now be at least partially resolved in Durable 2.x, due to us no longer overwriting running orchestration instances when calling
Describe the bug
We have multiple examples of an orchestrator completing and then, sometime afterwards (usually minutes, but sometimes over an hour depending on load), the activity function running again, despite the orchestrator being marked as completed.
I've only seen this happen with sub-orchestrators, but those are the ones we use heavily, so that might be why.
Investigative information
If deployed to Azure
To Reproduce
Not entirely sure, but it seems to be roughly the following:

1. Create an orchestrator that fans out to a large number of sub-orchestrators, each calling at least one activity.
2. `WhenAll` the sub-orchestrators.

Here's a gist that is a very simplified version of what we're doing:
https://gist.github.com/dantheman999301/4454cdcaff2fd87f67128cc7aa4c0370
Instead of the loop, though, we pull a big list of tasks we need to do from an external source, change them a little, and then feed them into the sub-orchestrators. A much-simplified sketch of the fan-out pattern is below.
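In spirit (the gist is the authoritative repro; the names and count here are illustrative), the shape is:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class FanOutRepro
{
    [FunctionName("FanOutOrchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContextBase context)
    {
        var tasks = new List<Task>();
        for (var i = 0; i < 1000; i++)
        {
            // Each sub-orchestrator calls at least one activity.
            tasks.Add(context.CallSubOrchestratorAsync("SubOrchestrator", i));
        }

        // Fan back in; duplicates show up as the same sub-orchestration
        // instance starting more than once.
        await Task.WhenAll(tasks);
    }
}
```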
Expected behavior
Activities only run once
Actual behavior
Activities run multiple times, even after the orchestration is complete.
Screenshots
In this example we can see that ~5 minutes after the orchestrator has marked itself as completed, the activity runs again.
In this example we see the same situation, only an hour later (this was under larger load).
I have more examples if necessary.
What is weirder is that if I look at these in the DurableFunctionsHubHistory table, I can't actually see those additional functions being called, although I'm not completely familiar with these tables so I may be querying them wrong.
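For reference, a hedged sketch of querying the history table directly with the classic Table SDK (`historyTable` and `instanceId` are assumed to be supplied by the caller); rows for an instance share the instance id as their partition key:

```csharp
// List every history event recorded for one orchestration instance.
var query = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.GenerateFilterCondition(
        "PartitionKey", QueryComparisons.Equal, instanceId));

TableContinuationToken token = null;
do
{
    var segment = await historyTable.ExecuteQuerySegmentedAsync(query, token);
    token = segment.ContinuationToken;
    foreach (var row in segment.Results)
    {
        // EventType is one of the columns Durable Functions writes.
        Console.WriteLine($"{row.Timestamp:o} {row.Properties["EventType"].StringValue}");
    }
} while (token != null);
```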
So for the first screenshot we have this:
And for the second:
I had expected to see extra events after the `OrchestratorCompleted` event after it had run again.

Known workarounds
Make the activity idempotent 🤷‍♂️