
Activity Function getting called again even after the Orchestrator has completed. #839

Closed
deadwards90 opened this issue Jul 8, 2019 · 12 comments
Assignees
Labels
Needs: Attention 👋 Needs: Investigation 🔍 A deeper investigation needs to be done by the project maintainers.

Comments

@deadwards90

deadwards90 commented Jul 8, 2019

Describe the bug
We have multiple examples of an orchestrator completing and then sometime afterwards (minutes usually but sometimes over an hour depending on load) the activity function runs again, despite the orchestrator being marked as completed.

I've only seen this happen with suborchestrators but those are the ones we use heavily so that might be why.

Investigative information

  • Durable Functions extension version: 1.8.1, 1.8.3
  • Function App version (1.0 or 2.0): 2.0
  • Programming language used: C#

If deployed to Azure

  • Timeframe issue observed: Multiple timeframes over different days, see below
  • Function App name: ukwest-bill-proc-0b0354e76932, ukwest-bill-proc-7e56185c9cd1
  • Function name(s): UpsertSalesOrders, UpsertFulfillment
  • Region: UK West
  • Orchestration instance ID(s): cb79184c01f44cbc98f3b38b65594d90:176, 05ad7499417e46d79de6845b00126833:5458

To Reproduce
Not entirely sure, but it seems to be roughly the following:

  1. Start orchestrator.
  2. Create a few thousand suborchestrators, with each orchestrator having an activity.
  3. WhenAll the suborchestrators.

Here's a gist that is a very simplified version of what we're doing.

https://gist.github.com/dantheman999301/4454cdcaff2fd87f67128cc7aa4c0370

Instead of the loop though, we pull a big list of tasks we need to do from an external source, change them a little bit and then feed them into the suborchestrators.
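
For reference, the shape of what we're doing is roughly the following (a stripped-down sketch, not the actual gist contents; all names here are placeholders):

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class FanOutSample
{
    [FunctionName("MainOrchestrator")]
    public static async Task RunMain(
        [OrchestrationTrigger] DurableOrchestrationContextBase context)
    {
        // In reality this list comes from an external source.
        var items = context.GetInput<List<string>>();

        // Fan out: one sub-orchestrator per item.
        var tasks = items
            .Select(item => context.CallSubOrchestratorAsync<string>("ItemOrchestrator", item))
            .ToList();

        // Fan in: wait for every sub-orchestration to finish.
        await Task.WhenAll(tasks);
    }

    [FunctionName("ItemOrchestrator")]
    public static async Task<string> RunItem(
        [OrchestrationTrigger] DurableOrchestrationContextBase context)
    {
        var item = context.GetInput<string>();

        // Each sub-orchestrator runs (at least) one activity.
        return await context.CallActivityAsync<string>("ItemActivity", item);
    }
}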

Expected behavior
Activities only run once

Actual behavior
Activities run multiple times, even after orchestration is complete.

Screenshots
[screenshot]

In this example we can see that ~5 minutes after the orchestrator has marked itself as completed, the activity runs again.

[screenshot]
In this example we see the same situation, only an hour later (this was under larger load).

I have more examples if necessary.

What is weirder is that if I look at these in the DurableFunctionsHubHistory table, I can't actually see those additional function calls, although I'm not completely familiar with these tables so I may be querying this wrong.

So for the first screenshot we have this:
[screenshot]

And for the second:
[screenshot]

I had expected to see extra events after the OrchestratorCompleted event after it had run again.

Known workarounds
Make the activity idempotent 🤷‍♂

@ghost ghost added the Needs: Triage 🔍 label Jul 8, 2019
@deadwards90 deadwards90 changed the title Activity Function getting called multiple times even after the Orchestrator has completed. Activity Function getting called again even after the Orchestrator has completed. Jul 8, 2019
@deadwards90
Author

deadwards90 commented Jul 9, 2019

Done some more investigation on this.

What appears to be happening is that the functions are finishing, but they are not completing the messages in the durablefunctionshub-workitems queue.

[screenshot]

This was taken from Storage Explorer after the orchestrator had completed. Five activities then ran again after 5 minutes, when I assume the function timed out, even though each had already returned to the orchestrator. After this the queue was empty, as you would expect.

I'm going to try updating the functions extension and will report back if it's still an issue after that. Unfortunately I cannot reproduce this in a smaller example outside of our actual project.

EDIT: Still happens on 1.8.3

@ConnorMcMahon ConnorMcMahon added Needs: Investigation 🔍 A deeper investigation needs to be done by the project maintainers. and removed Needs: Triage 🔍 labels Aug 14, 2019
@ConnorMcMahon ConnorMcMahon self-assigned this Aug 14, 2019
@deadwards90
Author

deadwards90 commented Sep 4, 2019

@ConnorMcMahon is there any information I can give you to help this along?

We're seeing a lot of these now and have had to write guards into most activities to detect duplicate runs and return the output of the previous run instead.

To test this we just did a run that created quite a lot of suborchestrators (thousands, each with at least one activity), and we can see 100+ instances of duplicate activity calls.

As a note to the above, we do see instances where activities run but don't return to the orchestrator, then run a second time which does return.

@ConnorMcMahon
Contributor

@dantheman999301

Sorry for the delay on this.

In general, you need to make your activity functions idempotent, as we guarantee at-least-once, not exactly-once execution.

That being said, activities executing after orchestrations finish does seem a bit odd, and is something for us to investigate. Just be aware that even if we fix this behavior, we will not be able to guarantee that your activity functions will execute once and only once.

@deadwards90
Author

deadwards90 commented Oct 9, 2019

Thanks for getting back to me @ConnorMcMahon

I can understand the odd one being run another time, but we're seeing a lot of them. Since the start of the month I can see 261 duplicate operation attempts.

The pattern we tend to see from the logs is that:

  1. Activity runs.
  2. Activity completes and returns.
  3. Orchestrator never seems to get the message that the activity has finished.
  4. Activity runs again.
  5. Activity completes.
  6. Orchestrator does get this message.

I can understand a message getting delivered twice, but this behaviour does not seem to be that. Another hint at what's happening seems to be that the activity tends to run again after the time it would take for the function to time out.

@deadwards90
Author

Time for more information: we've actually had another issue, as it appears the workaround we implemented did not work correctly.

To give some context, what we're doing with durable functions is taking a file that has a number of entities that need to be uploaded to a finance system.

We have orchestrators that are very similar to the following:

public class UpsertItemOrchestrator
{
    [FunctionName(nameof(UpsertItemOrchestrator))]
    public static async Task<ActivityResponse<Item>> RunOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContextBase context,
        ILogger logger)
    {
        var (fileId, item) = context.GetInput<(Guid, Item)>();
        using (logger.BeginScope(new Dictionary<string, object>
        {
            ["FileId"] = fileId
        }))
        {
            var itemOperationId = context.NewGuid();

            var response = await context.CallActivityWithRetryAsync<ActivityResponse<Item>>(
                nameof(Activities.UpsertItem),
                new RetryOptions(TimeSpan.FromSeconds(1), 3),
                (fileId, itemOperationId, item));

            if (!response.IsSuccess)
            {
                return new ActivityResponse<Item>
                {
                    Value = item,
                    Error = new ActivityResponseError
                    {
                        Code = response.Error.Code,
                        Message = response.Error.Message,
                        Type = response.Error.Type
                    },
                    IsSuccess = false,
                    RowNumber = response.RowNumber
                };
            }
        }

        return new ActivityResponse<Item>
        {
            Value = item,
            IsSuccess = true,
            RowNumber = item.RowNumber
        };
    }
}

The key part here is the workaround for the orchestrator just not picking up that an activity had completed: we add var itemOperationId = context.NewGuid(); and then pass itemOperationId into the activity.

In the activity we then have a bit of code wrapped around each call which checks to see whether the operation has run before:

private async Task<ActivityResponse<TResult>> RunOperation<TResult>(
    Guid fileId,
    Guid operationId,
    DurableActivityContextBase context,
    CloudTable cloudTable,
    CloudBlobContainer blobContainer,
    ILogger logger,
    Func<Task<ActivityResponse<TResult>>> operation)
{
    var operationEntity = new OperationEntity { PartitionKey = fileId.ToString(), RowKey = operationId.ToString() };
    var (alreadyRun, retrievedEntity) = await HasOperationBeenRun(cloudTable, operationEntity);

    if (alreadyRun)
    {
        logger.LogWarning("Duplicate operation attempt {instanceId} {operationId}", context.InstanceId, operationId);
        return await retrievedEntity.GetDataAsObject<ActivityResponse<TResult>>(blobContainer);
    }

    var result = await operation();

    operationEntity.Data = JsonConvert.SerializeObject(result);
    await SaveOperation(cloudTable, blobContainer, operationEntity);

    return result;
}
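
Roughly, the activity side then wires the guard in like this (simplified; the table and blob bindings and the UpsertItemInternal call here are stand-ins for what we actually do):

[FunctionName(nameof(UpsertItem))]
public async Task<ActivityResponse<Item>> UpsertItem(
    [ActivityTrigger] DurableActivityContextBase context,
    [Table("Operations")] CloudTable operationsTable,          // placeholder binding
    [Blob("operation-data")] CloudBlobContainer blobContainer, // placeholder binding
    ILogger logger)
{
    var (fileId, operationId, item) = context.GetInput<(Guid, Guid, Item)>();

    // All of the real work happens inside RunOperation, so a duplicate
    // invocation is detected up front and the stored result is returned instead.
    return await RunOperation(fileId, operationId, context, operationsTable, blobContainer, logger,
        () => UpsertItemInternal(item)); // UpsertItemInternal stands in for the actual upsert call
}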

Fairly simple. Once we implemented this, we could see in our logs that this was occurring frequently.

[screenshot]

So in a file that essentially amounted to ~1800 activities, we detected 93 duplicates. That's 5% duplicate operations, which seems high.

It is also worth noting that these tend to come in blocks.

[screenshot]

However, now we have started to see something even stranger, which is duplicates that have different operationIds. That would mean that the Orchestrator created a new Guid and passed that through.

Below is our table where we record the operations being attempted. PartitionKey is the fileId from the first snippet, whilst RowKey is the operationId. The Data column holds whatever the original function returned (which at this point would have already finished), so we can return it again. I've highlighted two rows where you can see that the same work actually ran twice, but with different operationIds. From my knowledge of Durable Functions, this should be impossible.

[screenshot]

And the accompanying logs from Durable Functions:

[screenshot]

[screenshot]

We saw 138 of these, giving us a grand total of 231 duplicate attempts (12% of total activities). That is very high. Most of these happened around the same times as the other duplicates.

For now I'm going to have to assign an ID for each operation much, much earlier.

I really do not think this is to do with a message accidentally getting delivered more than once.

@ConnorMcMahon
Contributor

@dantheman999301

To clarify, the error flow you described in your second-to-last post is technically possible.

  1. Activity runs.
  2. Activity completes and returns.
  3. Orchestrator never seems to get the message that the activity has finished.
  4. Activity runs again.
  5. Activity completes.
  6. Orchestrator does get this message.

This can happen if in step 2 the activity completes executing, but the instance of the functions host that executed that activity dies before it is able to persist its results as a message in the orchestrator's control queue. This race condition is definitely possible, but it should be very rare, and 5-12% of duplicate results is definitely abnormal.

Also, the experience where you are seeing different operation ids is definitely not something that you should be seeing.

One possibility is that you are starting multiple orchestrations that share the same orchestration instance id. This could cause both of the odd behaviors you are seeing. I would need to dive into the logs to see if this is happening.

Can you provide a more recent orchestration instance id for an orchestration instance that experienced this odd behavior?

@ConnorMcMahon ConnorMcMahon added the Needs: Author Feedback Waiting for the author of the issue to respond to a question label Oct 9, 2019
@deadwards90
Author

deadwards90 commented Oct 10, 2019

This can happen if in step 2 the activity completes executing, but the instance of the functions host that executed that activity dies before it is able to persist its results as a message in the orchestrator's control queue. This race condition is definitely possible, but it should be very rare, and 5-12% of duplicate results is definitely abnormal.

That's good to know. Essentially the guard we've written in should protect against that.

Just to give you some more context to exactly what we're doing:

  1. We get a file which has many lines of data that need to go into another system.
  2. We split these lines into sets of operations.
  3. We start a sub-orchestrator for each set of operations. These can involve one or more activities. At the start of the month these files tend to include thousands of operations, so we have thousands of sub-orchestrators. Due to technical limitations of the third-party system, we have limited the number of concurrent sub-orchestrators and activities.
  4. Suborchestrator starts, generates an operationId for each activity and then runs the activity.
  5. Response is returned to sub-orchestrator.
  6. When all sub-orchestrators are done, main orchestrator either moves on to the next set of operations or finishes.

Essentially just a workflow with large fan-outs.

One possibility is that you are starting multiple orchestrations that share the same orchestration instance id. This could cause both of the odd behaviors you are seeing. I would need to dive into the logs to see if this is happening.

This is interesting. Further digging this morning into one of the above sub-orchestrators that demonstrated the issue shows that, for some reason, the orchestrator started twice, 6 minutes apart, scheduling the same activity each time.

[screenshot]

Now compare that to the history of the same instance in DurableFunctionsHubHistory

[screenshot]

The InstanceId for this one is a9a3bbca11694bb28bc5cd89bde0208b:1269. As you can see, the orchestrator run that started at 12:36 just does not appear in the history, whereas it has been logged in App Insights.

Now, the cloud role instance does change between those two orchestrator runs. I don't know if that is a hint at something being wrong, but given that there are 6 minutes between them, it doesn't strike me as much of a race condition.

One possibility is that you are starting multiple orchestrations that share the same orchestration instance id. This could cause both of the odd behaviors you are seeing. I would need to dive into the logs to see if this is happening.

Whilst we do control the instance id of the top-level orchestrator, we don't set the instance ids of the sub-orchestrators.

However I'm happy to share any instance ids!

The examples I've been pasting you have been from:
Main orchestrator instance id - 48617f37-1603-428b-ae1a-1b758322d6a1
Sub orchestrator instance id - a9a3bbca11694bb28bc5cd89bde0208b:1269

EDIT: And another run we just did in our test environment to see if we can replicate this easily. We sort of can, but it's not very consistent. Attached below are the instance ids of the sub-orchestrators along with how many times I can see that they started.

duplicate-multi-orchestrator.txt


It's worth noting as well that this particular problem where the orchestrator seems to be running twice minutes apart is different to the original issue I've raised with activities running twice, although they end up with the same result.

Another example where we correctly detected that a duplicate was attempted (from the same run):

[screenshot]

So this is one of the original 93 we detected. In this one there is only one orchestrator run, which you can see completes fine. Then, for whatever reason, nearly 15 minutes after it has completed, the activity randomly runs again.

Main orchestrator instance id - 48617f37-1603-428b-ae1a-1b758322d6a1
Sub orchestrator instance id - a9a3bbca11694bb28bc5cd89bde0208b:228

Just in case, here is a list of other sub-orchestrators where we have detected duplicates from 1st October.

duplicateInstanceIds-one-orchestrator.txt

@ghost ghost added Needs: Attention 👋 and removed Needs: Author Feedback Waiting for the author of the issue to respond to a question labels Oct 10, 2019
@ConnorMcMahon
Contributor

ConnorMcMahon commented Oct 10, 2019

Thanks for the details @dantheman999301! That should be enough information to help us really debug this.

I'm actually going to prioritize the duplicate orchestration calls, as that is a far trickier problem for you to mitigate from your end. One thing you can attempt in the meantime is to provide your own sub-orchestration instance id: if there is a bug in our logic for creating unique sub-orchestrator ids, then providing your own might fix the problem.
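
For example, something along these lines in the parent orchestration (just a sketch; the id scheme here is only an example):

// Sketch: derive an explicit, deterministic instance id for each sub-orchestration
// instead of relying on the auto-generated one. The naming scheme is just an example.
var subOrchestratorId = $"{context.InstanceId}:upsert-item:{item.RowNumber}";

var response = await context.CallSubOrchestratorAsync<ActivityResponse<Item>>(
    nameof(UpsertItemOrchestrator),
    subOrchestratorId,          // explicit sub-orchestration instance id
    (fileId, item));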

The lingering activity executions after the orchestration finishes, while annoying, should at least be mitigated by the idempotency check you have implemented. I will tackle that second.

@ConnorMcMahon
Contributor

ConnorMcMahon commented Oct 15, 2019

@dantheman999301

We think we have found what happened (at least with orchestration "48617f37-1603-428b-ae1a-1b758322d6a1").

The problem(s):
Bear with me, as this is a deep dive into how the Azure Storage implementation of Durable Task works behind the scenes. If you just want a fix, you can skip below.

When your parent orchestration scheduled ~1800 sub-orchestrations, what happened behind the scenes is that we first sent all of those start sub-orchestration messages to the control queue, and then updated the instance history for the parent orchestration so that future replays of the parent orchestration know those messages have already been sent. These can't be done in the same transaction, as Azure Storage does not allow transactions across Queues and Tables. We send the queue messages first, and then update the history table, to ensure we don't drop messages if a failure happens between the two operations.

If the race condition hits where the process fails between the queue messages being sent and us updating the instance history table, future replays of the orchestration will resend those messages. Normally, this is fine, because we handle duplicate events in most scenarios. However, for start orchestration messages, we unfortunately overwrite the existing history and start the new orchestration from scratch. We are looking to change this behavior in Durable Functions 2.0, but this is a breaking change. We hopefully will backport it to Durable Functions 1.8.4 as an optional configuration via host.json.

As far as why you are hitting this race condition, it looks like it has to do with this deadlock in the Storage SDK. I can't confirm this without a memory dump, which is hard to get for a consumption function app; however, it seems very likely based on the behavior I am seeing. I see the first 11 batches of the history table update (we have to do 1800 requests over 18+ batches, because we can only update 100 rows at a time), but after 11 the thread seems to get stuck. Then, when other workers grab the partition and resume execution, they resend the orchestration messages. We have a partial fix in the works for this, but it just reboots the worker when we get stuck in this deadlock. Unfortunately, that means the history table will remain only partially updated, which means you will still see duplicate sub-orchestrations.

When the duplicate sub-orchestration starts up, we unfortunately can get different values for context.NewGuid(). This is because it internally depends on context.CurrentUtcDateTime, which is different now that the history times have been reset for the duplicate sub-orchestration. I wish we could change this behavior as well, but that would also be a breaking change.
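
To illustrate the dependency (this is conceptual only, not the actual implementation):

// Conceptual illustration only -- not the real implementation. The point is that
// the deterministic GUID is derived from replay state that includes
// CurrentUtcDateTime, so a restarted sub-orchestration with fresh history
// timestamps produces different GUIDs for the "same" call.
int newGuidCounter = 0;
string seed = $"{context.InstanceId}_{context.CurrentUtcDateTime:o}_{newGuidCounter++}";
Guid guid = CreateDeterministicGuid(seed); // hypothetical name-based (hash) GUID helper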

TLDR: A series of design decisions/bugs in the various dependencies of Durable Functions all aligned to make context.NewGuid() unreliable for your sub-orchestrations. Some of these behaviors/bugs will be changed in Durable Functions 2.0 (which will GA in the next month), but because many of them are breaking changes, we may not be able to backport all of these fixes to Durable Functions 1.0.

Workarounds:
The crux of the issue is that, due to the above limitations, we cannot reliably trust the call to context.NewGuid() in your sub-orchestrations. If you create the durable GUID in your parent orchestration and then pass it in as an input to your sub-orchestration, you can always rely on it for your idempotency checks.
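
In code, that workaround looks roughly like this (based on the snippets you shared above):

// --- Parent orchestration: create the GUID where it can be trusted ---
var itemOperationId = context.NewGuid();
var result = await context.CallSubOrchestratorAsync<ActivityResponse<Item>>(
    nameof(UpsertItemOrchestrator), (fileId, itemOperationId, item));

// --- UpsertItemOrchestrator: read the id from the input instead of calling NewGuid() ---
var (fileId, itemOperationId, item) = context.GetInput<(Guid, Guid, Item)>();
var response = await context.CallActivityWithRetryAsync<ActivityResponse<Item>>(
    nameof(Activities.UpsertItem),
    new RetryOptions(TimeSpan.FromSeconds(1), 3),
    (fileId, itemOperationId, item));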

Hope that helps, and sorry for the info dump. Let me know if you need any clarifications.

@deadwards90
Author

@ConnorMcMahon massive thanks for the thorough response, I can't imagine that was easy to track down! No need to apologise for the info dump, that's all good information.

We actually thought about doing the context.NewGuid() in the parent orchestrator as a fix but wanted to wait to see what you found before continuing in case you found something that was an "easy" fix.

@sebastianburckhardt
Collaborator

I think some of the changes we are planning (Azure/durabletask#324) that reverse the persistence order (save history first, then send messages) would also fix this problem.

For example, suborchestrations would be guaranteed to resume instead of starting over (because the algorithm detects duplicate ExecutionStartedEvent and does not replace the original suborchestration).

@ConnorMcMahon
Contributor

This should now be at least partially resolved in Durable 2.x, since we no longer overwrite running orchestration instances when StartNewAsync() is called with an existing orchestration instance id. That alone should be enough to stop this cascading sequence of events.
