
Message loss due to race conditions with ContinueAsNew #67

Closed
cgillum opened this issue Sep 22, 2017 · 16 comments
Labels
bug dtfx fix-ready Indicates that an issue has been fixed and will be available in the next release.

@cgillum
Member

cgillum commented Sep 22, 2017

Summary

Message loss or instance termination may be observed if an orchestrator completes after calling ContinueAsNew and subsequently processes any other message before restarting.

Details

When the orchestrator completes execution after ContinueAsNew is called, its status is internally marked as "completed" and a new message is enqueued to restart it. In the window between completion and restart, other messages may arrive (for example, raised events or termination messages). Because the internal state of the orchestration is "completed", those messages will be dropped. The DTFx runtime may also terminate the instance, claiming possible state corruption.
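The race window can be illustrated with a minimal simulation (plain Python, not the DTFx API; every name below is made up for illustration):

```python
# Minimal simulation of the race described above. This is NOT the
# Durable Task Framework API; all names here are illustrative.

class Orchestration:
    def __init__(self):
        self.state = "Running"
        self.delivered = []
        self.dropped = []

    def continue_as_new(self):
        # Completion and restart are two separate steps with a gap between them.
        self.state = "Completed"    # step 1: execution marked completed

    def restart(self):
        self.state = "Running"      # step 2: a new execution picks up

    def deliver(self, message):
        if self.state == "Completed":
            # Messages arriving in the completion/restart window are dropped,
            # because the runtime sees a completed instance.
            self.dropped.append(message)
        else:
            self.delivered.append(message)

orch = Orchestration()
orch.deliver("incr")       # delivered: instance is running
orch.continue_as_new()     # instance completes, restart message enqueued
orch.deliver("incr")       # lost: arrives inside the race window
orch.restart()             # new execution starts too late
orch.deliver("incr")       # delivered again
print(orch.delivered, orch.dropped)  # ['incr', 'incr'] ['incr']
```

The key point is that completion and restart are not atomic, so any message routed to the instance in between is evaluated against a "completed" state.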

Repro

  1. Start with the counter sample
  2. Create a new instance
  3. Call RaiseEventAsync("operation", "incr") multiple times without waiting
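The counter sample referenced in step 1 follows the eternal-orchestration shape sketched below (illustrative Python with a stand-in context object; the real sample is C#, and these method names are not the actual API):

```python
import asyncio

# Illustrative sketch of the counter sample's control flow. FakeContext is a
# stand-in for an orchestration context, not a real Durable Functions API.

class FakeContext:
    def __init__(self, initial_input, events):
        self._input = initial_input
        self._events = list(events)
        self.new_input = None          # records the ContinueAsNew payload

    def get_input(self):
        return self._input

    async def wait_for_external_event(self, name):
        return self._events.pop(0)     # next queued "operation" event

    def continue_as_new(self, new_input):
        self.new_input = new_input     # restart with updated state

async def counter_orchestrator(context):
    counter = context.get_input() or 0
    operation = await context.wait_for_external_event("operation")
    if operation == "incr":
        counter += 1
    elif operation == "decr":
        counter -= 1
    # Each event triggers a full complete/restart cycle -- one race window each,
    # which is why rapid-fire events in step 3 hit the bug.
    context.continue_as_new(counter)

ctx = FakeContext(0, ["incr"])
asyncio.run(counter_orchestrator(ctx))
print(ctx.new_input)  # 1
```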

Expected:
Calling ContinueAsNew multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.

Workaround:
Have the client wait a few seconds before sending events that may cause the orchestrator to call ContinueAsNew. This gives the instance time to get a new execution ID.
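The client-side workaround can be sketched as follows (illustrative Python; `FakeClient` and `raise_event` are stand-ins, not the real DurableOrchestrationClient API, and the delay value is an assumption since the thread only says "a few seconds"):

```python
import time

# Sketch of the workaround: pause between event sends so the instance can
# finish its restart and pick up a new execution ID. All names are stand-ins.

class FakeClient:
    def __init__(self):
        self.raised = []

    def raise_event(self, instance_id, name, value):
        self.raised.append((instance_id, name, value))

def raise_event_with_delay(client, instance_id, value, delay_seconds=0.01):
    # delay_seconds is an assumed tuning knob; in practice this would be
    # on the order of seconds, per the workaround above.
    client.raise_event(instance_id, "operation", value)
    time.sleep(delay_seconds)  # give the instance time to restart

client = FakeClient()
for _ in range(3):
    raise_event_with_delay(client, "counter-1", "incr")
print(len(client.raised))  # 3
```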

@ericleigh007

I guess a workaround to make things a bit more reliable is to use a while or for loop in the orchestrator, thereby staving off the ContinueAsNew until N events have been processed.
Otherwise, Durable Functions are really great stuff.
When do you think this extreme fragility in one of the guiding principles of Durable Functions will be addressed? Do we need major changes to function execution, changes within Durable Functions, or something else?

Thanks again for the great work.
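The batching idea above can be sketched as follows (illustrative Python with a stand-in context, not the real API; `BATCH_SIZE` is an assumed tuning knob):

```python
import asyncio

# Sketch of the batching workaround: handle N events per execution and only
# then call ContinueAsNew, so there is one complete/restart window per N
# events instead of one per event. All names here are stand-ins.

BATCH_SIZE = 3  # assumed value, not from the thread

class FakeContext:
    def __init__(self, initial_input, events):
        self._input = initial_input
        self._events = list(events)
        self.new_input = None

    def get_input(self):
        return self._input

    async def wait_for_external_event(self, name):
        return self._events.pop(0)

    def continue_as_new(self, new_input):
        self.new_input = new_input

async def batched_counter(context):
    counter = context.get_input() or 0
    for _ in range(BATCH_SIZE):          # stave off ContinueAsNew...
        if (await context.wait_for_external_event("operation")) == "incr":
            counter += 1
    context.continue_as_new(counter)     # ...until N events are handled

ctx = FakeContext(0, ["incr", "incr", "incr"])
asyncio.run(batched_counter(ctx))
print(ctx.new_input)  # 3
```

This shrinks the exposure but does not eliminate it: the race window still exists once per batch.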

@Hassmann

Yes, please give us a solution. I am using a singleton and it is not reliable. If this is not handled, I would have to run a dedicated web app service for a very small task. Or please give an indication if you do not plan to solve this in the near future.

@cgillum
Member Author

cgillum commented Nov 11, 2017

This one definitely needs to be solved, and it's high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

@christiansparre

Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function :(

@cgillum
Member Author

cgillum commented Dec 12, 2017

I spent more time looking at this and experimenting with different fixes. Ultimately, I came to the conclusion that fixing this would be far more difficult than I originally anticipated. The short version is that the Durable Task Framework just wasn't designed for RPC/actor-style messaging workloads, and this issue is one example of why.

This is pretty painful (and a little embarrassing), but I think the actor-style pattern will need to be removed from the list of supported patterns. If it comes back, most likely it will need to come back in a different form and with a different backend implementation - i.e. something other than the Durable Task Framework in its current form. I've received the most critical feedback on this pattern from trusted sources (e.g. awkward programming model and confusing performance expectations), so I feel deprecating it is the right decision moving forward.

I want to be clear though. WaitForExternalEvent, ContinueAsNew, and "stateful singletons" will continue to be fully supported. This is really about clarifying which patterns can be implemented using these tools. For example, WaitForExternalEvent is still perfectly valid for the "human interaction" pattern. ContinueAsNew is also perfectly valid for implementing "eternal orchestrations". The pattern which is broken is the "actor" pattern where arbitrary clients send event messages at arbitrary times to a running instance.

I'll make a point of updating the samples and the documentation to reflect this change of direction. Let me know if there are any questions I can answer or clarifications I can provide.

@christiansparre

christiansparre commented Dec 12, 2017

I'm sorry to hear that @cgillum, this was really one of the missing pieces that would allow us to move a quite complex message processing application over to functions. We have a single part of the application where we need to manage concurrency pretty strictly, but the other 90% of the application is perfect for functions.

I really hope this comes back in one form or the other because it would be a pretty strong capability in the functions world and would simplify my life quite massively :)

Thank you for explaining your reasoning and keep up the good work!

@SimonLuckenuik

Thanks for the update @cgillum, I hope the Actor concept will come back in one form or another, this was enabling very interesting and complex scenarios.

@Hassmann

Thanks, Chris. It's very valuable to have clarity, no matter the news.

I could work around the issue so far and will follow this path then.
Performance and scalability are still a concern, though. So many seconds until a scheduled function is actually executed...
Could we get in contact to put my mind at ease with a few questions?

Andreas Hassmann.

hello@tezos.blue

@jedjohan

Thanks for the info Chris. A bit sad to see as I thought the actor pattern was the most interesting one. I ran into the problem for a small testing/learning e-shop project. When I "spam" the "add product to cart"-button some really strange things happen. Hope you find a solution :)

@cfe84

cfe84 commented Mar 5, 2018

Adding to the "that's really sad" pile, this would have been SUCH a differentiator against Lambdas.

@cgillum
Member Author

cgillum commented Apr 16, 2018

I've done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue.

FYI @gled4er

@gled4er
Collaborator

gled4er commented Apr 17, 2018

Hello @cgillum,

Thank you very much!

@SimonLuckenuik

SimonLuckenuik commented May 28, 2018

@cgillum, just thinking out loud: maybe ContinueAsNew could fetch the remaining events (based on the previous internal instance ID) just after starting the new orchestrator, making sure no events are lost?

@SimonLuckenuik

Or maybe add a new API to fetch the "waiting but not processed" events, so that we can retrieve them and provide them as input to ContinueAsNew?
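The proposed API shape might look like the sketch below. Everything here is hypothetical — no such `get_pending_events` API exists in the framework at this point in the thread; it only illustrates draining queued events into the ContinueAsNew input:

```python
# Hypothetical sketch of the suggestion above: drain any queued-but-unprocessed
# events and fold them into the input passed to ContinueAsNew, so the next
# execution can replay them. FakeContext and get_pending_events are invented.

class FakeContext:
    def __init__(self, pending):
        self._pending = list(pending)
        self.new_input = None

    def get_pending_events(self):              # hypothetical API
        drained, self._pending = self._pending, []
        return drained

    def continue_as_new(self, new_input):
        self.new_input = new_input

ctx = FakeContext(pending=["incr", "incr"])
state = {"counter": 5, "replay": ctx.get_pending_events()}
ctx.continue_as_new(state)   # next execution replays the drained events
print(ctx.new_input)
```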

@cgillum cgillum modified the milestones: v1.8 Release, v1.7.2 Release Jan 22, 2019
@cgillum cgillum added the fix-ready Indicates that an issue has been fixed and will be available in the next release. label Feb 9, 2019
@cgillum cgillum modified the milestones: v1.7.2 Release, v1.8 Release Mar 4, 2019
@cgillum
Member Author

cgillum commented Mar 16, 2019

Resolved in v1.8.0 release.

@cgillum cgillum closed this as completed Mar 16, 2019
@cfe84

cfe84 commented Mar 17, 2019

YOU, SIR, ARE THE BEST! \o/
