
Message loss due to race conditions with ContinueAsNew #67

Closed
cgillum opened this issue Sep 22, 2017 · 16 comments
Labels
bug dtfx fix-ready Indicates that an issue has been fixed and will be available in the next release.

@cgillum
Member

cgillum commented Sep 22, 2017

Summary

Message loss or instance termination may be observed if an orchestrator completes after calling ContinueAsNew and subsequently processes any other message before restarting.

Details

When the orchestrator completes execution after ContinueAsNew is called, its status is internally marked as "completed" and a new message is enqueued to restart it. In the window between completion and restart, other messages may arrive (for example, raised events or termination messages). Because the internal state of the orchestration is "completed", those messages will be dropped. The DTFx runtime may also terminate the instance, claiming possible state corruption.
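The race window can be illustrated with a minimal simulation (plain Python, not the DTFx API; every name below is made up for illustration):

```python
# Minimal simulation of the race described above. This is NOT the
# Durable Task Framework API; all names here are illustrative.

class Orchestration:
    def __init__(self):
        self.state = "Running"
        self.delivered = []
        self.dropped = []

    def continue_as_new(self):
        # Completion and restart are two separate steps with a gap between them.
        self.state = "Completed"    # step 1: execution marked completed

    def restart(self):
        self.state = "Running"      # step 2: a new execution picks up

    def deliver(self, message):
        if self.state == "Completed":
            # Messages arriving in the completion/restart window are dropped,
            # because the runtime sees a completed instance.
            self.dropped.append(message)
        else:
            self.delivered.append(message)

orch = Orchestration()
orch.deliver("incr")       # delivered: instance is running
orch.continue_as_new()     # instance completes, restart message enqueued
orch.deliver("incr")       # lost: arrives inside the race window
orch.restart()             # new execution starts too late
orch.deliver("incr")       # delivered again
print(orch.delivered, orch.dropped)  # ['incr', 'incr'] ['incr']
```

The key point is that completion and restart are not atomic, so any message routed to the instance in between is evaluated against a "completed" state.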

Repro

  1. Start with the counter sample
  2. Create a new instance
  3. Call RaiseEventAsync("operation", "incr") multiple times without waiting
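The counter sample referenced in step 1 follows the eternal-orchestration shape sketched below (illustrative Python with a stand-in context object; the real sample is C#, and these method names are not the actual API):

```python
import asyncio

# Illustrative sketch of the counter sample's control flow. FakeContext is a
# stand-in for an orchestration context, not a real Durable Functions API.

class FakeContext:
    def __init__(self, initial_input, events):
        self._input = initial_input
        self._events = list(events)
        self.new_input = None          # records the ContinueAsNew payload

    def get_input(self):
        return self._input

    async def wait_for_external_event(self, name):
        return self._events.pop(0)     # next queued "operation" event

    def continue_as_new(self, new_input):
        self.new_input = new_input     # restart with updated state

async def counter_orchestrator(context):
    counter = context.get_input() or 0
    operation = await context.wait_for_external_event("operation")
    if operation == "incr":
        counter += 1
    elif operation == "decr":
        counter -= 1
    # Each event triggers a full complete/restart cycle -- one race window each,
    # which is why rapid-fire events in step 3 hit the bug.
    context.continue_as_new(counter)

ctx = FakeContext(0, ["incr"])
asyncio.run(counter_orchestrator(ctx))
print(ctx.new_input)  # 1
```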

Expected:
Calling ContinueAsNew multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.

Workaround:
Have the client wait a few seconds before sending events that may cause the orchestrator to call ContinueAsNew. This gives the instance time to get a new execution ID.
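The client-side workaround can be sketched as follows (illustrative Python; `FakeClient` and `raise_event` are stand-ins, not the real DurableOrchestrationClient API, and the delay value is an assumption since the thread only says "a few seconds"):

```python
import time

# Sketch of the workaround: pause between event sends so the instance can
# finish its restart and pick up a new execution ID. All names are stand-ins.

class FakeClient:
    def __init__(self):
        self.raised = []

    def raise_event(self, instance_id, name, value):
        self.raised.append((instance_id, name, value))

def raise_event_with_delay(client, instance_id, value, delay_seconds=0.01):
    # delay_seconds is an assumed tuning knob; in practice this would be
    # on the order of seconds, per the workaround above.
    client.raise_event(instance_id, "operation", value)
    time.sleep(delay_seconds)  # give the instance time to restart

client = FakeClient()
for _ in range(3):
    raise_event_with_delay(client, "counter-1", "incr")
print(len(client.raised))  # 3
```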

@ericleigh007

I guess a workaround to make things a bit more reliable is to use a while or for loop in the orchestrator, thereby staving off the ContinueAsNew until N events have been processed.
Otherwise, Durable Functions are really great stuff.
When do you think this extreme fragility in one of the guiding principles of Durable Functions will be addressed? Do we need major changes to function execution, changes within Durable Functions, or something else?

Thanks again for the great work.
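The batching idea above can be sketched as follows (illustrative Python with a stand-in context, not the real API; `BATCH_SIZE` is an assumed tuning knob):

```python
import asyncio

# Sketch of the batching workaround: handle N events per execution and only
# then call ContinueAsNew, so there is one complete/restart window per N
# events instead of one per event. All names here are stand-ins.

BATCH_SIZE = 3  # assumed value, not from the thread

class FakeContext:
    def __init__(self, initial_input, events):
        self._input = initial_input
        self._events = list(events)
        self.new_input = None

    def get_input(self):
        return self._input

    async def wait_for_external_event(self, name):
        return self._events.pop(0)

    def continue_as_new(self, new_input):
        self.new_input = new_input

async def batched_counter(context):
    counter = context.get_input() or 0
    for _ in range(BATCH_SIZE):          # stave off ContinueAsNew...
        if (await context.wait_for_external_event("operation")) == "incr":
            counter += 1
    context.continue_as_new(counter)     # ...until N events are handled

ctx = FakeContext(0, ["incr", "incr", "incr"])
asyncio.run(batched_counter(ctx))
print(ctx.new_input)  # 3
```

This shrinks the exposure but does not eliminate it: the race window still exists once per batch.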

@Hassmann

Yes, please give us a solution. I am using a singleton and it is not reliable. If this is not handled, I would have to run a dedicated web app service for a very small task. Or please give an indication if you do not plan to solve this in the near future.

@cgillum
Member Author

cgillum commented Nov 11, 2017

This one definitely needs to be solved, and it's high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

@christiansparre

Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function :(

@cgillum
Member Author

cgillum commented Dec 12, 2017

I spent more time looking at this and experimenting with different fixes. Ultimately, I came to the conclusion that fixing this would be far more difficult than I originally anticipated. The short version is that the Durable Task Framework just wasn't designed for RPC/actor-style messaging workloads, and this issue is one example of why.

This is pretty painful (and a little embarrassing), but I think the actor-style pattern will need to be removed from the list of supported patterns. If it comes back, most likely it will need to come back in a different form and with a different backend implementation - i.e. something other than the Durable Task Framework in its current form. I've received the most critical feedback on this pattern from trusted sources (e.g. awkward programming model and confusing performance expectations), so I feel deprecating it is the right decision moving forward.

I want to be clear though. WaitForExternalEvent, ContinueAsNew, and "stateful singletons" will continue to be fully supported. This is really about clarifying which patterns can be implemented using these tools. For example, WaitForExternalEvent is still perfectly valid for the "human interaction" pattern. ContinueAsNew is also perfectly valid for implementing "eternal orchestrations". The pattern which is broken is the "actor" pattern where arbitrary clients send event messages at arbitrary times to a running instance.

I'll make a point of updating the samples and the documentation to reflect this change of direction. Let me know if there are any questions I can answer or clarifications I can provide.

@christiansparre

christiansparre commented Dec 12, 2017

I'm sorry to hear that @cgillum, this was really one of the missing pieces that would allow us to move a quite complex message processing application over to functions. We have a single part of the application where we need to manage concurrency pretty strictly, but the other 90% of the application is perfect for functions.

I really hope this comes back in one form or the other because it would be a pretty strong capability in the functions world and would simplify my life quite massively :)

Thank you for explaining your reasoning and keep up the good work!

@SimonLuckenuik

Thanks for the update @cgillum, I hope the Actor concept will come back in one form or another, this was enabling very interesting and complex scenarios.

@Hassmann

Thanks, Chris. It's very valuable to have clarity, no matter the news.

I could work around the issue so far and will follow this path then.
Performance and scalability are still a concern, though. So many seconds until a scheduled function is actually executed...
Could we get in contact to put my mind at ease with a few questions?

Andreas Hassmann.

hello@tezos.blue

@jedjohan

Thanks for the info Chris. A bit sad to see as I thought the actor pattern was the most interesting one. I ran into the problem for a small testing/learning e-shop project. When I "spam" the "add product to cart"-button some really strange things happen. Hope you find a solution :)

@cfe84

cfe84 commented Mar 5, 2018

Adding to the "that's really sad" pile, this would have been SUCH a differentiator against Lambdas.

@cgillum
Member Author

cgillum commented Apr 16, 2018

I've done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue.

FYI @gled4er

@gled4er
Collaborator

gled4er commented Apr 17, 2018

Hello @cgillum,

Thank you very much!

@SimonLuckenuik

SimonLuckenuik commented May 28, 2018

@cgillum, just thinking out loud: maybe ContinueAsNew could fetch the remaining events (based on the previous internal instance ID) just after starting the new orchestrator, making sure no events are lost?

@SimonLuckenuik

Or maybe add a new API to fetch the "waiting but not processed" events, so that we can retrieve them and provide them as input to ContinueAsNew?
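The proposed API shape might look like the sketch below. Everything here is hypothetical — no such `get_pending_events` API exists in the framework at this point in the thread; it only illustrates draining queued events into the ContinueAsNew input:

```python
# Hypothetical sketch of the suggestion above: drain any queued-but-unprocessed
# events and fold them into the input passed to ContinueAsNew, so the next
# execution can replay them. FakeContext and get_pending_events are invented.

class FakeContext:
    def __init__(self, pending):
        self._pending = list(pending)
        self.new_input = None

    def get_pending_events(self):              # hypothetical API
        drained, self._pending = self._pending, []
        return drained

    def continue_as_new(self, new_input):
        self.new_input = new_input

ctx = FakeContext(pending=["incr", "incr"])
state = {"counter": 5, "replay": ctx.get_pending_events()}
ctx.continue_as_new(state)   # next execution replays the drained events
print(ctx.new_input)
```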

@cgillum cgillum modified the milestones: v1.8 Release, v1.7.2 Release Jan 22, 2019
@cgillum cgillum added the fix-ready Indicates that an issue has been fixed and will be available in the next release. label Feb 9, 2019
@cgillum cgillum modified the milestones: v1.7.2 Release, v1.8 Release Mar 4, 2019
@cgillum
Member Author

cgillum commented Mar 16, 2019

Resolved in v1.8.0 release.

@cgillum cgillum closed this as completed Mar 16, 2019
@cfe84

cfe84 commented Mar 17, 2019

YOU, SIR, ARE THE BEST! \o/
