Message loss due to race conditions with ContinueAsNew #67
Comments
I guess the workaround to make things a bit more reliable is to use a while or for loop in the orchestrator, thereby staving off the ContinueAsNew call until N events have been processed. Thanks again for the great work.
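A minimal sketch of the buffering approach suggested above, assuming Durable Functions 1.x and hypothetical names ("BufferedCounter", "operation", and the batch size); the orchestrator drains a fixed number of external events in a loop before calling ContinueAsNew, so restarts happen far less often:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class BufferedCounter
{
    [FunctionName("BufferedCounter")]
    public static async Task Run(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        int counter = context.GetInput<int>();

        // Process a batch of events before restarting, instead of
        // calling ContinueAsNew after every single event.
        const int EventsPerGeneration = 100; // hypothetical batch size
        for (int i = 0; i < EventsPerGeneration; i++)
        {
            int amount = await context.WaitForExternalEvent<int>("operation");
            counter += amount;
        }

        // Only now restart with the accumulated state.
        context.ContinueAsNew(counter);
    }
}
```

Note that restarting less often does not eliminate the race described in this issue; it only narrows the window in which it can occur.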
Yes, please give us a solution. I am using a singleton and it is not reliable. If this is not handled, I would have to run a separate web app service just for a very small task. Or please give an indication that you do not plan to solve this in the near future.
This one definitely needs to be solved, and it's high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function :(
I spent more time looking at this and experimenting with different fixes. Ultimately, I came to the conclusion that fixing this would be far more difficult than I originally anticipated. The short version is that the Durable Task Framework just wasn't designed for RPC/actor-style messaging workloads, and this issue is one example of why.

This is pretty painful (and a little embarrassing), but I think the actor-style pattern will need to be removed from the list of supported patterns. If it comes back, it will most likely need to come back in a different form and with a different backend implementation, i.e. something other than the Durable Task Framework in its current form. I've received the most critical feedback on this pattern from trusted sources (e.g. awkward programming model and confusing performance expectations), so I feel deprecating it is the right decision moving forward.

I want to be clear, though: I'll make a point of updating the samples and the documentation to reflect this change of direction. Let me know if there are any questions I can answer or clarifications I can provide.
I'm sorry to hear that @cgillum. This was really one of the missing pieces that would allow us to move quite a complex message processing application over to functions. We have a single part of the application where we need to manage concurrency pretty strictly, but the other 90% of the application is perfect for functions. I really hope this comes back in one form or another because it would be a pretty strong capability in the functions world and would simplify my life massively :) Thank you for explaining your reasoning and keep up the good work!
Thanks for the update @cgillum. I hope the actor concept will come back in one form or another; it enabled some very interesting and complex scenarios.
Thanks, Chris. It's very valuable to have clarity, no matter the news. I've been able to work around the issue so far and will continue down that path. Andreas Hassmann.
Thanks for the info, Chris. A bit sad to see, as I thought the actor pattern was the most interesting one. I ran into the problem on a small testing/learning e-shop project: when I "spam" the "add product to cart" button, some really strange things happen. Hope you find a solution :)
Adding to the "that's really sad" pile, this would have been SUCH a differentiator against Lambdas.
I've done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue. FYI @gled4er
Hello @cgillum, Thank you very much!
@cgillum, just thinking out loud: maybe add some behavior to ContinueAsNew that fetches the remaining events (based on the previous internal instance ID) just after starting the new orchestrator, making sure no events are lost?
Or maybe add a new API to fetch the "waiting but not processed" events so that we can retrieve them and pass them as input to ContinueAsNew?
Resolved in the v1.8.0 release.
YOU, SIR, ARE THE BEST! \o/
Summary

Message loss or instance termination may be observed if an orchestrator completes after calling ContinueAsNew and subsequently processes any other message before restarting.

Details

When the orchestrator completes execution after ContinueAsNew is called, the status is internally marked as "completed" and a new message is enqueued to restart it. During this window between completion and restart, it's possible for other messages to arrive (for example, raised events or termination messages). Because the internal state of the orchestration is completed, those messages will be dropped. It's also possible for the DTFx runtime to terminate the instance, claiming possible state corruption.

Repro
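The race can be reproduced with a sketch along these lines, assuming Durable Functions 1.x and hypothetical names ("Counter", "operation", and the instance ID): an actor-style singleton orchestrator that calls ContinueAsNew after every event, plus a client that raises several events in quick succession.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class ContinueAsNewRepro
{
    // Actor-style singleton: restarts itself after every event.
    [FunctionName("Counter")]
    public static async Task Counter(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        int state = context.GetInput<int>();
        int amount = await context.WaitForExternalEvent<int>("operation");

        // The instance is briefly marked "completed" here before the
        // restart message is processed; events arriving in that window
        // can be dropped.
        context.ContinueAsNew(state + amount);
    }

    // Client that raises events faster than the orchestrator can restart.
    public static async Task SpamEvents(DurableOrchestrationClient client)
    {
        const string instanceId = "singleton-counter"; // hypothetical fixed ID
        await client.StartNewAsync("Counter", instanceId, 0);

        for (int i = 0; i < 10; i++)
        {
            // No delay between events: some of these are likely to land
            // while the previous execution is completing.
            await client.RaiseEventAsync(instanceId, "operation", 1);
        }
    }
}
```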
Expected:
Calling ContinueAsNew multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.
Workaround:
Have the client wait a few seconds before sending events that may cause the orchestrator to do a ContinueAsNew. This gives the instance time to get a new execution ID.
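A sketch of that client-side workaround, reusing the hypothetical names from the repro above; the delay length is arbitrary and only shrinks the race window rather than closing it.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class WorkaroundClient
{
    public static async Task RaiseEventsSlowly(DurableOrchestrationClient client)
    {
        const string instanceId = "singleton-counter"; // hypothetical fixed ID

        for (int i = 0; i < 10; i++)
        {
            await client.RaiseEventAsync(instanceId, "operation", 1);

            // Give the orchestrator time to finish its ContinueAsNew restart
            // and pick up a new execution ID before the next event arrives.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```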