Lease stealing can result in duplicate function execution #382

Closed
cgillum opened this issue Jul 7, 2018 · 0 comments · Fixed by Azure/durabletask#360


cgillum commented Jul 7, 2018

Related to #344, if a lease is stolen from a worker instance, then any buffered messages associated with that leased partition need to be abandoned. Otherwise there is a greater risk of split-brain and possibly duplicate function execution.
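As a rough illustration of abandoning buffered messages when a lease is lost, here is a minimal Python sketch. It is not the framework's code: the `PartitionBuffer` class and its method names are hypothetical, and the "abandon" step is modeled as simply moving messages to a list that stands in for the queue.

```python
from collections import defaultdict


class PartitionBuffer:
    """Hypothetical in-memory buffer of prefetched messages, keyed by partition."""

    def __init__(self):
        self._buffered = defaultdict(list)
        self.abandoned = []  # stands in for messages released back to the queue

    def add(self, partition, message):
        self._buffered[partition].append(message)

    def on_lease_lost(self, partition):
        # Release everything buffered for the stolen partition instead of
        # processing it, so the new lease owner can fetch those messages.
        released = self._buffered.pop(partition, [])
        self.abandoned.extend(released)
        return released

    def pending(self, partition):
        return list(self._buffered.get(partition, []))
```

A worker would call `on_lease_lost("D")` as soon as it notices that partition D is no longer among its leases, before dispatching any of the buffered messages.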

Consider the following hypothetical scenario:

  1. Worker 1 holds leases A, B, C, and D
  2. An orchestration on partition D schedules 1,000 concurrent activity functions
  3. Worker 1 starts fetching response messages for the orchestration
  4. Worker 2 is added, steals leases C & D, and starts fetching the remaining response messages for the orchestration
  5. Workers 1 & 2 process response messages at the same time. Some of them may result in subsequent activity function execution.
  6. Worker 2 completes a batch of messages for the orchestration first
  7. Worker 1 then completes another batch of messages for the orchestration and schedules subsequent actions (if any), but is flagged for split-brain when it tries to commit to the orchestration history. At this point, the side effects have already been scheduled and can't be undone. The "TaskCompleted" messages that triggered this episode are then abandoned.
  8. Worker 2 picks up the abandoned "TaskCompleted" messages, processes them, and the subsequent activity functions are executed a second time.
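The sequence above can be reproduced with a toy model. This Python sketch is illustrative only: `History`, `process_batch`, and `simulate` are hypothetical names, the version counter stands in for whatever optimistic-concurrency check the history store performs, and "scheduling" a follow-up activity is modeled as appending to a list.

```python
class History:
    """Toy orchestration history guarded by an optimistic-concurrency version."""

    def __init__(self):
        self.version = 0

    def commit(self, expected_version):
        if expected_version != self.version:
            return False  # another worker committed first: split-brain detected
        self.version += 1
        return True


def process_batch(worker, history, read_version, messages, executed, queue):
    # Side effects (scheduling follow-up activities) happen before the commit.
    for msg in messages:
        executed.append((worker, "followup-for-" + msg))
    if history.commit(read_version):
        return True
    # Commit rejected: the trigger messages go back to the queue, but the
    # follow-up activities scheduled above cannot be undone.
    queue.extend(messages)
    return False


def simulate():
    history, executed, queue = History(), [], []
    v1 = history.version  # worker 1 reads the history
    v2 = history.version  # worker 2 reads the same history after stealing the lease
    process_batch("worker2", history, v2, ["TaskCompleted-2"], executed, queue)
    process_batch("worker1", history, v1, ["TaskCompleted-1"], executed, queue)
    # Worker 2 picks up the abandoned message and processes it again.
    retry = list(queue)
    queue.clear()
    process_batch("worker2", history, history.version, retry, executed, queue)
    return executed
```

In the list returned by `simulate()`, the follow-up for `TaskCompleted-1` appears twice, once from each worker, which corresponds to the duplicate execution in step 8.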

One way to avoid this without introducing pessimistic locking is to check the locally owned partitions before the commit step and proactively abandon the current work item if its partition is no longer owned. This is not a 100% guarantee, though, since there is still a small window of time in which a partition could be stolen.
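A minimal sketch of that check, again in Python and under stated assumptions: `WorkItem`, `try_complete`, and the `schedule`, `commit`, and `abandon` callbacks are hypothetical stand-ins, not the framework's API. Placing the check ahead of the side effects as well as the commit is a choice made for this sketch.

```python
class WorkItem:
    """Hypothetical unit of work: a batch of messages for one partition."""

    def __init__(self, partition, messages):
        self.partition = partition
        self.messages = messages


def try_complete(work_item, owned_partitions, schedule, commit, abandon):
    # Consult the locally known lease set before doing anything irreversible.
    if work_item.partition not in owned_partitions:
        abandon(work_item)  # release the messages for the new lease owner
        return False
    # A lease can still be stolen between the check above and the commit
    # below, so this narrows the race window without closing it.
    schedule(work_item)
    commit(work_item)
    return True
```

The check is a local set lookup, so it adds no extra storage round trip, which is why it remains only a best-effort guard rather than a lock.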

cgillum changed the title from "Lease stealing can result in long delays for orchestrator messages" to "Lease stealing can result in duplicate function execution" on Jul 7, 2018