Related to #344: if a lease is stolen from a worker instance, any buffered messages associated with that leased partition need to be abandoned. Otherwise there is a greater risk of split-brain and possibly duplicate function execution.
Consider the following hypothetical scenario:
Worker 1 holds leases A, B, C, and D
Orchestration on partition D schedules 1,000 concurrent activity functions
Worker 1 starts fetching response messages for the orchestration
Worker 2 is added, steals leases C & D, and starts fetching the remaining response messages for the orchestration
Workers 1 & 2 process response messages at the same time. Some of them may result in subsequent activity function execution.
Worker 2 completes a batch of messages for the orchestration first
Worker 1 completes another batch of messages for the orchestration second and schedules subsequent actions (if any), but then gets flagged for split-brain when it tries to commit to the orchestration history. At this point, the side effects have already been scheduled and can't be undone. The "TaskCompleted" messages which triggered this episode are then abandoned.
Worker 2 picks up the abandoned "TaskCompleted" messages, processes them, and the subsequent activity functions are executed a second time.
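To make the sequence above concrete, here is a minimal Python sketch of the race. It is not the actual storage provider code: `History`, `process`, `simulate`, and the message names are all hypothetical, and the version counter only stands in for whatever check flags split-brain at commit time.

```python
from dataclasses import dataclass


@dataclass
class History:
    """Hypothetical orchestration history guarded by a version counter."""
    version: int = 0

    def try_commit(self, loaded_version: int) -> bool:
        # The commit is rejected if another worker committed since we loaded.
        if loaded_version != self.version:
            return False  # split-brain detected
        self.version += 1
        return True


def process(worker, message, history, loaded_version, outbox):
    """Schedule the follow-up activity, then try to commit the history."""
    # The side effect is sent before the commit and cannot be undone.
    outbox.append((worker, f"followup:{message}"))
    return history.try_commit(loaded_version)


def simulate():
    history = History()
    outbox = []
    # Both workers load the same history version before either commits.
    v1 = history.version
    v2 = history.version
    # Worker 2 completes its batch first; its commit succeeds.
    process("worker2", "TaskCompleted-2", history, v2, outbox)
    # Worker 1 schedules its follow-up, then fails the commit.
    committed = process("worker1", "TaskCompleted-1", history, v1, outbox)
    if not committed:
        # The message is abandoned; worker 2 picks it up and processes it again.
        process("worker2", "TaskCompleted-1", history, history.version, outbox)
    return outbox


outbox = simulate()
duplicates = [name for _, name in outbox].count("followup:TaskCompleted-1")
print(duplicates)  # 2
```

The follow-up for `TaskCompleted-1` ends up in the outbox twice: once from worker 1, whose commit was rejected, and once from worker 2 after the message was abandoned.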
One way to avoid this without introducing pessimistic locking is to check the locally owned partitions before the commit step and proactively abandon the current work item if its partition is no longer owned. This is not a 100% guarantee, though, since there is still a small window of time in which a partition could be stolen.
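A rough sketch of that ownership check, in Python for brevity. `OwnedPartitions` and `complete_work_item` are hypothetical names, not existing APIs, and I am assuming the check runs before any outbound messages are sent as part of the commit step.

```python
class OwnedPartitions:
    """Hypothetical local view of the partitions this worker currently leases."""

    def __init__(self, partitions):
        self._owned = set(partitions)

    def on_lease_lost(self, partition):
        # Called when the worker learns that a lease was stolen or released.
        self._owned.discard(partition)

    def owns(self, partition):
        return partition in self._owned


def complete_work_item(owned, partition, schedule_side_effects, commit, abandon):
    """Abandon the work item up front if its partition is no longer owned."""
    if not owned.owns(partition):
        abandon()
        return "abandoned"
    # Not airtight: the lease can still be stolen between the check above
    # and the commit below, so commit-time detection is still needed.
    schedule_side_effects()
    commit()
    return "committed"


owned = OwnedPartitions({"A", "B", "C", "D"})
owned.on_lease_lost("C")
owned.on_lease_lost("D")

log = []
result_d = complete_work_item(
    owned, "D",
    lambda: log.append("scheduled"),
    lambda: log.append("committed"),
    lambda: log.append("abandoned"),
)
result_a = complete_work_item(
    owned, "A",
    lambda: log.append("scheduled"),
    lambda: log.append("committed"),
    lambda: log.append("abandoned"),
)
print(result_d, result_a, log)
```

The work item on stolen partition D is abandoned before any side effects go out, while the one on partition A proceeds as usual. This only narrows the window; it does not close it.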
cgillum changed the title from "Lease stealing can result in long delays for orchestrator messages" to "Lease stealing can result in duplicate function execution" on Jul 7, 2018