[Feature Request] Eager Workflow Task Dispatch on SDKs #242

Open
5 of 6 tasks
Spikhalskiy opened this issue Mar 7, 2023 · 0 comments
Labels
enhancement New feature or request


Objective

To reduce the latency between a WorkflowClient#start call and the dispatch of the first workflow task to a worker, a new Server feature was implemented that performs an "Eager Workflow Task Dispatch" on the start call.
See relevant Server PRs for details:
temporalio/temporal#3835
temporalio/temporal#3928

TL;DR
A WorkflowClient that is aware of an existing local workflow worker can request an eager Workflow Task on the start call and get the first Workflow Task of the Workflow Execution back immediately in the start call response. This allows the Server to:

  • skip matching for this workflow task
  • use one transaction instead of two for persisting both the WorkflowExecutionStarted+WorkflowTaskScheduled events and WorkflowTaskStarted event.

Start Call

sequenceDiagram
    actor C as UserCode
    participant Stub as WorkflowStub
    participant WC as WorkflowClient
    participant WF as WorkerFactory 
    participant W as Worker
    participant S as Server

    C->>WF: getWorkflowClient
    C->>+WC: start(eagerDispatch=true[default])
    WC->>WF: getWorker(taskQueue)
    alt there is a worker for the task queue
        WC->>+W: reserveWorkflowExecutor
        W-->>-WC: wftDispatchHandle [optional]

        opt if reserved a worker slot
            rect rgba(191, 223, 255, 0.4)  
                WC->>+S: start(eagerDispatch=true)
                S-->>-WC: eager WFT [optional]
                opt Server gave us back an eager task
                    WC->>W: wftDispatchHandle.dispatch(wft)
                  
                end
            end    

            WC->>WC: wftDispatchHandle.release()
            Note right of WC: .release() should be a no-op if .dispatch(wft) was called
            
        end
    else no worker or no free worker slot
        rect rgba(250, 50, 50, 0.3)
            WC->>S: start(eagerDispatch=false)
        end   
    end
    WC-->>-C: WorkflowStub(execution)
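The client side of the start flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the SDK implementation: `WftDispatchHandle`, `EagerWorker`, `StartServer`, and `EagerStartSketch` are hypothetical stand-ins, and the workflow task is modeled as a plain `String`.

```java
import java.util.Optional;

// All types below are hypothetical stand-ins for illustration; they are not the Temporal SDK API.
interface WftDispatchHandle {
  boolean dispatch(String workflowTask); // returns false if the worker declines the task
  void release();                        // expected to be a no-op after a successful dispatch()
}

interface EagerWorker {
  // Empty result means the worker declined the reservation (no slot, not active, ...).
  Optional<WftDispatchHandle> reserveWorkflowExecutor();
}

interface StartServer {
  // Returns the first workflow task only if the server honored the eager request.
  Optional<String> start(String workflowId, boolean requestEagerDispatch);
}

final class EagerStartSketch {
  static String start(String workflowId, Optional<EagerWorker> localWorker, StartServer server) {
    Optional<WftDispatchHandle> reservation =
        localWorker.flatMap(EagerWorker::reserveWorkflowExecutor);
    if (reservation.isEmpty()) {
      // No local worker or no free slot: fall back to a regular start.
      server.start(workflowId, false);
      return workflowId;
    }
    WftDispatchHandle handle = reservation.get();
    try {
      // The server may still choose not to return an eager task.
      server.start(workflowId, true).ifPresent(handle::dispatch);
    } finally {
      // Always release; the handle ignores this if dispatch() already consumed the slot.
      handle.release();
    }
    return workflowId;
  }
}
```

Releasing in a `finally` block keeps the reserved slot from leaking when the start call throws or when the Server returns no eager task.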

WorkflowClient awareness of WorkerFactory

  1. WorkerFactory should register itself on the WorkflowClient as the last step of its start and deregister itself as the first step of its shutdown.
  2. The WorkflowClient implementation should maintain a set of registered WorkerFactory instances.
  3. When routing an eager task, WorkflowClient MUST try the registered WorkerFactory instances in random order.
    1. It’s not enough to select just one random WorkerFactory, as its corresponding worker can be paused or enter a shutdown state right after selection. If this happens, Worker#reserveWorkflowExecutor will return null or another token that signals an unsuccessful reservation, and the next random WorkerFactory SHOULD be tried in an attempt to get the reservation.
    2. In normal use cases, WorkflowClient should have one WorkerFactory at a time, so the performance of this random selection is not that important. Legitimate situations in which WorkflowClient has several registered WorkerFactory instances:
      1. A user “restarts” a WorkerFactory with a different set of workers or different options. Users would typically start the new WorkerFactory first and shut down the old one after that, which leads to a period of time with two active Worker fleets.
      2. Users run several WorkerFactory instances as a means of “horizontal” scaling. This doesn’t make much sense, since users should scale workers up instead, but the current API allows it and it is not an illegitimate approach. We don’t want all eager tasks from a WorkflowClient to be dispatched to only one WorkerFactory out of all those registered on the WorkflowClient.
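The registration and random-order selection rules above could look roughly like this. It is a sketch under assumptions: `EagerDispatcherRegistry` and its methods are hypothetical names, the factory type is left generic, and the reservation attempt is passed in as a function instead of calling a real Worker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry kept by the client; names are illustrative, not the SDK's API.
final class EagerDispatcherRegistry<F> {
  private final Set<F> factories = ConcurrentHashMap.newKeySet();

  void register(F factory) { factories.add(factory); }      // called as the last step of factory start
  void deregister(F factory) { factories.remove(factory); } // called as the first step of factory shutdown

  // Tries every registered factory in random order and returns the first successful reservation.
  <R> Optional<R> tryReserve(Function<F, Optional<R>> reserve) {
    List<F> candidates = new ArrayList<>(factories); // snapshot, so concurrent (de)registration is safe
    Collections.shuffle(candidates);
    for (F factory : candidates) {
      Optional<R> reservation = reserve.apply(factory);
      if (reservation.isPresent()) {
        return reservation;
      }
    }
    return Optional.empty(); // nobody could take the task; the caller falls back to a regular start
  }
}
```

Shuffling the whole snapshot, instead of picking a single random element, covers the case where the chosen factory's worker was paused or shut down right after selection.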

Eager Dispatch Flow implementation notes

  1. WorkflowOptions gets a disableEagerExecution flag that is “false” by default.
  2. Eager Dispatch is “best effort”. If it doesn’t go through, an old-style dispatch is used. `WorkflowOptions#disableEagerExecution == false` is not a guarantee of an eager dispatch.
    1. The same instance of WorkflowClient that was used to create the WorkerFactory SHOULD be used for the WorkflowClient#start call for it to be performed with a local eager dispatch. This WorkflowClient MAY also be obtained from the WorkerFactory.

    2. The corresponding worker MAY decline eager dispatch for any reason by

      1. Not providing a WFTDispatchHandle on Worker#reserveWorkflowExecutor call. WorkflowClient shouldn’t request eager dispatch from the Server then.
      2. Responding with false to WFTDispatchHandle#dispatch call on an already obtained reservation.

      It SHOULD decline if

      1. there are no free workflow task executor slots
      2. it is in any state other than ACTIVE (the states considered are NOT_STARTED, ACTIVE, PAUSED, SHUTDOWN, TERMINATED)
      3. it is unhealthy (definition may vary by SDK)
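A worker-side sketch of these decline rules, assuming a semaphore models the workflow task executor slots. `EagerWorkerSketch` and its `Handle` are hypothetical, the health check is omitted because its definition varies by SDK, and the task runs inline for brevity where a real worker would hand it to its executor.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical worker-side sketch; the states mirror the text, the class is not a real SDK type.
final class EagerWorkerSketch {
  enum State { NOT_STARTED, ACTIVE, PAUSED, SHUTDOWN, TERMINATED }

  volatile State state = State.NOT_STARTED;
  private final Semaphore slots;

  EagerWorkerSketch(int executorSlots) { this.slots = new Semaphore(executorSlots); }

  int availableSlots() { return slots.availablePermits(); }

  // Declines (empty result) when the worker is not ACTIVE or has no free slot.
  Optional<Handle> reserveWorkflowExecutor() {
    if (state != State.ACTIVE || !slots.tryAcquire()) {
      return Optional.empty();
    }
    return Optional.of(new Handle());
  }

  final class Handle {
    private final AtomicBoolean consumed = new AtomicBoolean();

    // Returns false to decline; the slot then stays reserved until release() is called.
    boolean dispatch(Runnable workflowTask) {
      if (state != State.ACTIVE || !consumed.compareAndSet(false, true)) {
        return false;
      }
      try {
        workflowTask.run(); // simplification: executed inline instead of on the worker's executor
      } finally {
        slots.release();    // the slot is returned once the task has been processed
      }
      return true;
    }

    // Returns the slot only if dispatch() did not already consume the reservation.
    void release() {
      if (consumed.compareAndSet(false, true)) {
        slots.release();
      }
    }
  }
}
```

The single `consumed` flag is what makes `release()` a no-op after a successful `dispatch()`, so the slot is returned exactly once in either path.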

Local Workflow Completion

The original proposal included the concept of Local Workflow Completion. The idea was that a worker after receiving an eagerly dispatched workflow task may return a workflow completion promise that will be completed once the workflow is completed locally.

Later it was taken out because it became clear that this concept would provide little to no benefit. Reasoning: to keep local completion consistent with the persisted history, the worker should complete the promise only after receiving the Server's response to the workflow task completion request. There is currently no reason to assume that this response would arrive much earlier than the completion of a long poll opened by the WorkflowClient for a workflow completion event. Both should typically arrive on a single server → client trip.

SDK support
