-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StateMachine rewrite #172
StateMachine rewrite #172
Conversation
@@ -49,7 +49,7 @@ | |||
next.newCall(method, callOptions)) { | |||
@Override | |||
public void sendMessage(ReqT message) { | |||
log.trace("Invoking " + method.getFullMethodName() + "with input: " + message); | |||
log.trace("Invoking \"" + method.getFullMethodName() + "\" with input: " + message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] dumb question (based on similar change above), when do we use single quote (') vs double quote (") for these types of messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guideline here. I would support coming up with one and making sure that all the log messages obey it.
@@ -42,13 +42,19 @@ | |||
TEMPORAL_METRICS_PREFIX + "workflow_task_schedule_to_start_latency"; | |||
public static final String WORKFLOW_TASK_EXECUTION_LATENCY = | |||
TEMPORAL_METRICS_PREFIX + "workflow_task_execution_latency"; | |||
// Total latency of a workflow task which can include multiple synchronous decision tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even with the comment, it is still not obvious to me what the difference is between execution latency and execution total latency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] also styling question on when to use "//" comment vs "/** */" comment delimiters, as is done below on lines 48, 51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed comments.
To understand this comment you need to know about workflow task heartbeat feature. Basically a workflow task when reporting completion can set ForceCreateNewWorkflowTask property to true on RespondWorkflowTaskCompletedRequest. In this case, the new workflow task is created immediately and returned as a response to the RespondWorkflowTaskCompletedRequest call. Such call acts as a heartbeat that extends the duration of a workflow task. Thus a single total execution latency of a workflow task can be larger than latencies of multiple such synchronously created tasks.
return elapsedTime; | ||
@Override | ||
public String toString() { | ||
return "ExecuteLocalActivityParameters{" + "activityTask=" + activityTask + '}'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] no single or double quote delimiters around activityTask here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also you can remove the first plus sign and concatenate the first two strings (e.g. "ExecuteLocalActivityParameters{activityTask=" + activityTask + "}"
|
||
Consumer<Exception> signalWorkflowExecution( | ||
Functions.Proc1<Exception> signalExternalWorkflowExecution( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why some of these methods have full comments (such as startChildWorkflow), but then some like this one (signalExternalWorkflowExecution) don't
// Only fail workflow task on the first attempt, subsequent failures of the same workflow task | ||
// should timeout. This is to avoid spin on the failed workflow task as the service doesn't | ||
// yet increase the retry interval. | ||
if (workflowTask.getAttempt() > 0) { | ||
if (workflowTask.getAttempt() > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully we added a unit test to cover this as this should have been caught when I was syncing to Alex's latest proto changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. We had a test that ensured that the worker doesn't spin on the task failure. But after I added an additional check that the task failure is reported exactly once I found that it wasn't doing so. Here is the new unit test and the fix. The unit test fails with attempts condition changed.
* Handles a single workflow task. | ||
* | ||
* @param workflowTask task to handle | ||
* @return true if new task should be created synchronously as local activities are still running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] what does it mean by return "true if new task should be created synchronously"? Does this mean not null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my earlier comment about force creating workflow tasks.
|
||
public WorkflowTaskResult( | ||
List<Command> commands, | ||
Map<String, WorkflowQueryResult> queryResults, | ||
boolean forceCreateNewWorkflowTask, | ||
boolean finalCommand) { | ||
boolean finalCommand, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we swapped the order to two arguments with the same type...hopefully all the callers have been updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it is error-prone. Changed WorkflowTaskResult to use the Builder pattern.
import io.temporal.api.enums.v1.CommandType; | ||
import io.temporal.api.enums.v1.EventType; | ||
|
||
class ActionOrEventType<Action> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name of class implies two possible configurations, but it looks like its ActionOrEventTypeOrCommandType (three)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored StateMachine nomenclature to match the standard one.
.add( | ||
State.SCHEDULE_COMMAND_CREATED, | ||
CommandType.COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK, | ||
State.SCHEDULE_COMMAND_CREATED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a typo / copy paste error (SCHEDULE_COMMAND_CREATED twice)? Maybe I'm interpreting this incorrectly? I assumed first line is current state, subsequent line is potential transition, second line is final state and last line is an additional callback to run. Or should we add another state between "CREATED" and "SCHEDULE_COMMAND_CREATED"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is practically saying that this event is not causing a state transition. We could introduce another event here, but it doesn't really bring any value as no action is executed.
State.STARTED_ACTIVITY_CANCEL_COMMAND_CREATED, | ||
ActivityStateMachine::createRequestCancelActivityTaskCommand) | ||
.add( | ||
State.STARTED_ACTIVITY_CANCEL_COMMAND_CREATED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this state and the cancel_command_Created state, we don't have transitions for a timed out event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added unit tests that demonstrate that these transitions are not possible. The basic idea is that the workflow code always executes in the context of an event loop. It causes activity timeouts to be delivered before the code that causes the timeout which generates the cancel command.
SCHEDULED_ACTIVITY_CANCEL_EVENT_RECORDED, | ||
STARTED_ACTIVITY_CANCEL_COMMAND_CREATED, | ||
STARTED_ACTIVITY_CANCEL_EVENT_RECORDED, | ||
COMPLETED_CANCEL_REQUESTED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where are these 3 *_CANCEL_REQUESTED states used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Removed.
Failure.newBuilder() | ||
.setActivityFailureInfo(failureInfo) | ||
.setCause(timedOut.getFailure()) | ||
.setMessage("Activity task timedOut") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have string constants file for these error messages so that its easy for a technical writer to review or so we can do translations (in the future)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you file an issue for this? I think it is a great idea, but we want to do it holistically with the whole SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully we will never do translations :-)
.setWorkflowId(requestCancelAttributes.getWorkflowId()) | ||
.setRunId(requestCancelAttributes.getRunId()) | ||
.build(); | ||
completionCallback.apply(null, new CancelExternalWorkflowException(execution, "", null)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't have any information as to why cancellation failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added message.
State.CANCELED, | ||
ChildWorkflowStateMachine::cancelStartChildCommand) | ||
.add( | ||
State.START_EVENT_RECORDED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't get a cancel event type here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChildWorkflow is canceled using CancelExternalStateMachine.
src/main/java/io/temporal/internal/statemachines/ChildWorkflowStateMachine.java
Show resolved
Hide resolved
return state; | ||
} | ||
} | ||
throw new IllegalStateException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we do this check before applying the callback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After any such exception, the whole StateMachines object should be thrown away.
… state-machine-review
State.SCHEDULED_ACTIVITY_CANCEL_COMMAND_CREATED, | ||
ActivityStateMachine::createRequestCancelActivityTaskCommand) | ||
.add( | ||
State.SCHEDULED_ACTIVITY_CANCEL_COMMAND_CREATED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Nit] swap the order of these two state transitions just for consistency (have the self-transition appear first)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -24,7 +24,7 @@ | |||
|
|||
/** | |||
* Defines behaviour of the parent workflow when {@link CancellationScope} that wraps child workflow | |||
* execution request is cancelled. The result of the cancellation independently of the type is a | |||
* execution request is canceled. The result of the cancellation independently of the type is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the same mess on GoSDK and server side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not in charge to approve this but it is impressive. One day we should clean up Go SDK also.
ExplicitEvent.SCHEDULE, | ||
State.REQUEST_PREPARED, | ||
LocalActivityStateMachine::sendRequest) | ||
.add(State.REQUEST_PREPARED, ExplicitEvent.GET_REQUEST, State.REQUEST_SENT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a better name we can come up with than "GET_REQUEST?" maybe "CREATE_REQUEST" or CREATEORGET_REQUEST or something like that? And how do I interpret what REQUEST_SENT means? To me that sounds like something was sent to the server, but this is a local activity, so not sure if there is a more appropriate name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to MARK_AS_SENT
.add(State.REQUEST_PREPARED, ExplicitEvent.GET_REQUEST, State.REQUEST_SENT) | ||
.add( | ||
State.REQUEST_SENT, | ||
ExplicitEvent.NON_REPLAY_WORKFLOW_TASK_STARTED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this basically means local activity execution started? or that we've started executing the workflow? is there a scenario where this same event gets invoked twice? can we come up with a more friendly name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that WORKFLOW_TASK_STARTED event for a non replay task has started. Not sure what would be the better name.
State.REQUEST_SENT) | ||
.add( | ||
State.REQUEST_SENT, | ||
ExplicitEvent.HANDLE_RESPONSE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and is this for explicitly reporting the local activity result? if so, can we have better naming for explicit event here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to HANDLE_RESULT
State.MARKER_COMMAND_RECORDED, | ||
LocalActivityStateMachine::notifyResultFromEvent) | ||
.add( | ||
// This is to cover the following edge case: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had a hard time following this. when calling getExecutionState wouldn't it still be as if the "first time" it ran? OR does the force workflowtask is started create a persistent checkpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forced workflow task is a workflow task that is recorded in the history the same way as a normal workflow task.
@@ -143,8 +144,8 @@ public Builder setMaxConcurrentLocalActivityExecutionSize( | |||
* every 10 seconds. This can be used to protect down stream services from flooding. The zero | |||
* value of this uses the default value. Default is unlimited. | |||
*/ | |||
public Builder setTaskQueueActivitiesPerSecond(double taskQueueActivitiesPerSecond) { | |||
this.taskQueueActivitiesPerSecond = taskQueueActivitiesPerSecond; | |||
public Builder setMaxTaskQueueActivitiesPerSecond(double maxTaskQueueActivitiesPerSecond) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not have negative check here? Or do we treat that as 0? Interesting that we have different behavior than above (although I guess w/different defaults that makes sense)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above option is used only locally. So validation should be done by the SDK. The maxTaskQueueActivitiesPerSecond is an option sent to the service. We made a design decision to not validate on the client any server-side options.
ChildWorkflowStateMachine::createStartChildCommand) | ||
.add( | ||
State.START_COMMAND_CREATED, | ||
CommandType.COMMAND_TYPE_START_CHILD_WORKFLOW_EXECUTION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the side effects of not defining the no-op transitions that are command based?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only defined transitions are allowed. Workflow task fails if command or event is applied to the state that doesn't have a transition defined for it.
*/ | ||
@Override | ||
public void handleCommand(CommandType commandType) { | ||
stateMachine.handleCommand(commandType, (Data) this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are events registered/validated, but commands are not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is pretty evolved. The short answer is that this validation is required to support mutable side effect and versions which can have events missing for the given state machine. So if event doesn't match then it still can be valid as it just doesn't apply to this specific state machine. The same doesn't apply to commands as they always have to match the state machine that created them.
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
State Machine Rewrite #172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
Complete rewrite of the workflow task handling state machine. It was caused by the need to fix immediate cancellation issues that weren't possible to solve using previous design.