Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StateMachine rewrite #172

Merged
merged 31 commits into from
Aug 14, 2020
Merged

Conversation

mfateev
Copy link
Member

@mfateev mfateev commented Aug 7, 2020

Complete rewrite of the workflow task handling state machine. It was caused by the need to fix immediate cancellation issues that weren't possible to solve using previous design.

@@ -49,7 +49,7 @@
next.newCall(method, callOptions)) {
@Override
public void sendMessage(ReqT message) {
log.trace("Invoking " + method.getFullMethodName() + "with input: " + message);
log.trace("Invoking \"" + method.getFullMethodName() + "\" with input: " + message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] dumb question (based on similar change above), when do we use single quote (') vs double quote (") for these types of messages?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guideline here. I would support coming up with one and making sure that all the log messages obey it.

@@ -42,13 +42,19 @@
TEMPORAL_METRICS_PREFIX + "workflow_task_schedule_to_start_latency";
public static final String WORKFLOW_TASK_EXECUTION_LATENCY =
TEMPORAL_METRICS_PREFIX + "workflow_task_execution_latency";
// Total latency of a workflow task which can include multiple synchronous decision tasks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even with the comment, it is still not obvious to me what the difference is between execution latency and execution total latency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] also styling question on when to use "//" comment vs "/** */" comment delimiters, as is done below on lines 48, 51

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed comments.

To understand this comment you need to know about workflow task heartbeat feature. Basically a workflow task when reporting completion can set ForceCreateNewWorkflowTask property to true on RespondWorkflowTaskCompletedRequest. In this case, the new workflow task is created immediately and returned as a response to the RespondWorkflowTaskCompletedRequest call. Such call acts as a heartbeat that extends the duration of a workflow task. Thus a single total execution latency of a workflow task can be larger than latencies of multiple such synchronously created tasks.

return elapsedTime;
@Override
public String toString() {
return "ExecuteLocalActivityParameters{" + "activityTask=" + activityTask + '}';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] no single or double quote delimiters around activityTask here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also you can remove the first plus sign and concatenate the first two strings (e.g. "ExecuteLocalActivityParameters{activityTask=" + activityTask + "}"


Consumer<Exception> signalWorkflowExecution(
Functions.Proc1<Exception> signalExternalWorkflowExecution(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why some of these methods have full comments (such as startChildWorkflow), but then some like this one (signalExternalWorkflowExecution) don't

// Only fail workflow task on the first attempt, subsequent failures of the same workflow task
// should timeout. This is to avoid spin on the failed workflow task as the service doesn't
// yet increase the retry interval.
if (workflowTask.getAttempt() > 0) {
if (workflowTask.getAttempt() > 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully we added a unit test to cover this as this should have been caught when I was syncing to Alex's latest proto changes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We had a test that ensured that the worker doesn't spin on the task failure. But after I added an additional check that the task failure is reported exactly once I found that it wasn't doing so. Here is the new unit test and the fix. The unit test fails with attempts condition changed.

* Handles a single workflow task.
*
* @param workflowTask task to handle
* @return true if new task should be created synchronously as local activities are still running.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] what does it mean by return "true if new task should be created synchronously"? Does this mean not null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my earlier comment about force creating workflow tasks.


public WorkflowTaskResult(
List<Command> commands,
Map<String, WorkflowQueryResult> queryResults,
boolean forceCreateNewWorkflowTask,
boolean finalCommand) {
boolean finalCommand,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we swapped the order to two arguments with the same type...hopefully all the callers have been updated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it is error-prone. Changed WorkflowTaskResult to use the Builder pattern.

import io.temporal.api.enums.v1.CommandType;
import io.temporal.api.enums.v1.EventType;

class ActionOrEventType<Action> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of class implies two possible configurations, but it looks like its ActionOrEventTypeOrCommandType (three)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored StateMachine nomenclature to match the standard one.

.add(
State.SCHEDULE_COMMAND_CREATED,
CommandType.COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK,
State.SCHEDULE_COMMAND_CREATED)
Copy link
Member

@mastermanu mastermanu Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a typo / copy paste error (SCHEDULE_COMMAND_CREATED twice)? Maybe I'm interpreting this incorrectly? I assumed first line is current state, subsequent line is potential transition, second line is final state and last line is an additional callback to run. Or should we add another state between "CREATED" and "SCHEDULE_COMMAND_CREATED"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is practically saying that this event is not causing a state transition. We could introduce another event here, but it doesn't really bring any value as no action is executed.

State.STARTED_ACTIVITY_CANCEL_COMMAND_CREATED,
ActivityStateMachine::createRequestCancelActivityTaskCommand)
.add(
State.STARTED_ACTIVITY_CANCEL_COMMAND_CREATED,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this state and the cancel_command_Created state, we don't have transitions for a timed out event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added unit tests that demonstrate that these transitions are not possible. The basic idea is that the workflow code always executes in the context of an event loop. It causes activity timeouts to be delivered before the code that causes the timeout which generates the cancel command.

SCHEDULED_ACTIVITY_CANCEL_EVENT_RECORDED,
STARTED_ACTIVITY_CANCEL_COMMAND_CREATED,
STARTED_ACTIVITY_CANCEL_EVENT_RECORDED,
COMPLETED_CANCEL_REQUESTED,
Copy link
Member

@mastermanu mastermanu Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where are these 3 *_CANCEL_REQUESTED states used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Removed.

Failure.newBuilder()
.setActivityFailureInfo(failureInfo)
.setCause(timedOut.getFailure())
.setMessage("Activity task timedOut")
Copy link
Member

@mastermanu mastermanu Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have string constants file for these error messages so that its easy for a technical writer to review or so we can do translations (in the future)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you file an issue for this? I think it is a great idea, but we want to do it holistically with the whole SDK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we will never do translations :-)

.setWorkflowId(requestCancelAttributes.getWorkflowId())
.setRunId(requestCancelAttributes.getRunId())
.build();
completionCallback.apply(null, new CancelExternalWorkflowException(execution, "", null));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have any information as to why cancellation failed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added message.

State.CANCELED,
ChildWorkflowStateMachine::cancelStartChildCommand)
.add(
State.START_EVENT_RECORDED,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't get a cancel event type here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChildWorkflow is canceled using CancelExternalStateMachine.

return state;
}
}
throw new IllegalStateException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we do this check before applying the callback?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After any such exception, the whole StateMachines object should be thrown away.

State.SCHEDULED_ACTIVITY_CANCEL_COMMAND_CREATED,
ActivityStateMachine::createRequestCancelActivityTaskCommand)
.add(
State.SCHEDULED_ACTIVITY_CANCEL_COMMAND_CREATED,
Copy link
Member

@mastermanu mastermanu Aug 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] swap the order of these two state transitions just for consistency (have the self-transition appear first)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -24,7 +24,7 @@

/**
* Defines behaviour of the parent workflow when {@link CancellationScope} that wraps child workflow
* execution request is cancelled. The result of the cancellation independently of the type is a
* execution request is canceled. The result of the cancellation independently of the type is a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the same mess on GoSDK and server side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@alexshtin alexshtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not in charge to approve this but it is impressive. One day we should clean up Go SDK also.

ExplicitEvent.SCHEDULE,
State.REQUEST_PREPARED,
LocalActivityStateMachine::sendRequest)
.add(State.REQUEST_PREPARED, ExplicitEvent.GET_REQUEST, State.REQUEST_SENT)
Copy link
Member

@mastermanu mastermanu Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a better name we can come up with than "GET_REQUEST?" maybe "CREATE_REQUEST" or CREATEORGET_REQUEST or something like that? And how do I interpret what REQUEST_SENT means? To me that sounds like something was sent to the server, but this is a local activity, so not sure if there is a more appropriate name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to MARK_AS_SENT

.add(State.REQUEST_PREPARED, ExplicitEvent.GET_REQUEST, State.REQUEST_SENT)
.add(
State.REQUEST_SENT,
ExplicitEvent.NON_REPLAY_WORKFLOW_TASK_STARTED,
Copy link
Member

@mastermanu mastermanu Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this basically means local activity execution started? or that we've started executing the workflow? is there a scenario where this same event gets invoked twice? can we come up with a more friendly name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that WORKFLOW_TASK_STARTED event for a non replay task has started. Not sure what would be the better name.

State.REQUEST_SENT)
.add(
State.REQUEST_SENT,
ExplicitEvent.HANDLE_RESPONSE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and is this for explicitly reporting the local activity result? if so, can we have better naming for explicit event here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to HANDLE_RESULT

State.MARKER_COMMAND_RECORDED,
LocalActivityStateMachine::notifyResultFromEvent)
.add(
// This is to cover the following edge case:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had a hard time following this. when calling getExecutionState wouldn't it still be as if the "first time" it ran? OR does the force workflowtask is started create a persistent checkpoint

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forced workflow task is a workflow task that is recorded in the history the same way as a normal workflow task.

@@ -143,8 +144,8 @@ public Builder setMaxConcurrentLocalActivityExecutionSize(
* every 10 seconds. This can be used to protect down stream services from flooding. The zero
* value of this uses the default value. Default is unlimited.
*/
public Builder setTaskQueueActivitiesPerSecond(double taskQueueActivitiesPerSecond) {
this.taskQueueActivitiesPerSecond = taskQueueActivitiesPerSecond;
public Builder setMaxTaskQueueActivitiesPerSecond(double maxTaskQueueActivitiesPerSecond) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not have negative check here? Or do we treat that as 0? Interesting that we have different behavior than above (although I guess w/different defaults that makes sense)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above option is used only locally. So validation should be done by the SDK. The maxTaskQueueActivitiesPerSecond is an option sent to the service. We made a design decision to not validate on the client any server-side options.

ChildWorkflowStateMachine::createStartChildCommand)
.add(
State.START_COMMAND_CREATED,
CommandType.COMMAND_TYPE_START_CHILD_WORKFLOW_EXECUTION,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the side effects of not defining the no-op transitions that are command based?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only defined transitions are allowed. Workflow task fails if command or event is applied to the state that doesn't have a transition defined for it.

*/
@Override
public void handleCommand(CommandType commandType) {
stateMachine.handleCommand(commandType, (Data) this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are events registered/validated, but commands are not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is pretty evolved. The short answer is that this validation is required to support mutable side effect and versions which can have events missing for the given state machine. So if event doesn't match then it still can be valid as it just doesn't apply to this specific state machine. The same doesn't apply to commands as they always have to match the state machine that created them.

@mfateev mfateev merged commit d357d79 into temporalio:master Aug 14, 2020
@mfateev mfateev deleted the state-machine-review branch August 14, 2020 17:28
Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this pull request Jul 7, 2022
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this pull request Jul 7, 2022
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this pull request Jul 7, 2022
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
Spikhalskiy added a commit to Spikhalskiy/java-sdk that referenced this pull request Jul 7, 2022
State Machine Rewrite temporalio#172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
Spikhalskiy added a commit that referenced this pull request Jul 7, 2022
State Machine Rewrite #172 left a couple of state machine code in the class that is not responsible for handling state transitions anymore and this code is not invoked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants