workflow/child_workflow retry #885

Merged: 2 commits into cadence-workflow:master from wf_retry on Jun 28, 2018

Conversation

yiminc-zz:

server side workflow/child_workflow retry

@yiminc-zz force-pushed the wf_retry branch 2 times, most recently from 32bdcb1 to c786a31 on June 26, 2018 20:26
@@ -97,6 +97,7 @@ const (
TaskTypeWorkflowTimeout
TaskTypeDeleteHistoryEvent
TaskTypeRetryTimer
Contributor:

Can we rename this to TaskTypeActivityRetryTimer?

@@ -349,6 +349,8 @@ struct ContinueAsNewWorkflowExecutionDecisionAttributes {
30: optional binary input
40: optional i32 executionStartToCloseTimeoutSeconds
50: optional i32 taskStartToCloseTimeoutSeconds
60: optional i32 backoffStartIntervalInSeconds
Contributor:

Probably we also want to put the failure reason and details on the ContinueAsNewEvent.

Author:

Task created to track this: #902.

@@ -417,6 +423,7 @@ struct WorkflowExecutionContinuedAsNewEventAttributes {
50: optional i32 executionStartToCloseTimeoutSeconds
60: optional i32 taskStartToCloseTimeoutSeconds
70: optional i64 (js.type = "Long") decisionTaskCompletedEventId
80: optional i32 backoffStartIntervalInSeconds
Contributor:

You have RetryPolicy on the decision but are not persisting it to the event.

Author:

It is persisted in the WorkflowExecutionStartedEvent of the next iteration.
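
For context, here is a minimal sketch of the flow being described, using simplified stand-in types rather than the actual Thrift-generated Cadence structs: the retry policy travels on the start request for the next run and is recorded in that run's WorkflowExecutionStartedEvent, not on the ContinuedAsNew event.

package main

import "fmt"

// Simplified stand-ins; the real Cadence types use optional (pointer) fields.
type RetryPolicy struct {
    InitialIntervalInSeconds int32
    BackoffCoefficient       float64
}

type startRequest struct {
    RetryPolicy *RetryPolicy
    Attempt     int32
}

type startedEventAttributes struct {
    RetryPolicy *RetryPolicy
    Attempt     int32
}

// The started event of the next iteration simply carries over the retry
// policy and the attempt count from the start request.
func newStartedEventAttributes(req *startRequest) *startedEventAttributes {
    return &startedEventAttributes{RetryPolicy: req.RetryPolicy, Attempt: req.Attempt}
}

func main() {
    req := &startRequest{RetryPolicy: &RetryPolicy{InitialIntervalInSeconds: 1, BackoffCoefficient: 2.0}, Attempt: 3}
    attrs := newStartedEventAttributes(req)
    fmt.Println(attrs.Attempt, attrs.RetryPolicy.InitialIntervalInSeconds) // 3 1
}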

@@ -417,6 +423,7 @@ struct WorkflowExecutionContinuedAsNewEventAttributes {
50: optional i32 executionStartToCloseTimeoutSeconds
60: optional i32 taskStartToCloseTimeoutSeconds
70: optional i64 (js.type = "Long") decisionTaskCompletedEventId
80: optional i32 backoffStartIntervalInSeconds
Contributor:

How is it different from retryPolicy.initialIntervalSeconds?

Author:

This is the backoff interval used to create the first schedule event for the next iteration (which may not be the first retry). The initialInterval is the backoff interval for the first retry.
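
To make the distinction concrete, here is a minimal sketch of the usual exponential-backoff computation (an illustration of the concept, not the actual Cadence server code): initialIntervalInSeconds seeds the formula for the first retry, while backoffStartIntervalInSeconds on the continue-as-new attributes carries the already-computed delay for whatever attempt the next run starts at.

package main

import (
    "fmt"
    "math"
    "time"
)

// nextRetryBackoff computes the delay before retry number `attempt` (1-based)
// using the standard exponential-backoff formula:
// initialInterval * backoffCoefficient^(attempt-1), capped at maximumInterval.
func nextRetryBackoff(initial time.Duration, coefficient float64, maximum time.Duration, attempt int) time.Duration {
    d := time.Duration(float64(initial) * math.Pow(coefficient, float64(attempt-1)))
    if maximum > 0 && d > maximum {
        d = maximum
    }
    return d
}

func main() {
    // initialIntervalInSeconds governs the very first retry...
    fmt.Println(nextRetryBackoff(time.Second, 2.0, 30*time.Second, 1)) // 1s

    // ...whereas backoffStartIntervalInSeconds on ContinueAsNew carries the
    // already-computed delay for whatever attempt the next run happens to be,
    // e.g. the 4th retry:
    fmt.Println(nextRetryBackoff(time.Second, 2.0, 30*time.Second, 4)) // 8s
}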

@@ -477,6 +479,18 @@ func (b *historyBuilder) newWorkflowExecutionStartedEvent(
attributes.ChildPolicy = request.ChildPolicy
attributes.ContinuedExecutionRunId = previousRunID
attributes.Identity = common.StringPtr(common.StringDefault(request.Identity))
attributes.RetryPolicy = request.RetryPolicy
attributes.Attempt = common.Int32Ptr(startRequest.GetAttempt())
Contributor:

We should allow Attempt and expiration to be nil.

Author:

Yes, if RetryPolicy is nil, it will just set nil on the attributes, and GetAttempt() returns 0 for the nil case.
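
The nil-safe behaviour described here follows the usual accessor pattern of generated Thrift code, where a getter returns the zero value when the optional field (or the receiver) is nil; a minimal sketch with a hypothetical type:

package main

import "fmt"

// Hypothetical stand-in for a generated request type with an optional
// (pointer) field.
type startRequest struct {
    Attempt *int32
}

// GetAttempt follows the generated-accessor convention: return the zero
// value when the field is unset, so callers never need a nil check.
func (r *startRequest) GetAttempt() int32 {
    if r != nil && r.Attempt != nil {
        return *r.Attempt
    }
    return 0
}

func main() {
    var req startRequest          // Attempt left nil, i.e. no retry in progress
    fmt.Println(req.GetAttempt()) // prints 0
}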

if request.RetryPolicy != nil && request.RetryPolicy.GetExpirationIntervalInSeconds() > 0 {
expirationInSeconds := request.RetryPolicy.GetExpirationIntervalInSeconds()
deadline := time.Unix(0, historyEvent.GetTimestamp()).Add(time.Second * time.Duration(expirationInSeconds))
attributes.ExpirationTimestamp = common.Int64Ptr(deadline.Round(time.Millisecond).UnixNano())
Contributor:

It doesn't feel right to have this logic at the history builder layer. I think Expiration should always be set either by the client or by the history engine, and passed to the history builder in the start request.

Author:

Will move to the upper layer.
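
As a rough sketch of what moving this to the upper layer could look like (illustrative names, not the actual Cadence code): the caller computes the absolute expiration deadline once from the retry policy and passes it down in the start request, so the history builder only copies the value.

package main

import (
    "fmt"
    "time"
)

// computeExpirationTimestamp converts a relative expiration interval from the
// retry policy into an absolute deadline in nanoseconds, anchored at the
// event timestamp. If there is no expiration interval, it returns nil so the
// field stays unset.
func computeExpirationTimestamp(eventTimestampNanos int64, expirationIntervalSeconds int32) *int64 {
    if expirationIntervalSeconds <= 0 {
        return nil
    }
    deadline := time.Unix(0, eventTimestampNanos).
        Add(time.Duration(expirationIntervalSeconds) * time.Second).
        Round(time.Millisecond).
        UnixNano()
    return &deadline
}

func main() {
    now := time.Now().UnixNano()
    if ts := computeExpirationTimestamp(now, 3600); ts != nil {
        fmt.Println("expires at", time.Unix(0, *ts))
    }
    fmt.Println(computeExpirationTimestamp(now, 0)) // <nil>: field left unset
}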

attributes.ExpirationTimestamp = common.Int64Ptr(deadline.Round(time.Millisecond).UnixNano())
}
} else {
attributes.ExpirationTimestamp = common.Int64Ptr(startRequest.GetExpirationTimestamp())
Contributor:

Just directly set attributes.ExpirationTimestamp = startRequest.ExpirationTimestamp

I think we need to support this field not being set at all.

Author:

Will do.
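
The point of the suggestion is that direct pointer assignment preserves the unset case; a tiny sketch with simplified types:

package main

import "fmt"

type request struct{ ExpirationTimestamp *int64 }
type attrs struct{ ExpirationTimestamp *int64 }

func main() {
    var req request // ExpirationTimestamp intentionally unset

    // Re-wrapping a getter's zero value loses the "unset" information:
    var wrapped int64 // what a nil-safe getter would return for the nil field
    a1 := attrs{ExpirationTimestamp: &wrapped}
    fmt.Println(a1.ExpirationTimestamp == nil) // false: looks like it was set to 0

    // Direct pointer assignment keeps nil as nil:
    a2 := attrs{ExpirationTimestamp: req.ExpirationTimestamp}
    fmt.Println(a2.ExpirationTimestamp == nil) // true: still unset
}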

continueAsNew.DecisionStartToCloseTimeout = di.DecisionTimeout

if newStateBuilder.GetReplicationState() != nil {
newStateBuilder.UpdateReplicationStateLastEventID(sourceClusterName, startedEvent.GetVersion(), di.ScheduleID)
Contributor:

We need to update the replication state in both cases. The regular case should use di.ScheduleID, and the retry case should still set it to startedEventID.

Author:

Added.
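
A minimal sketch of the agreed behaviour, using stand-in types and a hypothetical isRetry flag (the real code decides this from the continue-as-new attributes): the replication state is updated in both branches, with di.ScheduleID as the last event ID in the regular case and the started event ID in the retry case.

package main

import "fmt"

// Stand-ins for the mutable-state builder used in the surrounding code.
type replicationState struct{ LastEventID int64 }

type stateBuilder struct{ repl *replicationState }

func (b *stateBuilder) GetReplicationState() *replicationState { return b.repl }

func (b *stateBuilder) UpdateReplicationStateLastEventID(cluster string, version, lastEventID int64) {
    b.repl.LastEventID = lastEventID
}

func main() {
    newStateBuilder := &stateBuilder{repl: &replicationState{}}
    startedEventID, decisionScheduleID := int64(1), int64(2) // di.ScheduleID in the real code
    isRetry := true                                          // hypothetical flag for the retry branch

    // Regular continue-as-new replicates up to the scheduled decision; the retry
    // case has no decision scheduled yet, so it replicates up to the started event.
    lastEventID := decisionScheduleID
    if isRetry {
        lastEventID = startedEventID
    }
    if newStateBuilder.GetReplicationState() != nil {
        newStateBuilder.UpdateReplicationStateLastEventID("sourceCluster", 0, lastEventID)
    }
    fmt.Println(newStateBuilder.repl.LastEventID) // 1
}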


tBuilder := t.historyService.getTimerBuilder(&context.workflowExecution)
var transferTasks, timerTasks []persistence.Task
tranT, timerT, err := getDeleteWorkflowTasksFromShard(t.shard, domainID, tBuilder)
Contributor:

Can we add a test to make sure we are not creating two separate delete tasks on workflow timeout?

Author:

workflowTimeout does not go through this path.

Contributor @samarabbas left a comment:

Looks good. It seems like there are a few follow-up items we should file as xdc work for this feature.

@yiminc-zz yiminc-zz merged commit 1e0ec6c into cadence-workflow:master Jun 28, 2018
wxing1292 pushed a commit that referenced this pull request Jun 28, 2018
wxing1292 pushed a commit that referenced this pull request Jun 28, 2018
wxing1292 added a commit that referenced this pull request Jun 28, 2018
yiminc-zz added a commit to yiminc-zz/cadence that referenced this pull request Jun 29, 2018
yiminc-zz added a commit to yiminc-zz/cadence that referenced this pull request Aug 22, 2018
…cadence-workflow#910)"

This reverts commit 492faf0.

Fix continueAsNew replication issue

update doc for retry policy in idl

remove unnecessary nil check

handle WorkflowRetryTimerTask correctly on standby side
yiminc-zz added a commit to yiminc-zz/cadence that referenced this pull request Aug 22, 2018
…cadence-workflow#910)"

This reverts commit 492faf0.

Fix continueAsNew replication issue

update doc for retry policy in idl

remove unnecessary nil check

handle WorkflowRetryTimerTask correctly on standby side

update test for cassandra tools