Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle more edge case when apply events #836

Merged
merged 5 commits into from
Jun 12, 2018
Merged

Handle more edge case when apply events #836

merged 5 commits into from
Jun 12, 2018

Conversation

wxing1292
Copy link
Contributor

@wxing1292 wxing1292 commented Jun 11, 2018

  • UT for history replicator
  • Remove deadcode
  • Bugfix start & signal workflow

solve #675, #814, #791

@wxing1292 wxing1292 force-pushed the hReplictor branch 2 times, most recently from e236d8f to e19bc84 Compare June 11, 2018 21:42
@@ -37,6 +38,9 @@ import (
"github.com/uber/cadence/common/persistence"
)

var errRetry = &shared.RetryTaskError{Message: "History replicator retry error"}
var errDLQ = errors.New("History replicator DQL error")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Fix "DQL" on error message.

@@ -37,6 +38,9 @@ import (
"github.com/uber/cadence/common/persistence"
)

var errRetry = &shared.RetryTaskError{Message: "History replicator retry error"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you define all errors within the same var section. There is already one error for RetryTaskError.

@@ -77,32 +81,39 @@ func newHistoryReplicator(shard ShardContext, historyEngine *historyEngineImpl,
}

func (r *historyReplicator) ApplyEvents(request *h.ReplicateEventsRequest) (retError error) {
defer func() {
if _, ok := retError.(*shared.EntityNotExistsError); ok {
retError = errRetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catch all error could make debugging little harder. I think we should also include the actual message here. Or atleast log a warning before returning the catch all error.


// GetWorkflowExecution failed with some transient error. Return err so we can retry the task later
if _, ok := err.(*shared.EntityNotExistsError); !ok {
} else if _, ok := err.(*shared.EntityNotExistsError); !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why converting to else-if?

if _, ok := err.(*shared.EntityNotExistsError); ok {
return ErrRetryEntityNotExists
if _, ok := err.(*shared.EntityNotExistsError); !ok {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrRetryEntityNotExists error may not be needed if you remove this.

msBuilder.executionInfo.LastFirstEventID = firstEvent.GetEventId()
msBuilder.executionInfo.NextEventID = lastEvent.GetEventId() + 1
incomingVersion := firstEvent.GetVersion()
replicationState := &persistence.ReplicationState{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we should get rid of this and just call:
msBuilder.updateReplicationStateLastEventID

executionInfo.LastFirstEventID = firstEvent.GetEventId()
executionInfo.NextEventID = lastEvent.GetEventId() + 1
incomingVersion := firstEvent.GetVersion()
msBuilder.UpdateReplicationStateLastEventID("", incomingVersion, lastEvent.GetEventId())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the sourceCluster from replication task instead of empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

_, err = createWorkflow(isBrandNew, errExist.RunID)
}

requestID := uuid.New()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move all this logic into ApplyStartEvent? We can simplify this logic to deal with only other events and have an assertion that it should not be called for WorkflowExecutionStartedEvent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, i think we should keep this logic like this.

all other functions, other than ApplyReplicationTask should prepare the mutable state.
ApplyReplicationTask should apply events to mutable state. and accordingly, create workflow or update workflow.
this can be useful if say, we want to apply events from a remote cluster to recover from a disaster, or simple reply a workflow using existing events for debugging reason


// we can also use the start version
if currentLastWriteVersion > incomingVersion {
logger.Infof("Dropping replication task. Current RunID: %v, Current LastWriteVersion: %v, Incoming Version: %v.",
Copy link
Contributor

@samarabbas samarabbas Jun 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increment the 'StaleReplicationEventsCounter' counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added all 3 types:
HistoryConflictsCounter
StaleReplicationEventsCounter
BufferedReplicationTaskCounter

if rState.LastWriteVersion > incomingVersion {
// Replication state is already on a higher version, we can drop this event
// TODO: We need to replay external events like signal to the new version
logger.Warnf("Dropping stale replication task. CurrentV: %v, LastWriteV: %v, LastWriteEvent: %v, IncomingV: %v.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increment the 'StaleReplicationEventsCounter' counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

logger.Warnf("Dropping stale replication task. CurrentV: %v, LastWriteV: %v, LastWriteEvent: %v, IncomingV: %v.",
rState.CurrentVersion, rState.LastWriteVersion, rState.LastWriteEventID, incomingVersion)
return nil, nil
} else if rState.LastWriteVersion < incomingVersion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change 'else if' to 'if'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine

logger.Errorf("No ReplicationInfo Found For Previous Active Cluster. Previous Active Cluster: %v, Request Source Cluster: %v, Request ReplicationInfo: %v.",
previousActiveCluster, request.GetSourceCluster(), request.ReplicationInfo)
// TODO: Handle missing replication information, #840
return nil, ErrMissingReplicationInfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also include the existing comment for this part of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy pasted the existing one: "// Returning BadRequestError to force the message to land into DLQ"
is this the comment you are talking about?

r.metricsClient.IncCounter(metrics.ReplicateHistoryEventsScope, metrics.BufferedReplicationTaskCounter)
// Detect conflict
if ri.GetLastEventId() < rState.LastWriteEventID {
logger.Infof("Conflict detected. State: {CurrentV: %v, LastWriteV: %v, LastWriteEvent: %v}, ReplicationInfo: {PrevActiveCluster: %v, V: %v, LastEventID: %v}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increment 'HistoryConflictsCounter' counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

msBuilder.GetNextEventID(), firstEvent.GetEventId())
r.metricsClient.IncCounter(metrics.ReplicateHistoryEventsScope, metrics.BufferedReplicationTaskCounter)
// Detect conflict
if ri.GetLastEventId() < rState.LastWriteEventID {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you changed '!=' to '<'. Whats the reasoning? Can we capture that as a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
} else if ri.GetLastEventId() > rState.LastWriteEventID {
// TODO handle this case
logger.Errorf("Conflict detected. State: {CurrentV: %v, LastWriteV: %v, LastWriteEvent: %v}, ReplicationInfo: {PrevActiveCluster: %v, V: %v, LastEventID: %v}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I get this. Can you explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say if we have a bug, this case can happen, we should return some unique error and do logging

logger.Errorf("Conflict detected. State: {CurrentV: %v, LastWriteV: %v, LastWriteEvent: %v}, ReplicationInfo: {PrevActiveCluster: %v, V: %v, LastEventID: %v}",
rState.CurrentVersion, rState.LastWriteVersion, rState.LastWriteEventID,
previousActiveCluster, ri.GetVersion(), ri.GetLastEventId())
return nil, ErrMissingReplicationInfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a different error here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done
// ErrCorruptedReplicationInfo is returned when replication task has corrupted replication information from source cluster
ErrCorruptedReplicationInfo = &shared.BadRequestError{Message: "replication task is has corrupted cluster replication info"}

firstEventID := request.GetFirstEventId()
if firstEventID < msBuilder.GetNextEventID() {
// duplicate replication task
replicationState := msBuilder.GetReplicationState()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increment 'StaleReplicationEventsCounter' metric counter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all done

logger.Debugf("Dropping replication task. State: {NextEvent: %v, Version: %v, LastWriteV: %v, LastWriteEvent: %v}",
msBuilder.GetNextEventID(), replicationState.CurrentVersion, replicationState.LastWriteVersion, replicationState.LastWriteEventID)
return nil
} else if firstEventID > msBuilder.GetNextEventID() {
Copy link
Contributor

@samarabbas samarabbas Jun 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change 'else if' to 'if'

logger.Debugf("Buffer out of order replication task. NextEvent: %v, FirstEvent: %v",
msBuilder.GetNextEventID(), firstEventID)

err = msBuilder.BufferReplicationTask(request)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increment 'BufferedReplicationTaskCounter' metric counter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@samarabbas samarabbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Can you address my minor comments? It is good to land once you address those.

@wxing1292 wxing1292 merged commit 583fe1c into master Jun 12, 2018
@wxing1292 wxing1292 deleted the hReplictor branch June 12, 2018 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants