implemented functionality in timer ack manager & timer processor base to allow failover #648
Conversation
Force-pushed from 17bf26f to ed030bd
	SetCurrentTime(clusterName string, currentTime time.Time)
}

timerProcessor interface {
Naming is quite confusing between timerQueueProcessor and timerProcessor.
Any suggestions?
service/history/shardContext.go
Outdated
if currentTime, ok := shardInfo.ClusterTimerAckLevel[clusterName]; ok {
	standbyClusterCurrentTime[clusterName] = currentTime
} else {
	standbyClusterCurrentTime[clusterName] = shardInfo.TimerAckLevel
Why are we using the current cluster ack level to initialize the standby cluster?
There is a case where a new cluster is added and there is no existing ack level for it, so we have to fall back to the current cluster's ack level.
service/history/timerGate.go
Outdated
}

// LocalTimerGate interface
LocalTimerGate interface {
Any reason Close is not needed by remote timer gate? Both local and remote timers should have the same life-cycle.
The local gate is implemented using a Go timer, which has to be closed.
The remote gate is purely driven by events; there is no background goroutine involved.
service/history/timerGate.go
Outdated
	close(timerGate.closeChan)
}

// NewRemoteTimerGate create a new timer gate instance
func NewRemoteTimerGate() RemoteTimerGate {
A lot of the implementation between Local and Remote looks the same. Can we refactor it in a way that reuses code?
I'm OK with copy/pasting code if this is temporary, but make sure to create an issue to revisit this later.
This is actually not copy/paste. Although in general they look the same, they are backed by different things: one by Go's timer, the other by a comparison against the newly reported time.
Force-pushed from df8ec34 to 4a7c840
Overall the changes look good. One thing that looks strange is that you have too much logic/if-else due to failover support.
I don't have a better recommendation at this point, but I would like you to think more about whether we can figure out a cleaner API contract rather than so many if-else statements.
})
timerQueueAckMgr := newTimerQueueAckMgr(shard, historyService.metricsClient, executionManager, clusterName, log)
func newTimerQueueStandbyProcessor(shard ShardContext, historyService *historyEngineImpl, clusterName string, logger bark.Logger) *timerQueueStandbyProcessorImpl {
	timeNow := func() time.Time {
We need the capability to trail the standby processor's current time by some configured amount, like 5 minutes.
I know. Right now there is no time delay in either the timer processor or the transfer processor; it will be added (the change is fairly small) when we actually do failover on both timer and transfer.
	now = t.shard.GetCurrentTime(t.clusterName)
} else {
	// if ack manager is a failover manager, we need to use the current local time
	now = t.shard.GetCurrentTime(t.shard.GetService().GetClusterMetadata().GetCurrentClusterName())
clusterName is stored as part of timerQueueAckMgr struct. Can we use that here?
Not really. When doing failover, the timer ack manager has:
cluster: standby cluster name
isFailover: true
The ack manager needs both of the above to update the shard timer ack level for the failover process. Here, when doing failover, we need to get the current active cluster's time to determine whether to process a timer.
@@ -93,6 +104,13 @@ func (t *timerQueueProcessorImpl) SetCurrentTime(clusterName string, currentTime
	standbyTimerProcessor.setCurrentTime(currentTime)
}

func (t *timerQueueProcessorImpl) FailoverDomain(domainID string, standbyClusterName string) {
	// we should consider make the failover idempotent
	failoverTimerProcessor := newTimerQueueFailoverProcessor(t.shard, t.historyService, domainID, standbyClusterName, t.logger)
How is the life-cycle of this cursor managed? Does someone need to call stop on it, or does it stop by itself?
The timer ack manager will have a finished channel, which is used by the active timer processor; when the channel fires, the active processor will stop itself.
@@ -224,7 +223,7 @@ func (t *timerQueueProcessorBase) internalProcessor() error {

continueProcessor:
	for {
-		now := t.shard.GetCurrentTime(t.clusterName)
+		now := t.now()
why?
I basically changed t.shard.GetCurrentTime(t.clusterName) into a closure.
Force-pushed from b3bad03 to 8197f4b
…ive timer processor
Force-pushed from dcb3011 to f08bd6b
Force-pushed from 2126445 to 7493521
… processor will try to process a standby task indefinitely, until task is finished.
Force-pushed from 7493521 to 16f058c
Force-pushed from 46d7760 to a33303d
@@ -2037,9 +2037,6 @@ func (d *cassandraPersistence) GetTimerIndexTasks(request *GetTimerIndexTasksReq

		response.Timers = append(response.Timers, t)
	}
-	nextPageToken := iter.PageState()
Any specific reason to remove pagination here? I think paginating through tasks is much more efficient than issuing a limit query each time.
Because the existing query uses a limit:
https://github.com/uber/cadence/blob/master/common/persistence/cassandraPersistence.go#L546
We can change it to use pagination later; this is not a blocking issue.
// error which will be thrown if the timer / transfer task should be
// retries due to various of reasons
taskRetryError struct{}
It is weird to use an empty struct as an error. Can you instead use errors.New?
Nope, I need to use a dedicated error so the worker can retry accordingly.
@@ -51,6 +51,10 @@ type (

	UpdateTransferAckLevel(ackLevel int64) error
	GetReplicatorAckLevel() int64
	UpdateReplicatorAckLevel(ackLevel int64) error
	GetTimerAckLevel() time.Time
timerAckLevel should consist of both visibilityTime and an increasing taskID. It seems like an issue that we only use visibilityTime to represent the ackLevel; we should use both visibilityTime and task_id for timer tasks, otherwise it could result in losing timers.
When we scan the DB, we always scan from the current ack level (inclusive) (a time.Time) to int64 max:
https://github.com/uber/cadence/blob/master/common/persistence/cassandraPersistence.go#L577
So each time we scan the DB, we will not lose timers.
maxAckLevel time.Time
// isReadFinished indicate timer queue ack manager
// have no more task to send out
isReadFinished bool
Why do you need two separate fields to indicate the cursor has reached the end? You could just close the finishedChan, and the isReadFinished API could check whether the channel is closed.
The finishedChan is used by the workers to see whether they should stop.
When the read is finished, the workers still have to process the tasks already read, which can take time.
config := shard.GetConfig()
ackLevel := shard.GetTimerAckLevel(clusterName)
func newTimerQueueAckMgr(shard ShardContext, metricsClient metrics.Client, clusterName string, logger bark.Logger) *timerQueueAckMgrImpl {
	ackLevel := TimerSequenceID{VisibilityTimestamp: shard.GetTimerClusterAckLevel(clusterName)}
We should use both the VisibilityTimestamp and TaskID to initialize ack level.
TaskID is 0 by default, which is the minimum possible value.
Moreover, we do not store the TaskID in the DB.
service/history/timerQueueAckMgr.go
Outdated
if t.isFailover && !morePage {
	t.isReadFinished = true
}

// fillin the retry task
Is the comment still relevant?
good catch
}
go t.completeTimersLoop()
Will we move to a model where deletion of timerTasks is managed by this separate cursor? Is there a way to still keep the old behavior before xdc is really turned on?
Well, the whole flow has changed.
If you really want, I can certainly do something, like adding a lot of if/else.
LoadCompleteLoop:
	for {
		request := &persistence.GetTimerIndexTasksRequest{
If we only use the timestamp to read tasks, it could result in deleting extra timer tasks which have not been processed by the active cursor yet.
Nope, that will not happen; let's sync up.
	// before shutdown, make sure the ack level is up to date
	t.completeTimers()
	return
case <-timer.C:
I'm OK with having a timer to wake up completeTimersLoop, but we should create a notification channel where each cursor can notify the completeTimersLoop when the ackLevel moves.
This is not a top-priority issue; let's solve it later.
@@ -104,3 +146,85 @@ func (t *timerQueueProcessorImpl) getTimerFiredCount(clusterName string) uint64
	}
	return standbyTimerProcessor.getTimerFiredCount()
}

func (t *timerQueueProcessorImpl) completeTimersLoop() {
Let's create a separate task to emit some metrics from this new logic. I think we are missing key metrics which would help us investigate issues in production.
sure
@@ -169,12 +197,13 @@ func (t *timerQueueStandbyProcessorImpl) processExpiredUserTimer(timerTask *pers
	//
	// we do not need to notity new timer to base, since if there is no new event being replicated
	// checking again if the timer can be completed is meaningless
-	t.timerQueueAckMgr.retryTimerTask(timerTask)
-	return nil
+	return newTaskRetryError()
Instead of creating a new error each time, let's just define this as a global variable and reuse the same error.
ok
Force-pushed from c834a97 to 7eaecd8
Force-pushed from 7eaecd8 to a3143d1
Partially solves #565