
Task manager - task cleanup on passive side using task completer #6514

Conversation

fimanishi (Member)

What changed?
This change adds the TaskCompleter to the TaskListManager.

The TaskCompleter is used to clean up started tasks in the domain's standby cluster. For every task dispatched from the TaskReader to the TaskMatcher, the TaskCompleter checks whether the current cluster is the standby. If it is not (i.e., it is the active cluster), the task is dispatched to MustOffer in the TaskMatcher as it normally is. If it is the standby cluster, the TaskCompleter fetches the WorkflowExecution using the DescribeWorkflowExecution history endpoint and checks whether the task has been started in the active cluster. If it has not, it retries. If it has, it marks the task completed the same way the active side currently does, which advances the ackLevel and triggers garbage collection.
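A rough sketch of that dispatch-interception flow is below. It is an illustration only: the helpers `isDomainActive`, `isTaskStarted`, and `completeTask` are hypothetical stand-ins for the PR's real dependencies (domain cache, history client, ack manager).

```go
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// Sentinel errors matching the ones visible in the diff excerpts below.
var (
	errDomainIsActive = errors.New("domain is active in the current cluster")
	errTaskNotStarted = errors.New("task not yet started in the active cluster")
)

// taskCompleter is a sketch; the function fields are hypothetical stand-ins
// for the real dependencies.
type taskCompleter struct {
	isDomainActive func(domain string) (bool, error)
	isTaskStarted  func(ctx context.Context, taskID int64) (bool, error)
	completeTask   func(taskID int64) error
}

// CompleteTaskIfStarted intercepts every task dispatched from the TaskReader
// to the TaskMatcher, per the description above.
func (tc *taskCompleter) CompleteTaskIfStarted(ctx context.Context, domain string, taskID int64) error {
	active, err := tc.isDomainActive(domain)
	if err != nil {
		return fmt.Errorf("unable to determine if domain is active: %w", err)
	}
	if active {
		// Active cluster: the caller falls through to TaskMatcher.MustOffer.
		return errDomainIsActive
	}

	// Standby cluster: ask history (DescribeWorkflowExecution) whether the
	// active cluster has already started this task.
	started, err := tc.isTaskStarted(ctx, taskID)
	if err != nil {
		return fmt.Errorf("unable to fetch workflow execution from the history service: %w", err)
	}
	if !started {
		return errTaskNotStarted // the dispatcher retries the task later
	}

	// Completing the task advances the ackLevel, which lets the garbage
	// collector range-delete everything at or below it.
	return tc.completeTask(taskID)
}
```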

Why?
Cadence does not process activity/decision tasks on a domain's standby cluster; it only processes query tasks. That's by design and works as intended, but it has a side effect: tasks added to the tasks table rely on their TTL to be removed from it. This side effect contributes to:

  • large partitions on the domain's standby cluster. Not only is this a storage cost, it can also prevent domains from failing over due to very large partition sizes
  • too many calls to the database after a failover, which incurs database latency as well as workflow completion latency after a failover

These changes aim to actively complete (remove from the database) tasks that have already been started in the active cluster. This is done sequentially, the same way the active cluster does it, leveraging the advancing ackLevel and performing range deletions using the garbage collector, which minimizes database calls and possible performance issues (such as tombstones in Cassandra).
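To make the ackLevel/range-deletion point concrete, here is a toy model of the mechanism (not Cadence's actual ackManager): completions are recorded individually, the ackLevel advances across the contiguous completed prefix, and a single range delete then removes everything at or below the watermark.

```go
// ackWatermark is a toy model of the ackLevel idea described above.
type ackWatermark struct {
	ackLevel int64          // highest taskID such that all tasks <= it are done
	acked    map[int64]bool // completed tasks above the current ackLevel
}

// ack records a completed task and advances the watermark across the
// contiguous prefix of completed tasks.
func (a *ackWatermark) ack(taskID int64) {
	a.acked[taskID] = true
	for a.acked[a.ackLevel+1] {
		a.ackLevel++
		delete(a.acked, a.ackLevel)
	}
}

// The garbage collector can then issue one range deletion, e.g. in CQL:
//   DELETE FROM tasks WHERE ... AND task_id <= ?  -- bound to a.ackLevel
// which avoids per-row deletes and the tombstone pressure they create.
```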

How did you test it?
Created unit tests for it and tested on a multicluster setup locally.

Potential risks
The taskCompleter is inserted into the TaskListManager dispatcher, which is how all async tasks are matched to pollers. It intercepts dispatches in order to determine whether the cluster is the standby cluster. An issue here would also disturb the active cluster and would most likely interrupt the processing of async tasks.

Release notes
Active completion of already started tasks on the domain's standby cluster.

Documentation Changes

common/util.go (resolved comment thread)
}

if !errors.Is(err, errDomainIsActive) && !errors.Is(err, errTaskNotStarted) {
	tc.logger.Error("Error completing task on domain's standby cluster", tag.Error(err))
Member:
Nit: Maybe emit a metric here? This seems like a weird case that we might want to monitor
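For example, something like the following could work, reusing the scope already visible in this diff (the counter name is made up for illustration; the real definition would live with Cadence's other metric definitions in common/metrics):

```go
// Hypothetical counter for monitoring this unexpected error branch.
tc.scope.
	Tagged(metrics.DomainTag(task.domainName)).
	IncCounter(metrics.StandbyTaskCompletionErrorCounter)
```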

Member:
Context timeouts would fall into this branch, which would happen during deployments/restarts. Still fine to log in those cases.

}

tc.scope.
	Tagged(metrics.DomainTag(task.domainName)).
Member:
giga-Nit: I think we might already get these tags from the current scope, since we're using the tasklist manager's scope. If not, I think it would make sense to construct a new scope, so that we have a consistent set of dimensions for all these metrics even if we forget to include them at a call site.
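As a sketch of that suggestion, the scope could be pre-tagged once at construction time so every emission shares the same dimensions (`metrics.TaskListTag` is an assumption about the available tag helpers):

```go
// Sketch: pre-tag one scope when building the taskCompleter instead of
// tagging at each call site, so no emission can miss a dimension.
scope := tlMgr.scope.Tagged(
	metrics.DomainTag(domainName),
	metrics.TaskListTag(taskListName), // assumed tag helper
)
tc := &taskCompleter{scope: scope, logger: logger}
```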

if errors.As(err, new(*types.EntityNotExistsError)) {
	return nil
} else if err != nil {
	return fmt.Errorf("unable to fetch workflow execution from the history service: %w", err)
davidporter-id-au (Member), Nov 23, 2024:
I'm not sure I understand not-found as a special case?

Mind adding a log here with some info such as the workflow ID and so forth? This would near-certainly be a bug, but in cases where we see unstuck tasklists it's foreseeable. We probably want to know about such events, for both types of errors.
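For instance, something along these lines (a sketch; the exact tag helpers and task fields are assumptions):

```go
tc.logger.Warn("workflow execution not found while completing standby task",
	tag.WorkflowDomainName(task.domainName),
	tag.WorkflowID(task.workflowID), // assumed field
	tag.WorkflowRunID(task.runID),   // assumed field
	tag.Error(err),
)
```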

fimanishi (Member, Author):
The case here is that it could take a while for the task list manager on the standby side to actually attempt to process the task (although it's unlikely), and depending on the workflow's retention period, we could be trying to get information about a workflow that is no longer present in the database. Does that make sense?

Member:
I would put it as a warn at least. I'd be pretty surprised for something in the tasks table to outlive the average retention (7 days).

	return fmt.Errorf("unable to fetch domain from cache: %w", err)
}

if _, err = domainEntry.IsActiveIn(c.clusterMetadata.GetCurrentClusterName()); err == nil {
Member:
Let's also check the returned bool value to determine active vs. passive. The current implementation of IsActiveIn() always returns an error when it returns false, but that doesn't have to be the case in the future.
I checked the usages of IsActiveIn() on the history side, which use the bool and ignore the err.
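That is, something like (sketch):

```go
isActive, err := domainEntry.IsActiveIn(c.clusterMetadata.GetCurrentClusterName())
if err != nil {
	return fmt.Errorf("unable to determine if domain is active: %w", err)
}
if isActive {
	return errDomainIsActive
}
```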

Member:
+1

fimanishi force-pushed the task-list-manager-passive-tasks-cleanup-poc branch from 868e8dc to 749393d on November 26, 2024
fimanishi merged commit 33f755a into cadence-workflow:master on November 26, 2024 (17 checks passed)