-
Notifications
You must be signed in to change notification settings - Fork 804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor/removing cross cluster feature #6121
Refactor/removing cross cluster feature #6121
Conversation
…fixed up missing attributed for parent closed policy
Codecov ReportAttention: Patch coverage is
Additional details and impacted files
... and 42 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
@@ -61,10 +61,9 @@ var ( | |||
) | |||
|
|||
var ( | |||
errUnknownTransferTask = errors.New("unknown transfer task") | |||
errWorkflowBusy = errors.New("unable to get workflow execution lock within specified timeout") | |||
errTargetDomainNotActive = errors.New("target domain not active") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Removing this category of error since it doesn't make sense without cross-cluster feature.
I think the removal of this will make domain not active errors behave correctly during failover
// expected error, no-op | ||
break | ||
default: | ||
if _, ok := err.(*types.EntityNotExistsError); !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this should be using errors.As/Is, however, i'm copy+pasta reverting here,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still reviewing. will complete Monday
crossClusterPQS, | ||
crossClusterPQSEncoding, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UpdateShard can set these fields to nil/empty so the existing shard records get leaner. And then in a follow up PR get rid of these completely.
Alternatively we can write a script to do that. Up to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to do the persistence changes in a separate PR because they're going to also be quite large, but should't affect any application code changes. They're also high-risk to change, so I didn't want to add to an already quite big PR
Pull Request Test Coverage Report for Build 019009dd-83db-4f9b-9198-5b8b217ab621Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be safe to delete. If there is still some dangling code related to the cross-cluster feature, it should be safe to clean up afterward.
@@ -264,6 +265,7 @@ const ( | |||
TransferTaskTransferTargetRunID = "30000000-0000-f000-f000-000000000002" | |||
// CrossClusterTaskDefaultTargetRunID is the the dummy run ID for cross-cluster tasks of types | |||
// that do not have a target workflow | |||
// This is deprecated as of May 24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Deprecated: This is deprecated as of May 24
See https://go.dev/wiki/Deprecated
That is supported by IDEs and such.
appreciate the review, it's annoyingly large. Yeah, this is basically roughly only half of the changes, I've not touched anything in persistence yet. I expect there'll be a few other bits dangling as well |
Pull Request Test Coverage Report for Build 01905267-e947-48a2-a33f-1ee19581946dDetails
💛 - Coveralls |
Pull Request Test Coverage Report for Build 019052cc-e584-4973-80fd-d356acfcec68Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 019056c1-7321-41c4-9414-556ee6511194Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 01905720-e130-46fd-aa35-96725d16add5Details
💛 - Coveralls |
What changed?
This mostly* removes the cross-cluster feature.
Background
The Cross-cluster feature was the ability to launch and interact with child workflows in another domain. It included the ability to start child workflows and signal them. The feature allowed child workflows to be launched in the target domain even if it was active in another region.
Problems
The feature itself was something that very very few of our customers apparently needed, with very few customers interested in the problem of launching child workflows in another cluster, and zero who weren’t able to simply use an activity to make an RPC call to the other domain as one would with any normal workflow.
The feature-itself was quite resource intensive: It was pull-based; spinning up a polling stack which polled the other cluster for work, similar to the replication stack. This polling behaviour made the latency characteristics fairly unpredictable and used considerable DB resources, to the point that we just turned it off. The Uber/Cadence team resolved that were there sufficient demand for the feature in the future, a push based mechanism would probably be significantly preferable.
The feature itself added a nontrivial amount of complexity to the codebase in a few areas such as task processing and domain error handling which introduced difficult to understand bugs such as the child workflow dropping error #5919
Decision to deprecate and alternatives
As of releases June 2024, the feature will be removed. The Cadence team is not aware of any users of the feature outside Uber (as it was broken until mid 2021 anyway), but as an FYI, it will cease to be available.
If this behaviour is desirable, an easy workaround is as previously mentioned: Use an activity to launch or signal the workflows in the other domain and block as needed.
PR details
This is a fairly high-risk refactor so it'll take some time to land. Broadly it:
Notable callouts
-It probably contributes to errors between parent/child workflows just due to the sheer complexity of the code added, this is large simplification.
Testing
This is a pretty high risk change and the bar for testing should be fairly high, so I'll update the manual testing in this table as it's done:
there's obviously a bunch more possibilities with continue-as-new here too, but at a certain point I'm giong to have to rely on automation. There's been extremely little changes to the integration tests