Refactor/removing cross cluster feature #6121

@davidporter-id-au davidporter-id-au commented Jun 6, 2024

What changed?
This mostly* removes the cross-cluster feature.

Background

The Cross-cluster feature was the ability to launch and interact with child workflows in another domain. It included the ability to start child workflows and signal them. The feature allowed child workflows to be launched in the target domain even if it was active in another region.

Problems

Very few of our customers appear to have needed the feature: few were interested in launching child workflows in another cluster, and none were unable to simply use an activity to make an RPC call to the other domain, as one would with any normal workflow.
The feature was quite resource-intensive. It was pull-based, spinning up a polling stack that polled the other cluster for work, similar to the replication stack. This polling made the latency characteristics fairly unpredictable and used considerable DB resources, to the point that we just turned it off. The Uber/Cadence team concluded that, were there sufficient demand for the feature in the future, a push-based mechanism would probably be significantly preferable.
The feature added a nontrivial amount of complexity to the codebase in areas such as task processing and domain error handling, which introduced difficult-to-understand bugs such as the child workflow dropping error #5919

Decision to deprecate and alternatives

As of the June 2024 releases, the feature will be removed. The Cadence team is not aware of any users of the feature outside Uber (it was broken until mid-2021 anyway), but as an FYI, it will cease to be available.

If this behaviour is desirable, an easy workaround is, as previously mentioned, to use an activity to launch or signal the workflows in the other domain and block as needed.

PR details

This is a fairly high-risk refactor so it'll take some time to land. Broadly it:

  • Entirely removes the cross-cluster feature and behaviour from workflow execution
  • Leaves the API, Enums and persistence layer untouched. The intention is that a followup PR will remove the persistence-layer parts of the Cross-cluster feature.

Notable callouts

  • This likely fixes a few bugs around failovers, as the current cross-cluster behaviour treats domain-not-active errors as errors to swallow, which is a clear race condition
    - The cross-cluster code probably contributes to errors between parent/child workflows just due to the sheer complexity of the code added; this is a large simplification.

Testing

This is a pretty high risk change and the bar for testing should be fairly high, so I'll update the manual testing in this table as it's done:

| Test | Status |
|---|---|
| Checking a simple hello world workflow | passed |
| Simple parent/child workflow | passed |
| Parent close policy - cancel child wf | fixed/passed |
| Parent close policy - terminate child wf | fixed/passed |
| Parent close policy - abandon child wf | fixed/passed |
| Child wf closing - completion | passed |
| Child wf closing - term | passed |
| Child wf closing - cancel | passed |

There are obviously a bunch more possibilities with continue-as-new here too, but at a certain point I'm going to have to rely on automation. There have been extremely few changes to the integration tests.

@davidporter-id-au davidporter-id-au changed the title Refactor/removing cross cluster Refactor/removing cross cluster feature Jun 6, 2024

codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 91.60305% with 11 lines in your changes missing coverage. Please review.

Project coverage is 72.10%. Comparing base (34cfbb3) to head (13bfc7e).

Current head 13bfc7e differs from pull request most recent head 2abd7e5

Please upload reports for the commit 2abd7e5 to get more accurate results.

Additional details and impacted files
Files Coverage Δ
common/persistence/data_manager_interfaces.go 95.48% <100.00%> (-0.03%) ⬇️
common/persistence/data_store_interfaces.go 100.00% <ø> (ø)
common/persistence/execution_manager.go 88.05% <ø> (-0.11%) ⬇️
common/persistence/metered.go 0.00% <ø> (ø)
...n/persistence/nosql/nosqlplugin/cassandra/shard.go 100.00% <ø> (ø)
common/persistence/serialization/getters.go 93.20% <ø> (-0.07%) ⬇️
common/persistence/shard_manager.go 90.19% <100.00%> (+2.92%) ⬆️
common/persistence/sql/sql_shard_store.go 96.12% <100.00%> (-0.30%) ⬇️
common/persistence/statsComputer.go 95.02% <ø> (-0.12%) ⬇️
...rvice/history/engine/engineimpl/describe_queues.go 0.00% <ø> (ø)
... and 12 more

... and 42 files with indirect coverage changes


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34cfbb3...2abd7e5. Read the comment docs.

coveralls commented Jun 6, 2024

Pull Request Test Coverage Report for Build 018fee66-47e6-45df-9b2d-f982d0e49d3a

Details

  • 118 of 134 (88.06%) changed or added relevant lines in 9 files are covered.
  • 309 unchanged lines in 23 files lost coverage.
  • Overall coverage decreased (-0.1%) to 71.137%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/persistence-tests/persistenceTestBase.go 0 2 0.0%
service/history/handler/handler.go 0 2 0.0%
service/history/task/transfer_active_task_executor.go 36 38 94.74%
common/persistence/persistence-tests/shardPersistenceTest.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
service/history/task/transfer_standby_task_executor.go 2 86.94%
common/task/parallel_task_processor.go 2 93.06%
service/history/replication/task_processor.go 2 82.76%
common/util.go 2 91.84%
service/matching/tasklist/task_writer.go 2 82.21%
common/persistence/metered.go 2 80.87%
service/history/task/fetcher.go 2 83.13%
service/history/execution/mutable_state_builder.go 3 78.26%
service/history/handler/handler.go 4 96.43%
common/persistence/wrappers/errorinjectors/utils.go 6 91.41%
Totals Coverage Status
Change from base Build 018fedb8-ecd7-4675-ba4a-3dd7f0818e3a: -0.1%
Covered Lines: 103817
Relevant Lines: 145939

💛 - Coveralls

@@ -61,10 +61,9 @@ var (
)

var (
errUnknownTransferTask = errors.New("unknown transfer task")
errWorkflowBusy = errors.New("unable to get workflow execution lock within specified timeout")
errTargetDomainNotActive = errors.New("target domain not active")
Member Author

Note: Removing this category of error since it doesn't make sense without the cross-cluster feature.

I think removing this will make domain-not-active errors behave correctly during failover.

// expected error, no-op
break
default:
if _, ok := err.(*types.EntityNotExistsError); !ok {
Member Author

Yes, this should be using errors.As/Is; however, I'm copy+pasta reverting here.

Member

@taylanisikdemir taylanisikdemir left a comment

still reviewing. will complete Monday

Comment on lines -269 to -270
crossClusterPQS,
crossClusterPQSEncoding,
Member

UpdateShard can set these fields to nil/empty so the existing shard records get leaner, and a follow-up PR can then get rid of them completely.
Alternatively, we can write a script to do that. Up to you.

Member Author

I want to do the persistence changes in a separate PR because they're also going to be quite large, but shouldn't affect any application code changes. They're also high-risk to change, so I didn't want to add to an already quite big PR.

coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 019009dd-83db-4f9b-9198-5b8b217ab621

Details

  • 118 of 134 (88.06%) changed or added relevant lines in 9 files are covered.
  • 341 unchanged lines in 26 files lost coverage.
  • Overall coverage decreased (-0.1%) to 71.095%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/persistence-tests/persistenceTestBase.go 0 2 0.0%
service/history/handler/handler.go 0 2 0.0%
service/history/task/transfer_active_task_executor.go 36 38 94.74%
common/persistence/persistence-tests/shardPersistenceTest.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
common/task/weighted_round_robin_task_scheduler.go 1 88.06%
service/matching/tasklist/db.go 2 73.23%
service/history/replication/task_processor.go 2 82.76%
common/util.go 2 91.84%
common/persistence/metered.go 2 80.87%
common/persistence/historyManager.go 2 66.67%
common/log/tag/tags.go 3 50.46%
common/persistence/nosql/nosql_task_store.go 3 85.52%
common/task/fifo_task_scheduler.go 3 84.54%
service/history/execution/mutable_state_builder.go 3 78.26%
Totals Coverage Status
Change from base Build 018fedb8-ecd7-4675-ba4a-3dd7f0818e3a: -0.1%
Covered Lines: 103640
Relevant Lines: 145776

💛 - Coveralls

Member

@3vilhamster 3vilhamster left a comment

I think it should be safe to delete. If there is still some dangling code related to the cross-cluster feature, it should be safe to clean up afterward.

@@ -264,6 +265,7 @@ const (
TransferTaskTransferTargetRunID = "30000000-0000-f000-f000-000000000002"
// CrossClusterTaskDefaultTargetRunID is the dummy run ID for cross-cluster tasks of types
// that do not have a target workflow
// This is deprecated as of May 24
Member

nit:

Deprecated: This is deprecated as of May 24

See https://go.dev/wiki/Deprecated

That is supported by IDEs and such.
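
As a sketch of the convention the linked wiki page describes: the notice is its own paragraph in the doc comment, starting with `Deprecated:`, so tooling can pick it up. `LegacyTargetRunID` and its value are hypothetical placeholders, not the constant from this diff.

```go
package main

import "fmt"

const (
	// LegacyTargetRunID is a hypothetical placeholder constant used only
	// to show the comment layout.
	//
	// Deprecated: kept only for compatibility with existing data; do not
	// use it in new code.
	LegacyTargetRunID = "placeholder-run-id"
)

func main() {
	fmt.Println(LegacyTargetRunID)
}
```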

@davidporter-id-au
Member Author

I think it should be safe to delete. If there is still some dangling code related to the cross-cluster feature, it should be safe to clean up afterward.

Appreciate the review, it's annoyingly large. Yeah, this is roughly only half of the changes; I've not touched anything in persistence yet. I expect there'll be a few other bits dangling as well.

coveralls commented Jun 26, 2024

Pull Request Test Coverage Report for Build 01905267-e947-48a2-a33f-1ee19581946d

Details

  • 118 of 134 (88.06%) changed or added relevant lines in 9 files are covered.
  • 317 unchanged lines in 24 files lost coverage.
  • Overall coverage decreased (-0.1%) to 71.412%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/persistence-tests/persistenceTestBase.go 0 2 0.0%
service/history/handler/handler.go 0 2 0.0%
service/history/task/transfer_active_task_executor.go 36 38 94.74%
common/persistence/persistence-tests/shardPersistenceTest.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
service/history/shard/context.go 1 78.13%
common/task/weighted_round_robin_task_scheduler.go 2 88.56%
common/task/parallel_task_processor.go 2 93.06%
common/persistence/metered.go 2 80.87%
service/history/queue/timer_queue_processor_base.go 3 77.87%
service/history/execution/mutable_state_builder.go 3 78.26%
service/history/task/transfer_standby_task_executor.go 4 87.35%
service/history/handler/handler.go 4 96.43%
common/task/fifo_task_scheduler.go 4 83.51%
service/frontend/api/handler.go 4 75.68%
Totals Coverage Status
Change from base Build 01903cd7-c1ac-49f3-a7a4-fe9da6c16ce7: -0.1%
Covered Lines: 104302
Relevant Lines: 146057

💛 - Coveralls

coveralls commented Jun 26, 2024

Pull Request Test Coverage Report for Build 019052cc-e584-4973-80fd-d356acfcec68

Details

  • 118 of 134 (88.06%) changed or added relevant lines in 9 files are covered.
  • 327 unchanged lines in 27 files lost coverage.
  • Overall coverage decreased (-0.1%) to 71.418%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/persistence-tests/persistenceTestBase.go 0 2 0.0%
service/history/handler/handler.go 0 2 0.0%
service/history/task/transfer_active_task_executor.go 36 38 94.74%
common/persistence/persistence-tests/shardPersistenceTest.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
service/history/shard/context.go 1 78.93%
common/task/weighted_round_robin_task_scheduler.go 2 89.05%
common/task/fifo_task_scheduler.go 2 87.63%
common/persistence/metered.go 2 80.87%
service/matching/tasklist/matcher.go 2 90.18%
service/matching/tasklist/task_reader.go 2 77.72%
service/history/execution/mutable_state_builder.go 3 78.26%
common/persistence/statsComputer.go 3 98.18%
service/history/task/transfer_standby_task_executor.go 4 87.35%
common/archiver/filestore/historyArchiver.go 4 80.95%
Totals Coverage Status
Change from base Build 01903cd7-c1ac-49f3-a7a4-fe9da6c16ce7: -0.1%
Covered Lines: 104314
Relevant Lines: 146061

💛 - Coveralls

@davidporter-id-au davidporter-id-au enabled auto-merge (squash) June 26, 2024 18:25

coveralls commented Jun 26, 2024

Pull Request Test Coverage Report for Build 019056c1-7321-41c4-9414-556ee6511194

Details

  • 118 of 134 (88.06%) changed or added relevant lines in 9 files are covered.
  • 340 unchanged lines in 25 files lost coverage.
  • Overall coverage decreased (-0.1%) to 71.394%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/persistence-tests/persistenceTestBase.go 0 2 0.0%
service/history/handler/handler.go 0 2 0.0%
service/history/task/transfer_active_task_executor.go 36 38 94.74%
common/persistence/persistence-tests/shardPersistenceTest.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
service/history/shard/context.go 1 78.13%
common/task/fifo_task_scheduler.go 2 84.54%
common/persistence/metered.go 2 80.87%
service/matching/tasklist/matcher.go 2 90.91%
service/matching/tasklist/task_reader.go 2 77.72%
service/history/task/task.go 3 84.81%
service/history/execution/mutable_state_builder.go 3 78.39%
common/persistence/statsComputer.go 3 98.18%
service/history/handler/handler.go 4 96.43%
service/history/queue/timer_queue_processor_base.go 4 77.66%
Totals Coverage Status
Change from base Build 01903cd7-c1ac-49f3-a7a4-fe9da6c16ce7: -0.1%
Covered Lines: 104279
Relevant Lines: 146061

💛 - Coveralls

@davidporter-id-au davidporter-id-au merged commit 03d9a2e into cadence-workflow:master Jun 27, 2024
19 checks passed

coveralls commented Jun 27, 2024

Pull Request Test Coverage Report for Build 01905720-e130-46fd-aa35-96725d16add5

Details

  • 118 of 134 (88.06%) changed or added relevant lines in 9 files are covered.
  • 314 unchanged lines in 27 files lost coverage.
  • Overall coverage decreased (-0.1%) to 71.443%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/persistence-tests/persistenceTestBase.go 0 2 0.0%
service/history/handler/handler.go 0 2 0.0%
service/history/task/transfer_active_task_executor.go 36 38 94.74%
common/persistence/persistence-tests/shardPersistenceTest.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
service/history/shard/context.go 1 78.13%
service/history/task/transfer_standby_task_executor.go 2 87.04%
common/task/weighted_round_robin_task_scheduler.go 2 89.05%
service/matching/tasklist/task_list_manager.go 2 76.65%
common/persistence/sql/sqlplugin/mysql/task.go 2 73.68%
common/persistence/metered.go 2 80.87%
common/membership/hashring.go 2 84.69%
service/matching/tasklist/matcher.go 2 90.91%
service/matching/tasklist/task_reader.go 2 77.72%
common/persistence/sql/sqlplugin/mysql/db.go 2 79.49%
Totals Coverage Status
Change from base Build 019056dc-98d1-4fa6-b475-a7aef51f4b90: -0.1%
Covered Lines: 104692
Relevant Lines: 146539

💛 - Coveralls

4 participants