v1.0.0
We are v1.0! (with a schema upgrade)
What does this mean?!
Not much. Primarily that we are declaring "it's stable and in use" more visibly, because we continually get questions about this :) A larger public announcement / state-of-the-project is in the works.
Importantly, v1.0 does not imply any change to backwards compatibility (the minimum supported client version has not changed), RPC compatibility (ditto, all changes are backwards compatible), or Go API compatibility (this is not truly a library, so Go API compatibility is not a goal).
Going by previous version patterns, this would have been labeled v0.26.0 as it is a relatively incremental change (plus schema changes) from v0.25.0. As such, some strings still reference "0.26", because this older SHA is the one we have been using the most internally.
These strings will be updated and validated soon, and will likely be released as v1.0.1. This should have no behavioral impact at all, but will be visible in metrics, logs, and display strings.
What do I need to do to upgrade?
Schema upgrades needed
There have been schema changes to both normal and visibility datastores, primarily to provide better data for cleanup and hot-shard detection:
- Update-time additions by @neil-xie in #4962 and #4971
- Add FirstExecutionRunID to mutable state by @Shaddoll in #5031
- Shard ID visibility additions by @allenchen2244 in #5099 and #5123
These were intentionally kept out of v0.25.0 to keep that upgrade simple, as they were not fully utilized yet.
Replication cache recommendation
We have internally disabled the replication cache (history.replicatorCacheCapacity dynamic config set to 0), due to unexpectedly large memory use under abnormal load, and you may wish to do so as well.
We did not encounter any misbehavior, and it did reduce database load as intended, but we intend to make some changes to it to estimate and constrain memory use before re-enabling.
What has changed?
At a very high level, we've been focused on:
- Internal scaling challenges, both improving bottlenecks and improving our ability to accurately identify bottlenecks
  - Many metrics, logs, and refactors are at least somewhat related to this
  - Our multi-cluster support is improved in particular, as we have been connecting clusters and moving many domains to spread load more evenly
- Database corruptions, as our Cassandra clusters have had some problems that have caused issues for months
  - Many logs, scanner, and stale-task changes are related to this, e.g. to detect and remove invalid data
- Scaling up the team
  - More changes to come!
Some loosely categorized PRs that were included follow:
Critical bugfixes (resolving issues in v0.25.0)
- Fix ndc flush buffered events by @Shaddoll in #5009
- Hotfix a replication panic causing crashes by @davidporter-id-au in #5074
- Resolve an infinite loop around impossible cron schedules by @Groxx in #5097
Parent-close-policies apply to child workflows even after they reset/continue-as-new/etc
- Update parent close policy to terminate/cancel child workflows even after continue as new by @Shaddoll in #5032
  - This requires new stored data, so it does not apply to child workflows started before this version.
Better config introspection
- Config store CLI: make value required when updating by @mantas-sidlauskas in #5089
- CLI: print all available dynamic config keys by @mantas-sidlauskas in #5090
Schemas are now available via the go module, as go:embed files
- Embed schema files by @Shaddoll in #5040
- Embed elasticsearch index templates by @Shaddoll in #5043
- Fix ES embedding by @Shaddoll in #5056
Enhancing existing metrics and logging (and more included in other PRs)
- Reduce metrics cardinality replication.TaskStore by @vytautas-karpavicius in #4981
- Add Metric Emitter, which right now emits a metric once a minute for true replication lag in nanoseconds. by @ZackLK in #4979
- Added logs for domainName empty situation by @abhishekj720 in #4987
- Improve logs for task executor by @Shaddoll in #4989
- Add domain_type and cluster_groups tags by @vytautas-karpavicius in #4990
- Introduce per domain metrics by @Shaddoll in #5012
- Improve logs for transfer task validator by @Shaddoll in #5044
- Make replication log error message better by @davidporter-id-au in #5052
- Wf version metrics by @allenchen2244 in #5041
- Add domain tag to unregistered field error by @neil-xie in #5070
- UpdateWorkflow ShardId based metrics by @allenchen2244 in #5080
- Emit workflow counts per workflow type metrics by @neil-xie in #5082
- Use zap logger when initialising dynamic config by @mantas-sidlauskas in #5081
- add 3 tags to support adding logs for every manual access by @bowenxia in #5112
- Add sample log and dynamic config for updateworkflowexecution hot shard detection by @allenchen2244 in #5120
- Add attempt-count to task processing logs, and update unit test so that it will cover deadlock by @bowenxia in #5122
Misc
- Allow docker compose to work with docker-compose-mysql.yml on M1 by @ZackLK in #4983
- Return early when there are no replication tasks by @vytautas-karpavicius in #4982
- Update Cassandra deletes to use ALL consistency level by @Shaddoll in #4984
- Make test should pass locally by @ZackLK in #4915
- Immediate replication task hydration after successful transaction by @vytautas-karpavicius in #4980
- Convert client peer resolving errors to service transient errors by @Shaddoll in #4993
- Update idls by @Shaddoll in #4997
- Fix history corruption check for workflow signaling by @Shaddoll in #4998
- Introduce a dynamic config for cassandra all consistency level delete by @Shaddoll in #5000
- Adds fix for domain ack level issue by @davidporter-id-au in #5001
- Drop dynamic config for gRPC message size by @vytautas-karpavicius in #5002
- Fix Cadence CLI by @Shaddoll in #5005
- Re-enable workflow test by @Shaddoll in #5007
- Add new unit test by @Shaddoll in #5008
- Reformatting most things for go 1.19, rebuilding go.mod tools after clean, warning about different go versions by @Groxx in #5019
- Enhance workflowDeletionTaskJitterRange to handle deletes piling up when many workflows have finished at the same time. by @ZackLK in #5020
- Feature/min initial failover version by @davidporter-id-au in #5015
- Fix Makefile OpenSearch rule name in CONTRIBUTING.md install guide, Fix OpenSearch version in dev Docker config by @charlese-instaclustr in #5004
- Decouple StateBuilder from TaskGenerator by @vytautas-karpavicius in #4991
- Removing unused code by @vytautas-karpavicius in #5024
- Use internal IndexedValueType by @Shaddoll in #5016
- Fix workflow cancellation by @Shaddoll in #5025
- Add UpdateTime to uninitialized workflow execution record and update logic to set the update time by @neil-xie in #5014
- Update DSL query to allow filtering by missing start time by @neil-xie in #5017
- test: use T.TempDir to create temporary test directory by @Juneezee in #5013
- Enable workflow corruption check for Describe and Query API by @Shaddoll in #5028
- Remove unused watchdog signal by @demirkayaender in #5029
- Add TLS ServerName as CLI option for Cadence Cassandra Tool by @sonpham96 in #5011
- Add cli tls support by @charlese-instaclustr in #5027
- Improve Cassandra errors for schema check by @mantas-sidlauskas in #5038
- Fix SignalWithStartWorkflow by @Shaddoll in #5036
- Fix error message by @ZackLK in #5045
- Making a schema tooling concrete -> interface by @davidporter-id-au in #5046
- Exposing the ability to pull CQL changesets by @davidporter-id-au in #5047
- Corrects interface by @davidporter-id-au in #5049
- Third attempt to finish exposing all of interface by @davidporter-id-au in #5050
- Optimize SQL layer supporting batch delete by @Shaddoll in #5053
- Exposes schema task by @davidporter-id-au in #5051
- Search attribute validation toggling by @charlese-instaclustr in #5033
- Do not return not exists error in history pagination function by @Shaddoll in #5054
- Delete uninitialized workflow execution record if workflow failed to start by @neil-xie in #5059
- Fix make install-schema-es-v6 and install-schema-es-v7 by @neil-xie in #5063
- change to emit wf version by @allenchen2244 in #5066
- Update dependencies by @mindaugasbarcauskas in #5065
- Fix docker image builds with an actually-reliable dependency skip by @Groxx in #5071
- Fix resurrection check for timer and activity by @Shaddoll in #5077
- Add min_event_id,max_event_id flags to admin workflow show by @Shaddoll in #5083
- Update CLI to support decoding HistoryBranch by @Shaddoll in #5069
- Add iWF link in README by @longquanzheng in #5084
- Small refactoring of taskListManger by @Shaddoll in #5091
- Small refactoring of task writer by @Shaddoll in #5092
- Small refactoring of taskReader by @Shaddoll in #5095
- Unload taskListManager by instance, not taskListID by @Shaddoll in #5101
- Create a helper function to handle ConditionFailedError by @Shaddoll in #5102
- Remove maxQPS from sql plugin documentation by @mantas-sidlauskas in #5107
- Separate liveness of task list into a dedicated entity by @Shaddoll in #5105
- Flexible / sane header forwarding by @Groxx in #5103
- [history] add domain status check in taskfilter by @shijiesheng in #5140
- [history] more cautious in deciding domain state to make decisions on dropping queued tasks by @shijiesheng in #5164
New Contributors
Full Changelog: v0.25.0...v1.0.0