Setting up some lightweight automation for publishing docker images #1

Open
wants to merge 661 commits into master

Conversation

@Groxx Groxx (Owner) commented Apr 24, 2023

We have this in an internal wiki, but it's still quite manual, and there are a lot of steps and docker arguments and whatnot.

Since I got tired of following it by hand, and we seem to have missed some in the past, it seems worth doing some basic automation.
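
For a rough sense of what the automation needs to script, the manual flow is essentially build, tag, and push per image. Below is a minimal sketch of that flow in Go; the registry, image name, and `RELEASE_VERSION` variable are hypothetical placeholders, not the actual steps from the wiki or from this PR.

```go
// publish.go: a minimal sketch of scripting "build, tag, push" for one image.
// Everything concrete here (registry, repo name, env var) is a placeholder.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// run executes a command and streams its output, failing loudly on error.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	version := os.Getenv("RELEASE_VERSION") // e.g. "1.2.3"; assumed to be set by the caller
	if version == "" {
		log.Fatal("RELEASE_VERSION must be set")
	}
	image := "example-registry/cadence-server" // placeholder registry/repo

	steps := [][]string{
		{"docker", "build", "-t", fmt.Sprintf("%s:%s", image, version), "."},
		{"docker", "tag", fmt.Sprintf("%s:%s", image, version), image + ":latest"},
		{"docker", "push", fmt.Sprintf("%s:%s", image, version)},
		{"docker", "push", image + ":latest"},
	}
	for _, s := range steps {
		if err := run(s[0], s[1:]...); err != nil {
			log.Fatalf("step %v failed: %v", s, err)
		}
	}
}
```

In practice something like this would run in CI on a tag or release event rather than by hand, which is the point of the automation.
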

vytautas-karpavicius and others added 30 commits May 10, 2022 15:30
* Fill domainID for backwards compatibility

* Added unit test
* Log error fields as tags

* Update common/log/loggerimpl/logger.go

Co-authored-by: Steven L <stevenl@uber.com>

* Fix syntax error

* Use zap ObjectMarshaler for nested fields

Co-authored-by: Steven L <stevenl@uber.com>
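
The ObjectMarshaler commit above uses zap's standard mechanism for structured nested fields. A minimal sketch of the pattern, with a hypothetical taskInfo type and illustrative field names rather than the real log fields:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// taskInfo is a hypothetical nested field type. Implementing
// zapcore.ObjectMarshaler lets zap encode it as a structured object
// instead of falling back to reflection-based encoding.
type taskInfo struct {
	DomainID   string
	WorkflowID string
	Attempt    int64
}

func (t taskInfo) MarshalLogObject(enc zapcore.ObjectEncoder) error {
	enc.AddString("domain-id", t.DomainID)
	enc.AddString("wf-id", t.WorkflowID)
	enc.AddInt64("attempt", t.Attempt)
	return nil
}

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	// zap.Object nests the struct under the "task" key in the JSON output.
	logger.Info("processing task", zap.Object("task", taskInfo{
		DomainID:   "d1",
		WorkflowID: "wf1",
		Attempt:    3,
	}))
}
```
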
* Add logs for domain failover (uber#2359)

* Add operation name tag for domain update (uber#2359)

* Add error logs for domain update (uber#2359)

* Update logs to reuse the logger (uber#2359)

We have a user desiring this, and in general it seems like a good idea.
Activities are generally assumed to be "high cost" to lose, or at least potentially so.

Longer term, we should probably consider making this a per-domain config,
rather than something that is hardcoded for a whole cluster.  Nothing
about this seems like it would be cluster-bound.
The suspicion is that this is not actually a transient, retriable error, so it should be handled differently.
* Simplify history builder

* Removed unused methods
* Removing target-domain-not-active special-case handling

The suspicion is that this is not actually a transient, retriable error, so it should be handled differently.

* Fixing remaining non-retriable error

* Fix test
* Decouple domain cache entry from cluster metadata

* Addressing comments

* Fixing test
subhash-veluru and others added 30 commits March 13, 2023 13:33
Service name should be `worker` not `workers`

Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>
…lds are filtered (uber#5151)

* add unit tests for the filter-PII functions to check for bugs and errors when cloning

* handle nil pointers to avoid bugs and errors

* resume the changes from previous reverted branch

* use json tags to filter PII instead of hard copies (see the sketch after this list)

* Create a new struct in the unit test that contains only PII, which makes the filtered result much clearer to see.

* some clean up
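
The json-tag approach boils down to driving the filtering off struct tags rather than maintaining hand-written copies of each type. A minimal sketch of the idea, assuming a hypothetical piiTags set and request type (not the real Cadence types or the actual mechanism in uber#5151):

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
)

// piiTags is a hypothetical set of json tag names treated as PII.
var piiTags = map[string]bool{"email": true, "identity": true}

// filterPII returns a copy of the input struct with every field whose json
// tag is listed as PII zeroed out; the original value is left untouched.
func filterPII[T any](in T) T {
	out := in
	v := reflect.ValueOf(&out).Elem()
	if v.Kind() != reflect.Struct {
		return out
	}
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		tag := strings.Split(t.Field(i).Tag.Get("json"), ",")[0]
		if piiTags[tag] && v.Field(i).CanSet() {
			v.Field(i).Set(reflect.Zero(v.Field(i).Type()))
		}
	}
	return out
}

// request is an illustrative type; only the json tags matter to the filter.
type request struct {
	WorkflowID string `json:"workflow_id"`
	Email      string `json:"email"`
}

func main() {
	safe := filterPII(request{WorkflowID: "wf1", Email: "someone@example.com"})
	b, _ := json.Marshal(safe)
	fmt.Println(string(b)) // {"workflow_id":"wf1","email":""}
}
```
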
add remaining persistence stuff that goes to a shard
* added and update consistent query per shard metric

* testing pershard metric

* move sample logger into persistence metric client for cleanliness

* fix test

* fix lint

* fix test again

* fix lint

* sample logging with workflowid tag

* added domain tag to logger

* metric completed

* addressing comments

* fix lint

* Revert "fix lint"

This reverts commit 1e96944.

* fix lint second attempt

---------

Co-authored-by: Allen Chen <allenchen2244@uber.com>
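
The per-shard metric work above amounts to tagging a counter with the shard (and domain) before emitting it. A minimal sketch using uber-go/tally, with illustrative metric and tag names rather than Cadence's real ones:

```go
package main

import (
	"fmt"

	"github.com/uber-go/tally"
)

func main() {
	// A test scope stands in for the real persistence metrics client here.
	scope := tally.NewTestScope("persistence", nil)

	emitConsistentQueryPerShard := func(shardID int, domain string) {
		scope.Tagged(map[string]string{
			"shard_id": fmt.Sprintf("%d", shardID),
			"domain":   domain,
		}).Counter("consistent_query_per_shard").Inc(1)
	}

	emitConsistentQueryPerShard(7, "sample-domain")

	// The snapshot shows the counter carrying its per-shard and domain tags.
	for _, c := range scope.Snapshot().Counters() {
		fmt.Println(c.Name(), c.Tags(), c.Value())
	}
}
```
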
* ES: single interface for different ES/OpenSearch versions

* make fmt
* Elasticsearch: reduce code duplication

* address comments

---------

Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>
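
The single-interface idea is to hide the Elasticsearch v6/v7 and OpenSearch client differences behind one Go interface so callers never branch on the backend version. A sketch with illustrative method names, not the real client API:

```go
package visibility

import (
	"context"
	"encoding/json"
)

// GenericClient sketches the single-interface idea: the Elasticsearch (v6/v7)
// and OpenSearch implementations all satisfy it, so callers never care which
// backend is configured. Method names here are illustrative.
type GenericClient interface {
	CreateIndex(ctx context.Context, index string) error
	PutMapping(ctx context.Context, index string, mapping map[string]interface{}) error
	Search(ctx context.Context, index string, query map[string]interface{}) ([]json.RawMessage, error)
}

// CountOpenWorkflows depends only on the interface, so it works unchanged no
// matter which client implementation is plugged in.
func CountOpenWorkflows(ctx context.Context, c GenericClient, index string) (int, error) {
	docs, err := c.Search(ctx, index, map[string]interface{}{
		"query": map[string]interface{}{"term": map[string]interface{}{"CloseStatus": -1}},
	})
	if err != nil {
		return 0, err
	}
	return len(docs), nil
}
```
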
* Set poll interval for filebased dynamic config if not set

* update unit test
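
The poll-interval change is essentially defaulting an unset value. A sketch of the shape of it, with an illustrative default and assumed field names rather than the real dynamic config client:

```go
package dynamicconfig

import "time"

// defaultPollInterval is illustrative; the real default lives in the
// file-based dynamic config client.
const defaultPollInterval = 10 * time.Second

// FileBasedClientConfig mirrors the idea only; field names are assumptions.
type FileBasedClientConfig struct {
	Filepath     string
	PollInterval time.Duration
}

// applyDefaults fills in the poll interval when the config leaves it unset.
func (c *FileBasedClientConfig) applyDefaults() {
	if c.PollInterval <= 0 {
		c.PollInterval = defaultPollInterval
	}
}
```
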
* Initial checkin for pinot config files
… dropping queued tasks (uber#5164)

What changed?

When the domain cache returns an entity-not-found error, don't drop queued tasks, to be more conservative.

Why?

In cases when the cache is dubious, we shouldn't drop the queued tasks.
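
A sketch of the behavioural change, with a stand-in error value rather than the real persistence error type:

```go
package taskprocessing

import "errors"

// errEntityNotFound stands in for the "entity not found" error returned by the
// domain cache; the real error type lives in Cadence's persistence package.
var errEntityNotFound = errors.New("entity not found")

// resolveDomainLookupError sketches the change described above: when the
// domain cache returns entity-not-found, the queued task is now retried
// instead of being dropped.
func resolveDomainLookupError(err error) (drop, retry bool) {
	switch {
	case err == nil:
		return false, false
	case errors.Is(err, errEntityNotFound):
		return false, true // previously: drop = true
	default:
		return false, true
	}
}
```
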
* add support for TLS connections by Canary, add development config for Canary with TLS

* update README to include new config option

* remove testing config

---------

Co-authored-by: David Porter <david.porter@uber.com>
Co-authored-by: Shijie Sheng <shengs@uber.com>
Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>
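
On the Go side, enabling TLS for the canary comes down to building a tls.Config from the configured CA, certificate, and key files. A sketch with illustrative parameter names; the real canary config keys may differ:

```go
package canary

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// newTLSConfig builds a client TLS config from PEM files on disk. File
// locations would come from the canary's TLS config options.
func newTLSConfig(caFile, certFile, keyFile string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
	}, nil
}
```
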
* Remove misleading type check, Add more detailed log message

* removing debugging logging

* Handle nil update edge case

---------

Co-authored-by: allenchen2244 <102192478+allenchen2244@users.noreply.github.com>
Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>
Co-authored-by: David Porter <david.porter@uber.com>
* Adds a small test to catch issues with deadlocks
* Add thin ES clients
… (uber#5185)

* remove validation & test for add search attribute with no advanced config

- Remove validation for the Advanced Visibility Store
- Add an Advanced Visibility Config check before updating the ElasticSearch/OpenSearch mapping
- Remove the related test for 'no advanced config'

* Update CHANGELOG.md

Update CHANGELOG.md

* Add a warn-level message if we skip updating the OpenSearch/ElasticSearch mapping

* Add a warn-level message and add validSearchAttributes in development.yaml

---------

Co-authored-by: Quanzheng Long <prclqz@gmail.com>
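
A sketch of the resulting flow, with illustrative names rather than the real admin handler API: skip the mapping update and warn when advanced visibility isn't configured, otherwise proceed.

```go
package admin

import "go.uber.org/zap"

// addSearchAttributeMapping sketches the flow described above; the function
// and parameter names are illustrative, not the real admin handler API.
func addSearchAttributeMapping(advancedVisibilityConfigured bool, updateMapping func() error, logger *zap.Logger) error {
	if !advancedVisibilityConfigured {
		logger.Warn("advanced visibility store is not configured, skipping search attribute mapping update")
		return nil
	}
	return updateMapping()
}
```
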
* add shardid tag to log

* remove counter for overall scope

* fix lint
What changed?
Added a sharding layer to the NoSQL persistence stack so that Cadence can use multiple Cassandra clusters at once in a physically sharded manner.

Cadence is a heavily storage-bound system, so the load a single Cadence cluster can handle is strictly limited by the underlying storage system. Given the massive adoption of Cadence at Uber, this scale limitation forces us to create more Cadence clusters than we want to operate. This capability will let us run Cadence clusters one to two orders of magnitude larger than we have today.

Note that this feature only enables bootstrapping a brand-new cluster with multiple databases behind it. Resharding is designed but not implemented yet.

Why?
So that a Cadence cluster can be bootstrapped with multiple Cassandra clusters powering it.

How did you test it?
Added unit tests. Ran samples and tested bench tests in a staging environment.

Potential risks
Since this change significantly changes the low-level persistence logic, it can cause data loss if something goes terribly wrong.

Release notes
The change is backward compatible. Existing Cadence cluster configurations can be updated, if desired, to use the sharded NoSQL config format. However, they must continue having a single shard since Cadence still doesn't have the ability to reshard data.

Documentation Changes
There is a sample config file included in this PR that shows how to make use of the feature in a new cluster.
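
The routing idea behind the sharding layer can be sketched as mapping each history shard to one of a fixed set of database connections; the types below are illustrative, not the real Cadence persistence interfaces.

```go
package sharded

// Connection stands in for a handle to one physical Cassandra cluster/keyspace.
type Connection interface {
	Execute(query string, args ...interface{}) error
}

// shardedNoSQL holds a fixed set of database connections, with each Cadence
// history shard mapped to exactly one of them.
type shardedNoSQL struct {
	connections []Connection
}

// connectionFor maps a history shard to a physical database. Any change to
// this mapping is effectively a reshard, which is why resharding needs its
// own (not yet implemented) migration path.
func (s *shardedNoSQL) connectionFor(historyShardID int) Connection {
	return s.connections[historyShardID%len(s.connections)]
}
```
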
…ts (uber#5218)

* add tasklist traffic metrics for decision task

* add logger, remove tasklistID

* add taskListCombined

* add more fields

* add forward metric and source

* fix nil

* add tlMgr metrics

* add more metrics

* remove tlMgr metric

* only emit metrics if not sticky and not forwarded (see the sketch after this list)

* create new metrics name for better distinction

* add new emitted info

* change nil to empty string

* add domain and tasklist name tags

* add metrics for forwarded tasklist

* new metrics for activity tasks, rename metrics to allow aggregation of both types of tasks

* clean up logging

* clean up changes in emitInfoOrDebugLog()

* resolve comments

* improve some logic

* fix small error
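
A sketch of the "only emit metrics if not sticky and not forwarded" rule referenced above, using uber-go/tally with illustrative metric and tag names:

```go
package matching

import "github.com/uber-go/tally"

// emitTaskRequestMetric sketches the emit rule: sticky and forwarded requests
// are skipped, everything else is counted per domain and tasklist.
func emitTaskRequestMetric(scope tally.Scope, domain, taskList string, isSticky, isForwarded bool) {
	if isSticky || isForwarded {
		return
	}
	scope.Tagged(map[string]string{
		"domain":   domain,
		"tasklist": taskList,
	}).Counter("request_per_tasklist").Inc(1)
}
```
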
We have this in an internal wiki, but it's still quite manual, and there
are a lot of steps.
Some lightweight automation seems worth adopting.