
Enabling service-discovery driven shutdown of matching engine #6198

Conversation

@davidporter-id-au (Member) commented Jul 30, 2024

What changed?

Matching hosts are presently at significant risk of delayed processing or lock contention during shard ownership changes, particularly when the service-discovery change lands before the container shutdown is kicked off.

This change makes the Matching host more proactive about cleaning up and guarding against shard changes, so that it relinquishes ownership and processing of any tasklist manager it has lost, rather than fighting with the new owner by incrementing the shard counter and delaying processing.

This can happen in a few ways. One example: a host shutdown takes a while, and the new host reacts to service-discovery and takes ownership of the original shards, but runs into a task reader on the old host that is still trying to take the lock. Another is a scale-up event, where shards are stolen from an existing host.
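
Roughly, the flow looks like the sketch below. This is illustrative only: the names (ownershipWatcher, membershipChanged, shedNonOwned) are stand-ins for the real engine plumbing, not the committed code.

```go
package handler

import "log"

// Illustrative sketch only, with stand-in names: the engine listens for
// membership changes and, on each change, sheds the tasklists it no longer
// owns instead of waiting for container shutdown.
type ownershipWatcher struct {
	membershipChanged <-chan struct{} // fed by service discovery (assumed plumbing)
	shutdown          <-chan struct{} // closed when the engine stops
	shedNonOwned      func() error    // e.g. stop tasklist managers lost to another host
}

func (w *ownershipWatcher) run() {
	for {
		select {
		case <-w.membershipChanged:
			if err := w.shedNonOwned(); err != nil {
				log.Printf("failed to shed non-owned tasklists: %v", err)
			}
		case <-w.shutdown:
			return
		}
	}
}
```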

Why?

How did you test it?

This has been tested by deploying it to a couple of pre-prod environments over a few iterations, but it still needs more manual testing at this point.

Testing currently:

  • Preliminary testing in development environments
  • Unit testing
  • Manual testing in staging

Feature enablement

This feature is gated behind the boolean flag matching.enableTasklistGuardAgainstOwnershipLoss.
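
For illustration, this is roughly how the flag gates the new behaviour. EnableTasklistOwnershipGuard is the config accessor visible in the diff later in this PR; the surrounding types below are stand-ins, not the real engine structs.

```go
package handler

// Sketch only: EnableTasklistOwnershipGuard (backed by
// matching.enableTasklistGuardAgainstOwnershipLoss) short-circuits the guard,
// so with the flag off the engine behaves exactly as before.
type guardConfig struct {
	EnableTasklistOwnershipGuard func() bool
}

type guardedEngine struct {
	config guardConfig
}

// maybeShedOwnership is a hypothetical wrapper: it runs the supplied cleanup
// only when the flag is enabled.
func (e *guardedEngine) maybeShedOwnership(shed func() error) error {
	if !e.config.EnableTasklistOwnershipGuard() {
		return nil // flag off: no ownership-driven shutdowns
	}
	return shed()
}
```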

@davidporter-id-au changed the title from "WIP commit" to "Enabling service-discovery driven shutdown of matching engine" on Jul 30, 2024
}

func (e *matchingEngineImpl) Stop() {
	close(e.shutdown)

Member:
we're not waiting for subscribeToMembershipChanges to complete, possibly goroutine leak?

Member Author:
This is true: I was too lazy and didn't think to add a full waitgroup setup to the engine. What do you think? It does mean that I didn't enable it for the leak-detector.

Member:
+1 for not leaving goroutines behind. let's wait here until subscribeToMembershipChanges returns (via waitgroup)

Member Author:
Ok, I don't love adding a bunch of complexity back to the shutdown process, but fair. Added a waitgroup.
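
For reference, a minimal sketch of what that looks like (shutdownWG is an assumed field name; Stop, shutdown, and subscribeToMembershipChanges come from the diff above, the rest is illustrative):

```go
package handler

import "sync"

// Minimal sketch of the waitgroup approach: Stop blocks until the
// membership-subscription goroutine has returned, so it is not leaked.
type matchingEngineImpl struct {
	shutdown   chan struct{}
	shutdownWG sync.WaitGroup
}

func newMatchingEngineImpl() *matchingEngineImpl {
	return &matchingEngineImpl{shutdown: make(chan struct{})}
}

func (e *matchingEngineImpl) Start() {
	e.shutdownWG.Add(1)
	go func() {
		defer e.shutdownWG.Done()
		e.subscribeToMembershipChanges()
	}()
}

func (e *matchingEngineImpl) Stop() {
	close(e.shutdown)   // signals subscribeToMembershipChanges to exit
	e.shutdownWG.Wait() // waits for it before finishing shutdown
}

// subscribeToMembershipChanges has a placeholder body here; the real loop
// reacts to ring changes until the shutdown channel is closed.
func (e *matchingEngineImpl) subscribeToMembershipChanges() {
	<-e.shutdown
}
```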

@jakobht (Member) left a comment:

Looks good, I like this approach a lot! Some minor comments

service/matching/handler/engine.go (outdated; resolved)
service/matching/handler/engine.go (outdated; resolved)
service/matching/handler/engine.go (resolved)
common/errors/taskListNotOwnedByHostError.go (resolved)
service/matching/handler/membership.go (outdated; resolved)
service/matching/handler/engine.go (outdated; resolved)
davidporter-id-au and others added 4 commits August 8, 2024 00:02
Co-authored-by: Jakob Haahr Taankvist <jht@uber.com>
Co-authored-by: Jakob Haahr Taankvist <jht@uber.com>
@@ -84,6 +85,7 @@ type MultiringResolver struct {
	status int32

	provider PeerProvider
	mu       sync.Mutex
	rings    map[string]*ring

Member Author:
since this is the first instance of multiple accessors to the rings map, it needs guards
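
In other words, every accessor of the rings map now takes the mutex first. A self-contained sketch of the pattern (getRing and the stub types here are illustrative; the mu and rings fields mirror the diff above):

```go
package membership

import "sync"

// Stand-ins so this sketch compiles on its own; the real types live elsewhere.
type PeerProvider interface{}
type ring struct{}

type MultiringResolver struct {
	status   int32
	provider PeerProvider
	mu       sync.Mutex // guards rings: lookups and membership refreshes now run concurrently
	rings    map[string]*ring
}

// getRing is an illustrative accessor: every read or write of the rings map
// goes through mu.
func (rpo *MultiringResolver) getRing(service string) (*ring, bool) {
	rpo.mu.Lock()
	defer rpo.mu.Unlock()
	r, ok := rpo.rings[service]
	return r, ok
}
```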


codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 93.10345% with 8 lines in your changes missing coverage. Please review.

Project coverage is 72.92%. Comparing base (de281a6) to head (25e649b).
Report is 6 commits behind head on master.

Files Patch % Lines
service/matching/handler/membership.go 87.09% 4 Missing and 4 partials ⚠️
Additional details and impacted files
Files Coverage Δ
common/membership/resolver.go 80.24% <100.00%> (+1.86%) ⬆️
service/matching/config/config.go 100.00% <100.00%> (ø)
service/matching/handler/engine.go 79.94% <100.00%> (+0.93%) ⬆️
service/matching/handler/membership.go 87.09% <87.09%> (ø)

... and 7 files with indirect coverage changes


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

service/matching/handler/engine.go (outdated; resolved)

service/matching/handler/membership.go (resolved)
Comment on lines 1211 to 1212
// can take a while, so do them in parallel so as to not hold the
// lock too long.

Member:
which lock is being held here? If this comment belongs to a previous iteration then let's just stop the tasklists synchronously. Better to not return from shutDownNonOwnedTasklists while some background goroutines are still stopping task lists. shutDownNonOwnedTasklists may be called again via ticker and might try to do the same/overlapping work and cause unexpected problems.

Member Author:
It does refer to an earlier iteration where it was locking, so let me fix.

My concern with doing it serially is that the net time to shut down will be quite slow; the tlmgr.Close() method does a lot of IO. What about throwing in a waitgroup and doing it in parallel?

Member:
Doing it in parallel should be fine, I think; just avoid leaving goroutines behind before returning.
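
For concreteness, a sketch of the pattern being agreed on here (tasklistManager is a stand-in interface with only the Close method mentioned above; the helper name is illustrative):

```go
package handler

import "sync"

// tasklistManager is a stand-in for the real tasklist manager; only Close
// matters for this sketch.
type tasklistManager interface {
	Close()
}

// stopTasklistsInParallel sketches the approach: Close each no-longer-owned
// manager concurrently (Close does a lot of IO), but do not return until
// every one of them has finished, so an overlapping ticker invocation cannot
// race with shutdowns still in flight.
func stopTasklistsInParallel(noLongerOwned []tasklistManager) {
	var wg sync.WaitGroup
	for _, m := range noLongerOwned {
		wg.Add(1)
		go func(m tasklistManager) {
			defer wg.Done()
			m.Close()
		}(m)
	}
	wg.Wait()
}
```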

service/matching/handler/engine.go (outdated; resolved)
service/matching/handler/engine.go (outdated; resolved)
	if !e.config.EnableTasklistOwnershipGuard() {
		return nil
	}
	noLongerOwned, err := e.getNonOwnedTasklistsLocked()

Member:
Preemptively stopping tasklists based on the service-discovery result might cause a "freeze" on a tasklist whose new owner hasn't claimed it yet. Ideally the old host keeps serving it until the new owner claims the lease. Is that something we can easily check from the DB? (I guess not.)

Member Author:
The duration of that wait I would expect to be the startup time of a shard plus the latency of service-discovery propagation (internally for us, 2 seconds). I do think it's a slight risk, but that's not far off the duration of an LWT, so I'm not sure doing a CAS operation to poll or something would be worthwhile.

Member:
I think this change should perform better than what we have right now, but the ideal implementation should also confirm that the new owner is active. We can try that as a follow-up. Do you have plans to make the simulation framework support ring-change scenarios?

Member Author:
I don't have plans to do so personally, as I need to move on to a different project, but I do think that's a good idea.

@davidporter-id-au merged commit b1c923e into cadence-workflow:master on Aug 15, 2024
19 checks passed
@davidporter-id-au deleted the bugfix/enabling-shutdown-through-service-discovery branch on August 15, 2024 00:21